Class JcmsTokenizer

  • All Implemented Interfaces:
    java.io.Closeable, java.lang.AutoCloseable

    public final class JcmsTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    A grammar-based tokenizer built with JFlex and based on Lucene's default ClassicTokenizer.

    This should be a good tokenizer for most European-language documents:

    • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
    • Splits words at hyphens, underscores, and dashes, unless the token contains a number.
    • Recognizes email addresses and internet hostnames as one token.
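    The three rules above can be illustrated with a rough plain-Java approximation. This is only a sketch: the class TokenRuleSketch, its simplified email pattern, and its splitting logic are hypothetical and are not the JFlex grammar that JcmsTokenizer actually uses, so edge cases will differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: approximates the documented rules with
// plain string handling. It is NOT the JFlex grammar behind JcmsTokenizer.
class TokenRuleSketch {

    // Simplified email shape (an assumption): local part, '@', dot-separated labels.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            // Drop punctuation at the edges, e.g. the comma in "Hello,".
            // A dot followed by whitespace is removed here as well.
            String core = chunk.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (core.isEmpty()) {
                continue;
            }
            if (EMAIL.matcher(core).matches()) {
                tokens.add(core); // email address kept as one token
            } else if (core.chars().anyMatch(Character::isDigit)) {
                tokens.add(core); // a digit in the token prevents splitting
            } else {
                // Split at punctuation such as '-' and '_', but keep inner dots,
                // so a hostname like www.example.com stays in one piece.
                for (String part : core.split("[^\\p{L}\\p{N}.]+")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "Hello, state-of-the-art www.example.com jane.doe@example.org SKU-42b!"));
    }
}
```

    In this sketch "state-of-the-art" is split into four words, while "SKU-42b" stays whole because it contains digits.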
    • Nested Class Summary

      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.State
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int ACRONYM  
      static int ACRONYM_DEP  
      static int ALPHANUM  
      static int APOSTROPHE  
      static int CJ  
      static int COMPANY  
      static int EMAIL  
      static int HOST  
      static int NUM  
      static java.lang.String[] TOKEN_TYPES
      String token types that correspond to token type int constants
      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
      • Fields inherited from class org.apache.lucene.analysis.TokenStream

        DEFAULT_TOKEN_ATTRIBUTE_FACTORY
    • Constructor Summary

      Constructors 
      Constructor Description
      JcmsTokenizer()
      Creates a new instance of the JcmsTokenizer.
      JcmsTokenizer(org.apache.lucene.util.AttributeFactory factory)
      Creates a new JcmsTokenizer with a given AttributeFactory
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void close()  
      void end()  
      int getMaxTokenLength()
      Retrieve the max allowed token length
      boolean incrementToken()  
      void reset()  
      void setMaxTokenLength(int length)
      Set the max allowed token length.
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        correctOffset, setReader
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • JcmsTokenizer

        public JcmsTokenizer()
        Creates a new instance of the JcmsTokenizer. Attaches the input to the newly created JFlex scanner. See http://issues.apache.org/jira/browse/LUCENE-1068
      • JcmsTokenizer

        public JcmsTokenizer(org.apache.lucene.util.AttributeFactory factory)
        Creates a new JcmsTokenizer with a given AttributeFactory
        Parameters:
        factory - the attribute factory to use
    • Method Detail

      • setMaxTokenLength

        public void setMaxTokenLength(int length)
        Set the max allowed token length. Any token longer than this is skipped.
        Parameters:
        length - the maximum token length; must be greater than zero (default is 255)
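        The skipping behaviour described here can be pictured with a small stand-alone sketch. MaxTokenLengthSketch is a hypothetical helper that works on a plain list of strings rather than on a token stream; rejecting non-positive lengths with an IllegalArgumentException is an assumption, since this page does not say how JcmsTokenizer reacts to an invalid value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JcmsTokenizer: mimics the documented
// "any token longer than the maximum is skipped" rule on a plain list.
class MaxTokenLengthSketch {

    private int maxTokenLength = 255; // documented default

    void setMaxTokenLength(int length) {
        // Assumption: invalid values are rejected with an exception.
        if (length < 1) {
            throw new IllegalArgumentException("maxTokenLength must be greater than zero");
        }
        this.maxTokenLength = length;
    }

    int getMaxTokenLength() {
        return maxTokenLength;
    }

    // Keeps tokens whose length does not exceed the limit; longer ones are
    // dropped entirely rather than truncated.
    List<String> filter(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String token : candidates) {
            if (token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

        With a limit of 5, for instance, a ten-character candidate is dropped while a five-character one is kept.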
      • getMaxTokenLength

        public int getMaxTokenLength()
        Retrieve the max allowed token length
        Returns:
        a length greater than 0, default is 255
        See Also:
        setMaxTokenLength(int)
      • incrementToken

        public final boolean incrementToken()
                                     throws java.io.IOException
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Throws:
        java.io.IOException
      • end

        public final void end()
                       throws java.io.IOException
        Overrides:
        end in class org.apache.lucene.analysis.TokenStream
        Throws:
        java.io.IOException
      • close

        public void close()
                   throws java.io.IOException
        Specified by:
        close in interface java.lang.AutoCloseable
        Specified by:
        close in interface java.io.Closeable
        Overrides:
        close in class org.apache.lucene.analysis.Tokenizer
        Throws:
        java.io.IOException
      • reset

        public void reset()
                   throws java.io.IOException
        Overrides:
        reset in class org.apache.lucene.analysis.Tokenizer
        Throws:
        java.io.IOException