Lucene NGramTokenizer

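From what I found, NGramTokenizer() in Lucene is really a character tokenizer: its grams are runs of characters, not words. If you want a gram to be a word, you need ShingleFilter() instead. ShingleFilter() works on the tokens it receives from the tokenizer, so a shingle can span punctuation that the tokenizer has dropped. For example:

InputText = “This is the best, buy it.”

For bigrams, you will get “best buy” as a bigram, even though a comma separates the two words.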
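To make the word-level behavior concrete, here is a minimal plain-Java sketch of bigram shingling. It does not use Lucene and is not how ShingleFilter() is implemented; the class and method names (WordShingleSketch, words, shingles) are made up for illustration, and the word splitting is a simplifying assumption (lowercase, split on anything that is not a letter or digit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this does not use or reproduce Lucene's ShingleFilter.
class WordShingleSketch {

    // Lowercase the text and split on anything that is not a letter or digit,
    // so punctuation such as the comma disappears before shingling.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Join every run of `size` adjacent words with a single space.
    static List<String> shingles(List<String> words, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> w = words("This is the best, buy it.");
        System.out.println(shingles(w, 2));
        // prints [this is, is the, the best, best buy, buy it]
    }
}
```

Because the comma is gone by the time the words are paired up, “best” and “buy” end up adjacent, which is the same effect seen in the Lucene results further down.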

Here is the demo I used to test this.

NGramTokenizer()

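	// Character n-gram analyzer. LocaleAnalyzer and matchVersion are defined elsewhere (not shown here).
	private static final Analyzer WV_ANALYZER = new LocaleAnalyzer(new Locale("en")) {
		@Override
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			// minGram = 1, maxGram = 2: grams are built from characters, including spaces and punctuation
			Tokenizer source = new NGramTokenizer(reader, 1, 2);
			TokenStream result = new StandardFilter(matchVersion, source);
			return new TokenStreamComponents(source, result);
		}
	};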

That’s the definition of the analyzer that uses NGramTokenizer().
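To show what “character tokenizer” means in practice, here is a small plain-Java sketch that cuts a string into fixed-size character grams. It is only an illustration and does not reproduce NGramTokenizer() itself (for example, the order in which Lucene emits grams of different sizes is not modeled); the names CharGramSketch and charGrams are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this does not use or reproduce Lucene's NGramTokenizer.
class CharGramSketch {

    // Return every substring of length n, ordered by start position.
    // Spaces and punctuation are ordinary characters here, so they appear inside grams.
    static List<String> charGrams(String text, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Brackets make leading and trailing spaces visible in the output.
        for (String gram : charGrams("buy it.", 2)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

Nothing in this loop knows about word boundaries, which is why grams such as “y ” and “ i” show up: the window simply slides over the raw characters.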

ShingleFilter()

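	// Word-level analyzer. LDATokenizer, LocaleAnalyzer, matchVersion and getStopwords() are defined elsewhere (not shown here).
	private static final Analyzer ANALYZER_NGRAM = new LocaleAnalyzer(new Locale("en")) {
		@Override
		protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
			Tokenizer source = new LDATokenizer(matchVersion, reader);
			TokenStream result = new StandardFilter(matchVersion, source);
			result = new LowerCaseFilter(matchVersion, result);
			ShingleFilter shingle = new ShingleFilter(result, 2); // NOTE: asking for bigrams (max shingle size 2)
			shingle.setOutputUnigrams(true); // also emit single words alongside the bigrams
			// Stop words are filtered after shingling, so only tokens whose whole text is a stop word are dropped
			result = new StopFilter(matchVersion, shingle, getStopwords());
			return new TokenStreamComponents(source, result);
		}
	};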

Now let’s call them:

		String text = "This is the best, buy it.";

		// TokenStream contract: reset(), incrementToken() until it returns false, then end() and close()
		TokenStream stream = ANALYZER_NGRAM.tokenStream(null, new StringReader(text));
		CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
		stream.reset();
		System.out.println("ANALYZER_NGRAM:");
		while (stream.incrementToken()) {
			System.out.println(term.toString());
		}
		stream.end();
		stream.close();

		stream = WV_ANALYZER.tokenStream(null, new StringReader(text));
		term = stream.addAttribute(CharTermAttribute.class);
		stream.reset();
		System.out.println("WV_ANALYZER:");
		while (stream.incrementToken()) {
			System.out.println(term.toString());
		}
		stream.end();
		stream.close();

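The results are as follows. Note that the WV_ANALYZER listing contains only the two-character grams, although minGram = 1 should also produce single-character grams.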

ANALYZER_NGRAM:
this is
is the
the best
best buy
buy
buy it
WV_ANALYZER:
Th
hi
is
s 
 i
is
s 
 t
th
he
e 
 b
be
es
st
t,
, 
 b
bu
uy
y 
 i
it
t.
