What is Lucene Analysis?

In one of our previous chapters, we saw that Lucene uses IndexWriter to analyze the Document(s) using the Analyzer and then creates, opens, or edits indexes as required. In this chapter, we are going to discuss the various types of Analyzer objects and other related objects that are used during the analysis process. Understanding the analysis process and how analyzers work will give you great insight into how Lucene indexes documents.

Following is the list of objects that we'll discuss in due course.

S.No. Class & Description
1 Token
A Token represents a word or piece of text in a document, together with its metadata (position, start offset, end offset, token type, and position increment).
2 TokenStream
TokenStream is the output of the analysis process and consists of a series of tokens. It is an abstract class.
3 Analyzer
This is the abstract base class for every type of Analyzer.
4 WhitespaceAnalyzer
This analyzer splits the text in a document based on whitespace.
5 SimpleAnalyzer
This analyzer splits the text in a document based on non-letter characters and puts the text in lowercase.
6 StopAnalyzer
This analyzer works just like the SimpleAnalyzer and also removes common words such as 'a', 'an', 'the', etc.
7 StandardAnalyzer
This is the most sophisticated analyzer and is capable of handling names, email addresses, etc. It lowercases each token and removes common words and punctuation, if any.
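
To make the differences between the first three analyzers concrete, here is a minimal plain-Java sketch of the behavior described above. It does not use any Lucene classes: the names AnalysisSketch, Tok, STOP_WORDS, whitespace, simple, stop, and terms are hypothetical and exist only for this illustration, and the three-word stop list is an assumption rather than Lucene's actual default list. Each Tok records the term text with its start and end offsets, loosely mirroring the metadata a Token carries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java illustration only; this is NOT Lucene code and uses no Lucene classes. */
final class AnalysisSketch {

    /** Hypothetical stand-in for a token: the term text plus its character offsets. */
    static final class Tok {
        final String text;
        final int start; // offset of the first character (inclusive)
        final int end;   // offset just past the last character (exclusive)

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumed stop-word list for this example; Lucene's own list may differ. */
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation. */
    static List<Tok> whitespace(String text) {
        return split(text, false);
    }

    /** SimpleAnalyzer-like: split on non-letter characters and lowercase each term. */
    static List<Tok> simple(String text) {
        return split(text, true);
    }

    /** StopAnalyzer-like: same as simple(), then drop the stop words. */
    static List<Tok> stop(String text) {
        List<Tok> kept = new ArrayList<>();
        for (Tok t : simple(text)) {
            if (!STOP_WORDS.contains(t.text)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Returns just the term text of each token, in order. */
    static List<String> terms(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens) {
            out.add(t.text);
        }
        return out;
    }

    /** Scans the text once, emitting a token whenever a run of token characters ends. */
    private static List<Tok> split(String text, boolean lettersOnly) {
        List<Tok> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length()
                    && (lettersOnly
                        ? Character.isLetter(text.charAt(i))
                        : !Character.isWhitespace(text.charAt(i)));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                String term = text.substring(start, i);
                tokens.add(new Tok(lettersOnly ? term.toLowerCase(Locale.ROOT) : term, start, i));
                start = -1;
            }
        }
        return tokens;
    }
}
```

For the input "The Quick-Brown fox", terms(whitespace(...)) gives [The, Quick-Brown, fox], terms(simple(...)) gives [the, quick, brown, fox], and terms(stop(...)) gives [quick, brown, fox]. In the simple case, the token "quick" has start offset 4 and end offset 9. StandardAnalyzer is left out of the sketch because its tokenization rules are considerably more involved.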

