OpenNLP Tokenization - OpenNLP

What is Tokenization?

Tokenization is known as the process of chalking out the given sentence into smaller parts (tokens). While the given raw text is tokenized as per the basis of a set of delimiters (mostly whitespaces).
Tokenization is applied across the tasks like spell-checking, processing searches, identifying parts of speech, sentence detection, document classification of documents, etc.

Tokenizing using OpenNLP

The opennlp.tools.tokenize package includes the classes and interfaces which are used to perform tokenization.
The OpenNLP library provides three different classes to tokenize the given sentences into simpler fragments −
  • SimpleTokenizer − This class tokenizes the given raw text using character classes.
  • WhitespaceTokenizer − This class uses whitespaces to tokenize the given text.
  • TokenizerME − This class converts raw text into separate tokens. It uses Maximum Entropy to make its decisions.

SimpleTokenizer

You can tokenize a sentence with the SimpleTokenizer class, you need to −
  • Create an object of the respective class.
  • Tokenize the sentence using the tokenize() method.
  • Print the tokens.
Below mentioned steps are used to follow to write a program which tokenizes the given raw text.
Step 1 − Instantiating the respective class
Here there are no constructors available to instantiate them in both classes. So we need to create objects of these classes using the static variable INSTANCE.
Step 2 − Tokenize the sentences
Both classes include a method called tokenize(). This method provides a raw text in String format. After invoking, it tokenizes the given String and provides an array of Strings (tokens).
Tokenize the sentence using the tokenizer() method as shown below.
Step 3 − Print the tokens
Once you tokenize the sentence, you can print the tokens using for loop, as mentioned below.

Example

Belwo mentioned program uses the given sentence with SimpleTokenizer class. Now save this program in a file with the name SimpleTokenizerExample.java.
Let’s compile and execute the saved Java file from the Command prompt with below commands –
Once you execute the above program reads the given String (raw text), tokenizes it, and displays the following output –

WhitespaceTokenizer

To tokenize a sentence you need to use the the WhitespaceTokenizer class, you need to −
  • Create an object of the respective class.
  • Tokenize the sentence using the tokenize() method.
  • Print the tokens.
Below mentioned steps are to be followed to write a program which tokenizes the given raw text.
Step 1 − Instantiating the respective class
You can find no constructors available in both the classes to instantiate them. So, we need to create objects of these classes using the static variable INSTANCE.
Step 2 − Tokenize the sentences
Here both these classes include a method called tokenize(). But this method accepts a raw text in String format. After invoking, it tokenizes the given String and returns an array of Strings (tokens).
Now tokenize the sentence with tokenizer() method as mentioned below.
Step 3 − Print the tokens
Once you complete the tokenizing the sentence, then you can print the tokens with for loop, as following.

Example

Below mentioned program which tokenizes the given sentence with WhitespaceTokenizer class. Let’s save this program in a file with the name WhitespaceTokenizerExample.java.
Now compile and execute the saved Java file from the Command prompt with below commands –
Once you execute the above program then it reads the given String (raw text), tokenizes it, and displays the following output.

TokenizerME class

OpenNLP also uses a predefined model to tokenize the sentences with a file named de-token.bin. It is available to tokenize the sentences in a given raw text.
It uses OpenNLP library to tokenize the given text with the TokenizerME class of the opennlp.tools.tokenizerpackage. To do so, you need to −
  • Load the en-token.bin model using the TokenizerModel class.
  • Instantiate the TokenizerME class.
  • Tokenize the sentences using the tokenize() method of this class.
Below mentioned steps are used to write a program which tokenizes the sentences from the given raw text using the TokenizerME class.
Step 1 − Loading the model
The mLoading model is notes by the class named TokenizerModel, which is part of the package opennlp.tools.tokenize.
To load a tokenizer model −
  • Create an InputStream object of the model (Instantiate the FileInputStream and pass the path of the model in String format to its constructor).
  • Instantiate the TokenizerModel class and pass the InputStream (object) of the model as a parameter to its constructor, as shown in the following code block.
Step 2 − Instantiating the TokenizerME class
The TokenizerME class of the package opennlp.tools.tokenize includes various methods to trim the raw text into smaller parts (tokens). It uses Maximum Entropy to make its decisions.
You can instantiate this class and pass the model object created as mentioned in earlier.
Step 3 − Tokenizing the sentence
The tokenize() method of the TokenizerME class is used to tokenize the raw text passed to it. Here this method provides a String variable as a parameter, and returns an array of Strings (tokens).
You can Invoke this method by passing the String format of the sentence to this method, as mentioned below.

Example

Belwo mentioned code is used to tokenize the given raw text. Let’s save this program in a file with the name TokenizerMEExample.java.
Now compile and run the saved Java file from the Command prompt with below commands –
Once you execute the above program then it reads the given String and detects the sentences and displays below output –

Retrieving the Positions of the Tokens

You can also get the positions or spans of the tokens with the tokenizePos() method. You can use this method of the Tokenizer interface of the package opennlp.tools.tokenize. Here all the (three) Tokenizer classes operates this interface among all methods.
Here you can find that this method accepts the sentence or raw text in the form of a string and returns an array of objects of the type Span.
Let’s find the positions of the tokens using the tokenizePos()method, as mentioned below –

Printing the positions (spans)

Use the class named Span of the opennlp.tools.util package to store the start and end integer of sets.
Here you can store the spans returned by the tokenizePos() method in the Span array and print them, as mentioned in the following code block.

Printing tokens and their positions together

You can use the substring() method of the String class accepts the begin and the end offsets and returns the respective string. To take a printout of tokens and their spans (positions) together you can use this method as mentioned in the following code block.

Example(SimpleTokenizer)

Let’s see below program used to retrieve the token spans of the raw text with the SimpleTokenizer class. It prints the tokens along with their positions. Now save this program in a file with named SimpleTokenizerSpans.java.
Let’s compile and execute the saved Java file from the Command prompt with following commands –
Once you execute the above program it reads the given String (raw text), tokenizes it, and displays the following output –

Example (WhitespaceTokenizer)

Below mentioned program retrieves the token spans of the raw text with the help of WhitespaceTokenizer class. It also prints the tokens along with their positions. Save this program in a file with the name WhitespaceTokenizerSpans.java.
Compile and execute the saved java file from the command prompt using the following commands
After executing the above program it reads the given String (raw text), tokenizes it, and displays the following output.

Example (TokenizerME)

Here’s the program used to retrieve the token spans of the raw text using the TokenizerME class. It highlights the prints the tokens along with their positions. Now save this program in a file with the name TokenizerMESpans.java.
Now compile and execute the saved Java file from the Command prompt with below commands –
After executing the above program reads the given String (raw text), tokenizes it, and displays the following output –

Tokenizer Probability

The getTokenProbabilities() method of the TokenizerME class is used to get the probabilities included with the most recent calls to the tokenizePos() method.
Let’s see the program to print the probabilities associated with the calls to tokenizePos() method. Now save this program in a file with the name

TokenizerMEProbs.java.

Now compile and run the saved Java file from the Command prompt using the following commands –
Once you execute the above program it reads the given String and tokenizes the sentences and prints them. It also returns the probabilities associated with the most recent calls to the tokenizerPos() method.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd DMCA.com Protection Status

OpenNLP Topics