OpenNLP Sentence Detection - OpenNLP

What is sentence detection?

Sentence detection is the task of deciding where sentences begin and end in a text. It is also known as Sentence Boundary Disambiguation (SBD) or simply sentence breaking.

The technique identifies the sentences in a given text according to the conventions of the language the text is written in.

Sentence Detection Using Java

Sentence detection is possible in Java with regular expressions and a set of simple rules.

For example, if you assume that a period, a question mark, or an exclamation mark ends a sentence, you can split the text with the split() method of the String class. The regular expression is passed to split() as a String.

A program that detects the sentences in a given text with Java regular expressions (the split() method) can be saved in a file named SentenceDetection_RE.java.
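As a rough sketch of the regular-expression approach described above, the class below splits text after a period, question mark, or exclamation mark that is followed by whitespace. The sample text, the method name splitSentences, and the lookbehind pattern are illustrative assumptions rather than the original program; a plain [.?!] pattern would also split the text but would drop the punctuation.

```java
public class SentenceDetection_RE {

    // Splits wherever whitespace follows '.', '?' or '!'.
    // The lookbehind (?<=...) keeps the punctuation attached to its sentence.
    public static String[] splitSentences(String text) {
        return text.trim().split("(?<=[.?!])\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical sample text, chosen only for illustration.
        String text = "Hi. How are you? Welcome to OpenNLP! This is a short test.";

        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

Because the rule is purely lexical, an abbreviation such as "Mr." followed by a space is also treated as the end of a sentence. That kind of error is what a trained sentence model is meant to reduce.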

Compile and execute the saved Java file from the command prompt with javac SentenceDetection_RE.java followed by java SentenceDetection_RE.

When executed, the program prints each sentence it found in the text.

Sentence Detection Using OpenNLP

OpenNLP provides a predefined model named en-sent.bin for detecting sentences. This model is trained to detect sentences in raw English text.

The opennlp.tools.sentdetect package contains the classes and interfaces used for sentence detection.

To detect sentences using the OpenNLP library, you need to:

  • Load the en-sent.bin model using the SentenceModel class.
  • Instantiate the SentenceDetectorME class.
  • Detect the sentences using the sentDetect() method of this class.

Follow the steps below to write a program that detects the sentences in a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

  • Create an InputStream object for the model (instantiate a FileInputStream and pass the path of the model, as a String, to its constructor).
  • Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor.

Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect provides methods that split raw text into sentences. It uses a Maximum Entropy model to evaluate the end-of-sentence characters in a string and decide whether each one really marks the end of a sentence.

Instantiate this class, passing the model object created in the previous step to its constructor.

Step 3: Detecting the sentence

The sentDetect() method of the SentenceDetectorME class detects the sentences in the raw text passed to it. It accepts the text as a String parameter and returns the detected sentences as a String array.

Invoke this method by passing the raw text to it as a String.

Example

A complete program that detects the sentences in a given raw text can be saved in a file named SentenceDetectionME.java.
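The three steps above can be put together roughly as follows. This is a minimal sketch, not the original listing: it assumes the OpenNLP library is on the classpath and that en-sent.bin sits in the working directory (or that its path is passed as the first argument), and the sample text is made up for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionME {

    public static void main(String[] args) throws IOException {
        // Assumption: model path comes from the first argument, else the working directory.
        String modelPath = args.length > 0 ? args[0] : "en-sent.bin";
        // Hypothetical sample text.
        String text = "Hi. How are you? Welcome to OpenNLP. Mr. Smith wrote this example.";

        // Step 1: load the model; try-with-resources closes the stream afterwards.
        SentenceModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new SentenceModel(modelIn);
        }

        // Step 2: instantiate the detector with the loaded model.
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Step 3: detect the sentences and print one per line.
        String[] sentences = detector.sentDetect(text);
        for (String sentence : sentences) {
            System.out.println(sentence);
        }
    }
}
```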

Compile and execute the saved Java file from the command prompt with javac SentenceDetectionME.java followed by java SentenceDetectionME (the OpenNLP library must be on the classpath).

When executed, the program reads the given String, detects the sentences in it, and prints them.

Detecting the Positions of the Sentences

You can also detect the positions of the sentences, using the sentPosDetect() method of the SentenceDetectorME class.

Follow the steps below to write a program that detects the positions of the sentences in a given raw text.

Step 1: Loading the model

The model for sentence detection is represented by the class SentenceModel, which belongs to the package opennlp.tools.sentdetect.

To load a sentence detection model:

  • Create an InputStream object for the model (instantiate a FileInputStream and pass the path of the model, as a String, to its constructor).
  • Instantiate the SentenceModel class and pass the InputStream object of the model as a parameter to its constructor.

Step 2: Instantiating the SentenceDetectorME class

The SentenceDetectorME class of the package opennlp.tools.sentdetect provides methods that split raw text into sentences. It uses a Maximum Entropy model to evaluate the end-of-sentence characters in a string and decide whether each one really marks the end of a sentence.

Instantiate this class, passing the model object created in the previous step to its constructor.

Step 3: Detecting the position of the sentence

The sentPosDetect() method of the SentenceDetectorME class detects the positions of the sentences in the raw text passed to it. It accepts the text as a String parameter.

Invoke this method by passing the raw text to it as a String.

Step 4: Printing the spans of the sentences

The sentPosDetect() method of the SentenceDetectorME class returns an array of objects of the type Span. The Span class of the opennlp.tools.util package stores the start and end offsets of a span as integers.

Store the spans returned by sentPosDetect() in a Span array and print them.

Example

A program that finds the positions of the sentences in a given raw text can be saved in a file named SentenceDetectionME.java.
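The four steps above might be combined as in the sketch below. It is an illustration under assumptions, not the original listing: the OpenNLP library is expected on the classpath, en-sent.bin in the working directory (or passed as the first argument), and the sample text is hypothetical. The class name follows the file name given above, which the source also uses for the earlier example, so rename one of the two if you keep both in the same directory.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetectionME {

    public static void main(String[] args) throws IOException {
        // Assumption: model path comes from the first argument, else the working directory.
        String modelPath = args.length > 0 ? args[0] : "en-sent.bin";
        // Hypothetical sample text.
        String text = "Hi. How are you? Welcome to OpenNLP.";

        // Steps 1 and 2: load the model and create the detector.
        SentenceModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new SentenceModel(modelIn);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // Step 3: detect the sentence positions as character offsets into text.
        Span[] spans = detector.sentPosDetect(text);

        // Step 4: print each span using Span's own toString().
        for (Span span : spans) {
            System.out.println(span);
        }
    }
}
```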

Compile and execute the saved Java file from the command prompt with javac SentenceDetectionME.java followed by java SentenceDetectionME (the OpenNLP library must be on the classpath).

When executed, the program reads the given String, finds the sentences in it, and prints their spans.

Sentences along with their Positions

The substring() method of the String class takes a begin offset and an end offset and returns the corresponding part of the string. Combined with the offsets stored in each Span, it lets you print the sentences together with their spans (positions).

A program that detects the sentences in a given raw text and displays them along with their positions can be saved in a file named SentencesAndPosDetection.java.
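One way to combine substring() with the detected spans is sketched below. As before, this is an assumption-laden illustration rather than the original listing: the model location and the sample text are hypothetical, and the OpenNLP library must be on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentencesAndPosDetection {

    public static void main(String[] args) throws IOException {
        // Assumption: model path comes from the first argument, else the working directory.
        String modelPath = args.length > 0 ? args[0] : "en-sent.bin";
        // Hypothetical sample text.
        String text = "Hi. How are you? Welcome to OpenNLP.";

        SentenceModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new SentenceModel(modelIn);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        Span[] spans = detector.sentPosDetect(text);

        // getStart() is inclusive and getEnd() is exclusive,
        // which matches the contract of String.substring(begin, end).
        for (Span span : spans) {
            String sentence = text.substring(span.getStart(), span.getEnd());
            System.out.println(sentence + " " + span);
        }
    }
}
```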

Compile and run the saved Java file from the command prompt with javac SentencesAndPosDetection.java followed by java SentencesAndPosDetection (the OpenNLP library must be on the classpath).

When executed, the program reads the given String, detects the sentences, and prints each one along with its position.

Sentence Probability Detection

The getSentenceProbabilities() method of the SentenceDetectorME class returns the probabilities associated with the most recent call to the sentDetect() method.

A program that prints these probabilities after a call to sentDetect() can be saved in a file named SentenceDetectionMEProbs.java.
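A minimal sketch of such a program follows. The model location and the sample text are assumptions made for illustration, the OpenNLP library must be on the classpath, and the sentences and probabilities are printed in separate loops so the code does not rely on the two arrays having the same length.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionMEProbs {

    public static void main(String[] args) throws IOException {
        // Assumption: model path comes from the first argument, else the working directory.
        String modelPath = args.length > 0 ? args[0] : "en-sent.bin";
        // Hypothetical sample text.
        String text = "Hi. How are you? Welcome to OpenNLP.";

        SentenceModel model;
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            model = new SentenceModel(modelIn);
        }
        SentenceDetectorME detector = new SentenceDetectorME(model);

        // The probabilities refer to this most recent sentDetect() call.
        String[] sentences = detector.sentDetect(text);
        for (String sentence : sentences) {
            System.out.println(sentence);
        }

        double[] probabilities = detector.getSentenceProbabilities();
        for (double probability : probabilities) {
            System.out.println(probability);
        }
    }
}
```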

Compile and run the saved Java file from the command prompt with javac SentenceDetectionMEProbs.java followed by java SentenceDetectionMEProbs (the OpenNLP library must be on the classpath).

When executed, the program reads the given String, detects the sentences, and prints them. It then prints the probabilities associated with the most recent call to the sentDetect() method.

All rights reserved © 2018 Wisdom IT Services India Pvt. Ltd