OpenNLP Referenced API - OpenNLP

What is the referenced API?

Let us look at the classes and methods that are used in the subsequent chapters of this tutorial.

Sentence Detection

SentenceModel class

This class represents the predefined model that is used to detect the sentences in the given raw text. It belongs to the package opennlp.tools.sentdetect.

The constructor of this class accepts an InputStream object of the sentence detector model file (en-sent.bin).

SentenceDetectorME class

This class belongs to the package opennlp.tools.sentdetect and contains methods to split raw text into sentences. It uses a maximum entropy model to evaluate the end-of-sentence characters in a string and decide whether each one marks the end of a sentence.

Following are the important methods of this class.

S.No Methods and Description
1

sentDetect()

This method is used to detect the sentences in the raw text passed to it. It accepts a String variable as a parameter and returns a String array which holds the sentences from the given raw text.

2

sentPosDetect()

This method is used to detect the positions of the sentences in the given text. It accepts a String variable representing the raw text and returns an array of objects of the type Span.

The class named Span of the opennlp.tools.util package stores the start and end offsets of a span as integers.

3

getSentenceProbabilities()

This method returns the probabilities associated with the most recent call to the sentDetect() method.
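The three methods above can be combined as in the following sketch. It is a minimal example under stated assumptions, not a definitive implementation: the class name SentenceDetectionSketch, the describe() helper and the sample text are made up for illustration, and the path to en-sent.bin is assumed to be passed as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

class SentenceDetectionSketch {

    // Hypothetical helper: pairs each span with the text it covers.
    // A Span holds a start offset (inclusive) and an end offset (exclusive).
    static String[] describe(Span[] spans, String text) {
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = "[" + spans[i].getStart() + ".." + spans[i].getEnd() + ") "
                    + spans[i].getCoveredText(text);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: SentenceDetectionSketch <path to en-sent.bin>");
            return;
        }
        String text = "Hi. How are you? Welcome to the tutorial.";
        try (InputStream in = new FileInputStream(args[0])) {
            SentenceModel model = new SentenceModel(in);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // sentDetect(): the sentences themselves.
            for (String sentence : detector.sentDetect(text)) {
                System.out.println(sentence);
            }
            // sentPosDetect(): the character offsets of each sentence.
            for (String line : describe(detector.sentPosDetect(text), text)) {
                System.out.println(line);
            }
            // Probabilities for the sentences found by the most recent detection call.
            for (double p : detector.getSentenceProbabilities()) {
                System.out.println(p);
            }
        }
    }
}
```

The describe() helper does not need a model, so it can be tried with hand-built Span objects before the model file is downloaded.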

Tokenization

TokenizerModel class

This class represents the predefined model that is used to tokenize the given sentence. It belongs to the package opennlp.tools.tokenize.

The constructor of this class accepts an InputStream object of the tokenizer model file (en-token.bin).

Classes

To perform tokenization, the OpenNLP library provides three main classes. All three classes implement the Tokenizer interface.

S.No Classes and Description
1

SimpleTokenizer

This class tokenizes the given raw text using character classes.

2

WhitespaceTokenizer

This class uses whitespaces to tokenize the given text.

3

TokenizerME

This class converts raw text into separate tokens. It uses a maximum entropy model to make its decisions.

These classes contain the following methods.

S.No Methods and Description
1

tokenize()

This method is used to tokenize the raw text. This method accepts a String variable as a parameter, and returns an array of Strings (tokens).

2

tokenizePos()

This method is used to get the positions (spans) of the tokens. It accepts the sentence or raw text as a String and returns an array of objects of the type Span.

In addition to the above two methods, the TokenizerME class has the getTokenProbabilities() method.

S.No Methods and Description
1

getTokenProbabilities()

This method is used to get the probabilities associated with the most recent call to the tokenize() or tokenizePos() method.
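The difference between the three tokenizers is easiest to see side by side. The following is a minimal sketch, assuming the OpenNLP library is on the classpath. The class name TokenizationSketch, its two helper methods and the sample text are illustrative only. SimpleTokenizer and WhitespaceTokenizer need no model file, while TokenizerME runs only if the path to the tokenizer model is passed as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

class TokenizationSketch {

    // Splits wherever the character class changes (letters, digits, whitespace, other).
    static String[] simple(String text) {
        return SimpleTokenizer.INSTANCE.tokenize(text);
    }

    // Splits on whitespace only, so punctuation stays attached to the words.
    static String[] whitespace(String text) {
        return WhitespaceTokenizer.INSTANCE.tokenize(text);
    }

    public static void main(String[] args) throws IOException {
        String text = "Hi. How are you?";
        System.out.println(Arrays.toString(simple(text)));
        System.out.println(Arrays.toString(whitespace(text)));

        // tokenizePos() reports character offsets instead of the token strings.
        for (Span span : SimpleTokenizer.INSTANCE.tokenizePos(text)) {
            System.out.println(span.getStart() + ".." + span.getEnd());
        }

        // TokenizerME needs a trained model; its path is assumed to be in args[0].
        if (args.length > 0) {
            try (InputStream in = new FileInputStream(args[0])) {
                TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
                System.out.println(Arrays.toString(tokenizer.tokenize(text)));
                // One probability per token from the call just above.
                System.out.println(Arrays.toString(tokenizer.getTokenProbabilities()));
            }
        }
    }
}
```

Because all three classes implement Tokenizer, code that only calls tokenize() or tokenizePos() can switch between them without other changes.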

Named Entity Recognition

TokenNameFinderModel class

This class represents the predefined model that is used to find the named entities in the given sentence. It belongs to the package opennlp.tools.namefind.

The constructor of this class accepts an InputStream object of the name finder model file (en-ner-person.bin).

NameFinderME class

This class also belongs to the package opennlp.tools.namefind and contains methods to perform NER tasks. It uses a maximum entropy model to find the named entities in the given text.

S.No Methods and Description
1

find()

This method is used to detect the names in the text. It accepts a String array holding the tokens of a sentence as a parameter and returns an array of objects of the type Span, whose start and end values are token indices.

2

probs()

This method is used to get the probabilities of the last decoded sequence.
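A short sketch of how find() and probs() fit together follows. It is illustrative only: the class name NameFinderSketch, the label() helper and the sample tokens are assumptions, and the path to the person name finder model is expected as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

class NameFinderSketch {

    // Hypothetical helper: turns token-index spans into "type: covered tokens".
    static String[] label(Span[] spans, String[] tokens) {
        String[] covered = Span.spansToStrings(spans, tokens);
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = spans[i].getType() + ": " + covered[i];
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: NameFinderSketch <path to the name finder model>");
            return;
        }
        // The sentence is tokenized by hand here to keep the sketch short.
        String[] tokens = {"Mike", "and", "Smith", "are", "good", "friends"};
        try (InputStream in = new FileInputStream(args[0])) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            Span[] spans = finder.find(tokens);
            for (String line : label(spans, tokens)) {
                System.out.println(line);
            }
            // Probabilities of the sequence decoded by the find() call above.
            System.out.println(Arrays.toString(finder.probs()));
        }
    }
}
```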

Finding the Parts of Speech

POSModel class

This class represents the predefined model that is used to tag the parts of speech of the given sentence. It belongs to the package opennlp.tools.postag.

The constructor of this class accepts an InputStream object of the POS tagger model file (en-pos-maxent.bin).

POSTaggerME class

This class belongs to the package opennlp.tools.postag and is used to predict the parts of speech of the given text. It uses a maximum entropy model to make its decisions.

S.No Methods and Description
1

tag()

This method is used to assign POS tags to the tokens of a sentence. It accepts an array of tokens (String) as a parameter and returns an array of tags (String).

2

probs()

This method is used to get the probabilities for each tag of the most recently tagged sentence.
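The following sketch tags a whitespace-tokenized sentence and prints the per-tag probabilities through the tagger's probs() method. The class name PosTaggingSketch, the pair() helper and the sample sentence are made up for illustration, and the path to the POS tagger model is assumed to be the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

class PosTaggingSketch {

    // Hypothetical helper: joins each token with its tag as "token_TAG".
    static String pair(String[] tokens, String[] tags) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(tokens[i]).append('_').append(tags[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PosTaggingSketch <path to the POS tagger model>");
            return;
        }
        // tag() works on tokens, so the sentence is tokenized first.
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("Hi welcome to the tutorial");
        try (InputStream in = new FileInputStream(args[0])) {
            POSTaggerME tagger = new POSTaggerME(new POSModel(in));
            String[] tags = tagger.tag(tokens);
            System.out.println(pair(tokens, tags));
            // One probability per tag of the sentence tagged just above.
            System.out.println(Arrays.toString(tagger.probs()));
        }
    }
}
```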

Parsing the Sentence

ParserModel class

This class represents the predefined model that is used to parse the given sentence. It belongs to the package opennlp.tools.parser.

The constructor of this class accepts an InputStream object of the parser model file (en-parser-chunking.bin).

ParserFactory class

This class belongs to the package opennlp.tools.parser and it is used to create parsers.

S.No Methods and Description
1

create()

This is a static method that is used to create a parser object. It accepts the ParserModel object that was built from the parser model file.

ParserTool class

This class is part of the opennlp.tools.cmdline.parser package and is used to parse the content.

S.No Methods and Description
1

parseLine()

This method of the ParserTool class is used to parse the raw text in OpenNLP. It accepts:

  • A String variable representing the text to be parsed.
  • A parser object.
  • An integer representing the number of parses to be carried out.
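The three parser classes are typically used in sequence: load the model, create the parser, then parse a line. The sketch below shows that flow under a few assumptions: the class name ParsingSketch and the sample sentence are illustrative, the path to the parser model file comes from the first command-line argument, and a single parse is requested.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;

class ParsingSketch {

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: ParsingSketch <path to the parser model>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            // Step 1: load the model from the file.
            ParserModel model = new ParserModel(in);
            // Step 2: create a parser from the model.
            Parser parser = ParserFactory.create(model);
            // Step 3: parse the text and ask for one parse.
            // Tokens are separated by spaces here, including the final period.
            String sentence = "Tutorials are useful for beginners .";
            Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
            for (Parse parse : parses) {
                parse.show(); // prints the bracketed parse tree to standard output
            }
        } catch (IOException e) {
            System.err.println("Could not load the parser model: " + e.getMessage());
        }
    }
}
```

Asking for more than one parse returns several candidate trees for the same sentence, which can be useful when the top parse looks wrong.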

Chunking

ChunkerModel class

This class represents the predefined model that is used to divide sentences into smaller chunks. It belongs to the package opennlp.tools.chunker.

The constructor of this class accepts an InputStream object of the chunker model file (en-chunker.bin).

ChunkerME class

This class is also part of the package opennlp.tools.chunker and is used to divide the given sentence into smaller chunks.

S.No Methods and Description
1

chunk()

This method is used to divide the given sentence into smaller chunks. It accepts the tokens of a sentence and their parts-of-speech tags as parameters.

2

probs()

This method returns the probabilities of the last decoded sequence.
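As a rough illustration of chunk() and probs(), the sketch below feeds the chunker a hand-tokenized sentence. Several things here are assumptions: the class name ChunkingSketch and the phrases() helper are made up, the POS tags are written by hand in Penn Treebank style (in practice they would come from POSTaggerME), and the path to the chunker model is taken from the first command-line argument. It also uses chunkAsSpans(), a ChunkerME method not listed above, to group the tokens into phrases.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.util.Span;

class ChunkingSketch {

    // Hypothetical helper: renders each chunk span as "type: tokens it covers".
    static String[] phrases(Span[] spans, String[] tokens) {
        String[] covered = Span.spansToStrings(spans, tokens);
        String[] out = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            out[i] = spans[i].getType() + ": " + covered[i];
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: ChunkingSketch <path to the chunker model>");
            return;
        }
        String[] tokens = {"He", "reads", "the", "book"};
        // Hand-written tags for illustration; normally the output of POSTaggerME.tag().
        String[] tags = {"PRP", "VBZ", "DT", "NN"};
        try (InputStream in = new FileInputStream(args[0])) {
            ChunkerME chunker = new ChunkerME(new ChunkerModel(in));
            // chunk(): one chunk tag per token.
            System.out.println(Arrays.toString(chunker.chunk(tokens, tags)));
            // probs(): probabilities of the sequence decoded just above.
            System.out.println(Arrays.toString(chunker.probs()));
            // chunkAsSpans(): the same result grouped into token-index spans.
            for (String line : phrases(chunker.chunkAsSpans(tokens, tags), tokens)) {
                System.out.println(line);
            }
        }
    }
}
```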
