TIKA Language Detection - Apache Tika

Why is Language Detection Needed?

On a multilingual website, a language detection tool is used to classify documents based on the language they are written in. The tool accepts documents without language annotation (metadata), detects the language, and adds that information to the document's metadata.

Algorithms for Profiling Corpus

What is Corpus?

A corpus is a collection of texts of a written language that shows how the language is used in real situations. It is built from books, transcripts, and other data resources such as the Internet.
To detect the language of a document, a language profile is constructed from the document and compared with the profiles of known languages, which are built from such corpora.

What are Profiling Algorithms?

Profiling algorithms use dictionaries to detect languages: the words found in the text are compared with those in the dictionaries.
The most common words of a language, such as the English articles a, an, and the, form the simplest and most effective corpus for detecting that language.
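As an illustration, a toy detector in this spirit can count how many of each language's most common words occur in the text. The tiny dictionaries below are illustrative placeholders, not real language profiles:

```java
import java.util.List;
import java.util.Map;

public class StopWordDetector {

    // Tiny illustrative "dictionaries" of common words per language.
    private static final Map<String, List<String>> DICTIONARIES = Map.of(
        "en", List.of("a", "an", "the"),
        "de", List.of("der", "die", "das"),
        "fr", List.of("le", "la", "les")
    );

    // Returns the language whose dictionary words occur most often in the text.
    public static String detect(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, List<String>> entry : DICTIONARIES.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("The quick brown fox jumps over the lazy dog")); // prints "en"
    }
}
```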

Using Word Sets as Corpus

A simple algorithm uses word sets to find the distance between two corpora: the distance equals the sum of the differences between the frequencies of matching words.
The main challenges with this approach are as follows:
  • Since the number of matching words is very small, the algorithm cannot work efficiently with short texts of only a few sentences; it requires a lot of text for an accurate match.
  • It cannot find word boundaries for languages with compound words, or for languages that have no word dividers such as spaces or punctuation marks.
These difficulties can be overcome by using individual characters or character groups as the corpus instead of word sets.
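The word-frequency distance described above can be sketched in a few lines of Java. The class and method names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class WordSetDistance {

    // Builds a relative word-frequency profile of a text.
    static Map<String, Double> profile(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        Map<String, Double> freq = new HashMap<>();
        for (String word : words) {
            freq.merge(word, 1.0, Double::sum);
        }
        freq.replaceAll((word, count) -> count / words.length);
        return freq;
    }

    // Distance = sum of absolute frequency differences over matching words.
    static double distance(Map<String, Double> a, Map<String, Double> b) {
        double d = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            if (b.containsKey(entry.getKey())) {
                d += Math.abs(entry.getValue() - b.get(entry.getKey()));
            }
        }
        return d;
    }
}
```

Note that a short text shares few words with any reference profile, which is exactly the weakness listed above.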

Using Character Sets as Corpus

The characters commonly used in a language are finite in number, so you can apply an algorithm based on character frequencies rather than word frequencies. This algorithm works even better for character sets that are used in only one or very few languages.
However, it suffers from the following drawbacks:
  • It cannot differentiate two languages with similar character frequencies.
  • There is no specific tool or algorithm to identify a language when its character set (used as the corpus) is shared by multiple languages.

N-gram Algorithm

To address the above drawbacks, a new approach uses character sequences of a given length for profiling the corpus. Such sequences of characters are called N-grams, where N is the length of the character sequence.
  • The N-gram algorithm is an effective approach for language detection, especially for European languages such as English.
  • It also works well with short texts.
  • Although there are more advanced language-profiling algorithms that can detect multiple languages in a multilingual document, Tika uses the 3-gram algorithm, as it is suitable for most practical situations.
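Extracting the overlapping character 3-grams that such a profile is built from takes only a few lines of Java. The class name is illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class TrigramProfile {

    // Counts the overlapping character 3-grams of a text, the unit a
    // 3-gram language profile is built from.
    static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the" is among the most frequent trigrams in English text.
        System.out.println(trigrams("the theory"));
    }
}
```

Comparing such trigram counts against per-language reference profiles gives the language whose profile is closest to the text.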

Language Detection in Tika

Tika can detect only 18 of the 184 languages standardized by ISO 639-1. Language detection in Tika is performed with the getLanguage() method of the LanguageIdentifier class. This method returns the code of the language as a String. Given below is the list of the 18 language-code pairs detected by Tika:
da—Danish de—German et—Estonian el—Greek
en—English es—Spanish fi—Finnish fr—French
hu—Hungarian is—Icelandic it—Italian nl—Dutch
no—Norwegian pl—Polish pt—Portuguese ru—Russian
sv—Swedish th—Thai
While instantiating the LanguageIdentifier class, you should pass either the content to be analyzed in String format or a LanguageProfile class object.
An example program for language detection instantiates LanguageIdentifier with a sample String and prints the result of getLanguage(). Save such a program as LanguageDetection.java and run it from the command prompt with the Tika jar on the classpath.
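A minimal sketch of such a program, assuming Tika 1.x where LanguageIdentifier lives in the org.apache.tika.language package:

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageDetection {

    // Returns the ISO 639-1 code Tika detects for the given text.
    static String detect(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.getLanguage();
    }

    public static void main(String[] args) {
        String content = "This is a sample text written in the English language.";
        System.out.println("Language of the given content is: " + detect(content));
    }
}
```

Compile and run it with the Tika application jar on the classpath, for example `javac -cp tika-app.jar LanguageDetection.java` followed by `java -cp .:tika-app.jar LanguageDetection` (the jar file name depends on your Tika version).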

Language Detection of a Document

You can detect the language of a given document by combining the parse() method with LanguageIdentifier. The parse() method parses the content and stores it in the handler object, which is passed to it as one of the arguments. The String form of the handler object's content is then passed to the constructor of the LanguageIdentifier class.
A complete program parses a document (for example, a file named sample.txt), passes the extracted text to LanguageIdentifier, and prints the detected language code. Save such a program as SetMetadata.java and run it from the command prompt.
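A sketch of the complete program, again assuming Tika 1.x (AutoDetectParser, BodyContentHandler, Metadata, and LanguageIdentifier are Tika classes; the sample.txt path is an assumed example file):

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class SetMetadata {

    // Parses the file at the given path and returns the detected language code.
    static String detectLanguage(String path) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        try (InputStream stream = new FileInputStream(path)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        // The handler now holds the extracted text; hand it to LanguageIdentifier.
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        return identifier.getLanguage();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Language name : " + detectLanguage("sample.txt"));
    }
}
```

Assuming sample.txt contains a few sentences of English text, the program prints the language code en.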
Along with the Tika jar, Tika provides a Graphical User Interface (GUI) application and a Command Line Interface (CLI) application. Like other Java applications, a Tika application can also be executed from the command prompt.

