Mahout Classification

What is Classification?

Classification is a machine learning technique that uses known data to determine how new data should be assigned to a set of existing categories. For example:

  • The iTunes application uses classification to prepare playlists.
  • Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The classifier trains itself by analyzing which mails users mark as spam. Based on that, it decides whether a future mail should be delivered to your inbox or to the spam folder.

How Classification Works

While classifying a given data set, the classifier system performs the following steps:

  • Initially, a new data model is prepared using any of the learning algorithms.
  • Then the prepared data model is tested.
  • Thereafter, this data model is used to evaluate the new data and to determine its class.
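The three steps above can be sketched in a few lines of Python. This is only an illustration, not Mahout code: the "learning algorithm" is a simple nearest-centroid rule chosen for brevity, and the feature values, labels, and function names are hypothetical.

```python
from collections import defaultdict
from math import dist

def train(examples):
    """Step 1: prepare a model. Here the model is one mean point (centroid) per class."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in examples:
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}

def classify(model, point):
    """Step 3: assign the class whose centroid is closest to the new point."""
    return min(model, key=lambda label: dist(model[label], point))

def accuracy(model, examples):
    """Step 2: test the model on labelled examples that were not used for training."""
    correct = sum(1 for point, label in examples if classify(model, point) == label)
    return correct / len(examples)

# Hypothetical toy data: (amount in thousands, transactions per hour) -> label
training = [((0.1, 1), "ok"), ((0.3, 2), "ok"), ((0.2, 1), "ok"),
            ((5.0, 9), "fraud"), ((4.0, 8), "fraud"), ((6.0, 10), "fraud")]
held_out = [((0.2, 2), "ok"), ((5.5, 9), "fraud")]

model = train(training)
print(accuracy(model, held_out))   # fraction of held-out examples classified correctly
print(classify(model, (4.5, 7)))   # class assigned to a new, unlabelled point
```

Keeping the test examples separate from the training examples matters: a model that is only checked against data it has already seen can look more accurate than it is.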

[Figure: how classification works]

Applications of Classification

  • Credit card fraud detection - The classification mechanism is used to detect credit card fraud. Using historical records of earlier frauds, the classifier can predict which future transactions are likely to be fraudulent.
  • Spam e-mails - Based on the characteristics of previous spam mails, the classifier determines whether a newly received e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout implements the Naive Bayes classifier algorithm. It provides two implementations:

  • Distributed Naive Bayes classification
  • Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms that share one assumption: given the class, the value of each feature is independent of the value of every other feature. A Bayes classifier builds a model from the available training data and uses it to assign class labels to problem instances.

An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

Despite their oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
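To make the idea concrete, here is a minimal sketch of a textbook multinomial naive Bayes classifier for short texts, with add-one (Laplace) smoothing. It is not Mahout's implementation, which runs on weighted term vectors; the function names and the tiny data set are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """Collect the counts the classifier needs from (words, label) pairs."""
    class_docs = Counter()               # number of training documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocabulary = set()
    for words, label in documents:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary

def predict_nb(model, words):
    """Return the class with the highest log posterior for the given words."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # log P(class) + sum of log P(word | class), treating words as independent
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # add-one smoothing keeps unseen (word, class) pairs from zeroing the score
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data
training = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["project", "report", "now"], "ham"),
]
model = train_nb(training)
print(predict_nb(model, ["money", "offer", "now"]))
print(predict_nb(model, ["project", "meeting"]))
```

Training here amounts to counting, which is why the parameters can be estimated from relatively little data and why the method trains quickly.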

Procedure of Classification

Follow these steps to implement classification:

  • Generate example data
  • Create sequence files from data
  • Convert sequence files to vectors
  • Train the vectors
  • Test the vectors

Step 1: Generate Example Data

Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Create a directory for storing the input data, then download the example archive into it and extract it.
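One way to script this step is sketched below using only the Python standard library. The function name and the directory name `classification_example` are assumptions made for illustration; any working directory will do.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz"

def fetch_and_extract(url, work_dir):
    """Download the archive into work_dir and unpack it there; return the archive path."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    archive = work_dir / url.rsplit("/", 1)[-1]
    if not archive.exists():  # skip the download when the file is already present
        urllib.request.urlretrieve(url, str(archive))
    # Use the safer "data" extraction filter where this Python version supports it
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(work_dir, **extra)
    return archive

# Example call (hypothetical directory name, requires network access):
# fetch_and_extract(DATA_URL, "classification_example")
```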

Step 2: Create Sequence Files

Create sequence files from the example data using the seqdirectory utility, which takes an input directory (-i) and an output directory (-o).
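A minimal sketch of the invocation, wrapped in Python so that it can be scripted. It assumes the `mahout` launcher is on the PATH; the directory names are hypothetical, and `-ow` (overwrite existing output) follows Mahout's 20 newsgroups example script, so check it against the version you have installed.

```python
import shutil
import subprocess

def seqdirectory_command(input_dir, output_dir):
    """Build the argument list for `mahout seqdirectory`."""
    # -i: directory of raw text files, -o: where sequence files are written,
    # -ow: overwrite the output directory if it already exists
    return ["mahout", "seqdirectory", "-i", input_dir, "-o", output_dir, "-ow"]

def run_if_available(command):
    """Run the command only when its executable is on PATH; return whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True

# Hypothetical paths inside the working directory
cmd = seqdirectory_command("classification_example/20news-all",
                           "classification_example/20news-seq")
print(" ".join(cmd))
# run_if_available(cmd)  # uncomment to execute on a machine with Mahout installed
```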

Step 3: Convert Sequence Files to Vectors

Create vector files from the sequence files using the seq2sparse utility. Its options control, among other things, the input and output paths, the term weighting scheme, and normalization.
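The sketch below assembles one plausible seq2sparse command line. The flags shown (`-lnorm` for log normalization, `-nv` for named vectors, `-wt` for the weighting scheme) are the ones used in Mahout's 20 newsgroups example script; the paths and the helper name are hypothetical, and the full option list should be taken from `mahout seq2sparse --help` for your version.

```python
import shlex

def seq2sparse_command(seq_dir, vectors_dir, weighting="tfidf"):
    """Build the argument list for `mahout seq2sparse`."""
    if weighting not in ("tf", "tfidf"):
        raise ValueError("weighting must be 'tf' or 'tfidf'")
    return ["mahout", "seq2sparse",
            "-i", seq_dir,        # sequence files produced in the previous step
            "-o", vectors_dir,    # output directory for the vectors
            "-lnorm",             # log-normalize the output vectors
            "-nv",                # write named vectors
            "-wt", weighting]     # term weighting: tf or tfidf

# Hypothetical paths; the printed line can be pasted into a shell
print(shlex.join(seq2sparse_command("classification_example/20news-seq",
                                    "classification_example/20news-vectors")))
```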

Step 4: Train the Vectors

Train a model on the generated vectors using the trainnb utility. It reads the training vectors (-i) and writes the model (-o) and a label index (-li).
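A hedged sketch of a trainnb invocation follows. The `-c` flag selects the complementary variant, as in Mahout's 20 newsgroups example script; the paths and the helper name are hypothetical, and some Mahout versions accept additional flags that are not shown here.

```python
import shlex

def trainnb_command(train_vectors, model_dir, label_index, complementary=False):
    """Build the argument list for `mahout trainnb`."""
    cmd = ["mahout", "trainnb",
           "-i", train_vectors,   # vectors to train on
           "-o", model_dir,       # where the trained model is written
           "-li", label_index,    # where the label index is written
           "-ow"]                 # overwrite existing output
    if complementary:
        cmd.append("-c")          # train Complementary Naive Bayes instead of the standard variant
    return cmd

# Hypothetical paths
print(shlex.join(trainnb_command("classification_example/20news-train-vectors",
                                 "classification_example/model",
                                 "classification_example/labelindex",
                                 complementary=True)))
```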

Step 5: Test the Vectors

Test the model using the testnb utility. It reads the test vectors (-i), the trained model (-m), and the label index (-l), and writes its results to an output directory (-o).

© 2018 Wisdom IT Services India Pvt. Ltd. All rights reserved.
