# Mahout Classification

## What is Classification?

Classification is a machine learning technique that uses known data to determine how new data should be assigned to a set of existing categories. For example:

• The iTunes application uses classification to prepare playlists.
• Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing which mails users habitually mark as spam. Based on that, the classifier decides whether a future mail should be delivered to your inbox or to the spam folder.

## How Classification Works

While classifying a given data set, the classifier system performs the following actions:

• Initially, a new data model is prepared using any of the learning algorithms.
• Then the prepared data model is tested.
• Thereafter, this data model is used to evaluate the new data and to determine its class.
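
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not Mahout code: the word-overlap scoring, the function names `train`, `evaluate`, and `classify`, and the sample messages are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(examples):
    """Prepare a data model: per-class word counts from labeled text."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model


def classify(model, text):
    """Use the model on new data: pick the class whose words overlap most."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


def evaluate(model, examples):
    """Test the model: fraction of held-out examples labeled correctly."""
    correct = sum(classify(model, text) == label for text, label in examples)
    return correct / len(examples)


# Hypothetical sample data, invented for this sketch.
training = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with team", "ham"),
]
held_out = [("win a cheap prize", "spam"), ("team meeting agenda", "ham")]

model = train(training)
accuracy = evaluate(model, held_out)
prediction = classify(model, "monday lunch agenda")
print(accuracy, prediction)
```

Keeping the held-out examples separate from the training examples matters: testing on the same data the model was built from would overstate its accuracy.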

## Applications of Classification

• Credit card fraud detection - The classification mechanism is used to predict credit card fraud. Using historical information about earlier frauds, the classifier can predict which future transactions are likely to be fraudulent.
• Spam e-mails - Depending on the characteristics of previous spam mails, the classifier determines whether a newly encountered e-mail should be sent to the spam folder.

## Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm. It provides two implementations:

• Distributed Naive Bayes classification
• Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms. A Bayes classifier constructs models to classify problem instances. These classifications are made using the available data.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.
For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

Despite their oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
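
To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is a single-machine illustration of the general technique, not Mahout's distributed implementation; the function names `train_nb` and `predict_nb` and the sample messages are assumptions made for the example. The "naive" part is the inner loop, which treats every word as independent of the others given the class.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Collect class counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical sample data, invented for this sketch.
nb_model = train_nb([
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with team", "ham"),
])
print(predict_nb(nb_model, "win a prize"))
print(predict_nb(nb_model, "monday team lunch"))
```

The parameters being estimated are just counts, which is why training is cheap and why a small amount of labeled data can already give usable estimates.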

## Procedure of Classification

To implement classification, follow these steps:

• Generate example data
• Create sequence files from data
• Convert sequence files to vectors
• Train the vectors
• Test the vectors

### Step 1: Generate Example Data

Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Create a directory for storing the input data, then download the example archive into it and extract it.
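
One way to script this step is sketched below. The helper name `fetch_dataset` and the working directory name used in the usage note are assumptions for illustration; only the URL comes from the text above. The function is defined but not called here, so nothing is downloaded until you invoke it.

```python
import tarfile
import urllib.request
from pathlib import Path

DATA_URL = "http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz"


def fetch_dataset(url, work_dir):
    """Download a .tar.gz archive into work_dir and unpack it there."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    archive = work_dir / url.rsplit("/", 1)[-1]
    if not archive.exists():  # skip the download if the archive is present
        urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(work_dir)
    # Report the top-level directories now available in work_dir.
    return sorted(p.name for p in work_dir.iterdir() if p.is_dir())
```

Calling `fetch_dataset(DATA_URL, "classification_example")` would create the (hypothetically named) `classification_example` directory, download the archive into it, and unpack it in place.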

### Step 2: Create Sequence Files

Create sequence files from the example data using the seqdirectory utility, which reads an input directory of text files and writes sequence files to an output directory.
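
As a rough sketch, the `seqdirectory` invocation can be assembled as an argument list in Python. This assumes the `mahout` launcher is on your `PATH`; the helper name `seqdirectory_cmd` and the directory names are hypothetical. The `-i`, `-o`, and `-ow` flags are the input, output, and overwrite options as they appear in Mahout's 20 newsgroups example; check the utility's help output for your version.

```python
import shlex


def seqdirectory_cmd(input_dir, output_dir, overwrite=True):
    """Build the argument list for `mahout seqdirectory`."""
    cmd = ["mahout", "seqdirectory", "-i", input_dir, "-o", output_dir]
    if overwrite:
        cmd.append("-ow")  # replace the output directory if it already exists
    return cmd


# Hypothetical directory names, chosen for this sketch.
seq_cmd = seqdirectory_cmd("20news-all", "20news-seq")
print(shlex.join(seq_cmd))
```

To actually run it, pass the list to `subprocess.run(seq_cmd, check=True)`.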

### Step 3: Convert Sequence Files to Vectors

Create vector files from the sequence files using the seq2sparse utility.
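
A sketch of a typical `seq2sparse` invocation follows, again built as an argument list. The helper name `seq2sparse_cmd` and the directory names are hypothetical, and the option set shown (`-lnorm`, `-nv`, `-wt tfidf`) mirrors Mahout's 20 newsgroups example rather than being a complete list; other options exist and may differ between versions.

```python
import shlex


def seq2sparse_cmd(seq_dir, vector_dir):
    """Build the argument list for `mahout seq2sparse`."""
    return [
        "mahout", "seq2sparse",
        "-i", seq_dir,     # sequence files from the previous step
        "-o", vector_dir,  # where the vector files are written
        "-lnorm",          # log-normalize the output vectors
        "-nv",             # emit named vectors, keeping document names
        "-wt", "tfidf",    # weight terms by TF-IDF instead of raw counts
    ]


# Hypothetical directory names, chosen for this sketch.
vec_cmd = seq2sparse_cmd("20news-seq", "20news-vectors")
print(shlex.join(vec_cmd))
```

TF-IDF weighting is a common choice for text because it down-weights words that appear in almost every document.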

### Step 4: Train the Vectors

Train a model on the generated vectors using the trainnb utility.
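
The training call can be sketched the same way. The helper name `trainnb_cmd` and the paths are hypothetical. The flags follow Mahout's 20 newsgroups example: `-i` for the training vectors, `-o` for the model directory, `-li` for the label index, `-ow` to overwrite, and `-c` to train the complementary variant. Some Mahout releases expect additional flags, so treat this as a starting point.

```python
import shlex


def trainnb_cmd(train_vectors, model_dir, label_index, complementary=False):
    """Build the argument list for `mahout trainnb`."""
    cmd = [
        "mahout", "trainnb",
        "-i", train_vectors,  # vectors to train on
        "-o", model_dir,      # where the trained model is written
        "-li", label_index,   # where the label index is written
        "-ow",                # overwrite existing output
    ]
    if complementary:
        cmd.append("-c")  # train Complementary Naive Bayes instead
    return cmd


# Hypothetical paths, chosen for this sketch.
train_cmd = trainnb_cmd("20news-train-vectors", "model", "labelindex")
print(shlex.join(train_cmd))
```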

### Step 5: Test the Vectors

Test the trained model against the test vectors using the testnb utility.