# Entropy and Information - Artificial Intelligence

In information theory, entropy provides a means for measuring the information transmitted between communicating agents down a communication channel. Entropy is the average number of bits required to communicate a message between two agents down a communication channel. The fundamental coding theorem (Shannon) states that the lower bound to the average number of bits per symbol needed to encode a message (i.e. a sequence of symbols such as text) sent down a communication channel is given by its entropy:

$$H(P) = -\sum_{i=1}^{k} p(x_i) \log p(x_i)$$

where there are k possible symbols with probability distribution $P = p(x_1), p(x_2), \ldots, p(x_k)$ and where the probabilities are independent and sum to 1. If the base of the logarithm is 2, then the unit of measurement is given in bits. For example, suppose that an alphabet contains five symbols White, Black, Red, Blue and Pink with probabilities 2
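The entropy formula can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name `entropy` and the five probabilities below are assumptions chosen so the arithmetic is easy to follow, and they are not the values from the example in the text.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    probs = list(probabilities)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability symbols contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distribution over the five symbols (illustrative values only).
example = {"White": 0.5, "Black": 0.25, "Red": 0.125, "Blue": 0.0625, "Pink": 0.0625}
h = entropy(example.values())
print(f"{h:.3f} bits per symbol")
```

With these assumed probabilities, each symbol's ideal code length is a whole number of bits (1, 2, 3, 4 and 4), so the entropy is simply the probability-weighted average of those lengths.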

The entropy is a measure of how much uncertainty is involved in the selection of a symbol – the greater the entropy, the greater the uncertainty. It can also be considered a measure of the “information content” of the message – more probable messages convey less information than less probable ones. Entropy can be directly related to compression. The entropy is a lower bound on the average number of bits per symbol required to encode a long string of text drawn from a particular source language (Brown et al.). Bell, Cleary & Witten (1990) show that arithmetic coding, a method of assigning codes to symbols with a known probability distribution, achieves an average code length arbitrarily close to the entropy.
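Both readings of entropy, as uncertainty and as information content, can be checked numerically. The sketch below uses assumed probabilities and hypothetical helper names (`information_content`, `entropy`) purely for illustration.

```python
import math

def information_content(p):
    """Bits of information conveyed by an outcome that has probability p."""
    return -math.log2(p)

def entropy(probs):
    """Expected information content of a distribution, in bits."""
    return sum(p * information_content(p) for p in probs if p > 0)

# Two illustrative distributions over four symbols (assumed values).
uniform = [0.25, 0.25, 0.25, 0.25]   # every symbol equally likely
skewed = [0.7, 0.1, 0.1, 0.1]        # one symbol dominates

# A less probable message carries more bits than a more probable one.
print(information_content(0.5), information_content(0.01))

# The uniform distribution is the more uncertain one, so its entropy is higher.
print(entropy(uniform), entropy(skewed))
```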

Hence, compression can be used directly to estimate an upper bound to the entropy in the following manner. Given a sequence of n symbols $x_1, x_2, \ldots, x_n$, the entropy can be estimated by summing the code lengths required to encode each symbol and averaging over the sequence:

$$H \approx -\frac{1}{n} \sum_{i=1}^{n} \log_2 p(x_i)$$

Here the code length for each symbol $x_i$ is calculated by using the formula $-\log_2 p(x_i)$.
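One way to make this estimate concrete is an adaptive model that assigns each symbol a probability from the symbols seen so far, as a compressor's decoder could also do. The sketch below is an assumption-laden illustration rather than the method used in the text: it uses a simple order-0 model with add-one (Laplace) smoothing, and the function name `adaptive_entropy_estimate` and the 26-letter alphabet are hypothetical.

```python
import math
from collections import Counter

def adaptive_entropy_estimate(sequence, alphabet_size):
    """Average code length in bits per symbol under an adaptive order-0 model."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter()
    total_bits = 0.0
    for seen, symbol in enumerate(sequence):
        # Add-one estimate based only on the symbols already encoded.
        p = (counts[symbol] + 1) / (seen + alphabet_size)
        total_bits += -math.log2(p)   # code length for this symbol
        counts[symbol] += 1           # update the model after coding
    return total_bits / len(sequence)

# Assumed input: lowercase text over a 26-letter alphabet.
text = "abracadabra" * 20
print(adaptive_entropy_estimate(text, alphabet_size=26))
```

Because the model starts with no knowledge of the source, early symbols cost more bits than later ones, which is one reason the result is an upper bound rather than the entropy itself.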

The entropy calculated in this manner is useful because it provides a measure of how well the statistical model is doing compared to other models. This is done by computing the entropy for each statistical model, and the model with the smallest entropy is inferred to be the “best”.
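The model comparison described above can be sketched as follows. Everything here is illustrative: the two models (a uniform model and a smoothed unigram model), the tiny training and test strings, and the helper names are assumptions made for the example, not part of the original text.

```python
import math
from collections import Counter

def cross_entropy(sequence, model):
    """Average of -log2 p(x_i) over the sequence, in bits per symbol."""
    return -sum(math.log2(model[s]) for s in sequence) / len(sequence)

def uniform_model(alphabet):
    """Every symbol in the alphabet is equally likely."""
    return {s: 1 / len(alphabet) for s in alphabet}

def unigram_model(training, alphabet):
    """Symbol frequencies from training text, with add-one smoothing."""
    counts = Counter(training)
    total = len(training) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}

# Assumed alphabet and toy data: lowercase letters plus space.
alphabet = "abcdefghijklmnopqrstuvwxyz "
training = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
test = "the lazy fox sleeps over there"

models = {
    "uniform": uniform_model(alphabet),
    "unigram": unigram_model(training, alphabet),
}
scores = {name: cross_entropy(test, model) for name, model in models.items()}
best = min(scores, key=scores.get)   # smallest entropy is taken as "best"
print(scores, best)
```

Scoring on text that was not used to build the models keeps the comparison fair, since a model could otherwise look good simply by memorising its training data.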