Language Modelling - Artificial Intelligence

The method of making decisions based on which statistical model has been best at predicting the past sequences of events (for example, using compression code length calculations as above) has a wide range of applications for agent decision making. In this chapter, we have been focusing on communication and language, and we can examine some further examples that illustrate this, specifically in the area of language modelling.

In a statistical (or probabilistic) model of language, the assumption is made that the symbols (e.g. words or characters) can be characterized by a set of conditional probabilities. A language model is a computer mechanism for determining these conditional probabilities. It assigns a probability to every possible sequence of symbols.

The most successful types of language models for sequences of symbols that occur in natural languages such as English and Welsh are word n-gram models (which base the probability of a word on the preceding n - 1 words) and part-of-speech n-gram models (which base the probability on the preceding words and parts of speech; these are also called n-pos models). Character n-gram models (models based on characters) have also been tried, although they do not feature as prominently in the literature as the other two classes of model. The probabilities for the models are estimated by collecting frequency statistics from a large corpus of text, called the training text, in a process called training. The training text is usually very large, containing many millions (and in some cases billions) of words.
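As a minimal sketch of the frequency-collection step in training, the following Python counts word n-grams in a toy training text. The function name `count_ngrams` and the ten-word text are assumptions made for illustration; a real training text would hold millions of words.

```python
from collections import Counter


def count_ngrams(words, n):
    """Count every run of n consecutive words in the list `words`."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Toy training text (hypothetical; chosen only to keep the counts easy to check).
training_text = "the cat sat on the mat and the cat slept".split()

unigram_counts = count_ngrams(training_text, 1)
bigram_counts = count_ngrams(training_text, 2)

# Show the three most frequent bigrams with their counts.
for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

The counts are stored per tuple of words, so the same function serves for unigrams, bigrams and trigrams by changing `n`.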

These models are often referred to as “Markov models” because they are based on the assumption that language is a Markov source. An n-gram model is called an order n - 1 Markov model (for example, a trigram model is an order 2 Markov model). In an n-gram model, the probabilities are conditioned on the previous words in the text. Formally, the probability of a sequence S of n words, $w_1, w_2, \ldots, w_n$, is given by:

$$P(S) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

Here, $w_i$ is called the prediction and $w_1, w_2, \ldots, w_{i-1}$ the history. Building a language model that uses the full history can be computationally expensive. To avoid this, n-gram language models make the assumption that the history is equivalent to the previous n - 1 words (called the conditioning context).
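The truncation of the history to a conditioning context can be sketched as follows. The helper `ngram_events` is a hypothetical name introduced here; it pairs each prediction with the last n - 1 words of its history, which is all an n-gram model consults.

```python
def ngram_events(words, n):
    """Return (prediction, conditioning context) pairs for an n-gram model."""
    events = []
    for i, word in enumerate(words):
        history = words[:i]  # the full history: every word before position i
        # Keep only the last n - 1 words; a unigram model (n = 1) keeps none.
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        events.append((word, context))
    return events


# Trigram view of a short sequence: contexts never exceed two words.
for prediction, context in ngram_events("we know what comes next".split(), 3):
    print(prediction, context)
```

Near the start of the sequence the history is shorter than n - 1 words, so the context is simply whatever history is available.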

For example, bigram models make the following approximation:

$$P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

In other words, only the previous word is used to condition the probability. Trigram models condition the probability on the two previous words:

$$P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$

The assumptions which these approximations are based upon are called “Markov assumptions.” For example, the bigram model makes the following Markov assumption:

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) = P(w_i \mid w_{i-1})$$

It might seem that such drastic assumptions would adversely affect the performance of the statistical models, but in practice, bigram and trigram models have been applied successfully to a wide range of domains, especially machine translation and speech recognition. “A high percentage of English speech and writing consists of stock phrases that reappear again and again; if someone is halfway through one of them, we know with near-certainty what his next few words will be.” The decision as to which assumptions yield better performance is an empirical issue rather than a theoretical one.
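The bigram approximation can be illustrated with a small Python sketch. It assumes unsmoothed relative-frequency estimates (the count of a word pair divided by the count of its first word) and a hypothetical start marker `<s>` so that the first word of a sentence has a context. These choices, the function names and the three toy sentences are assumptions for illustration rather than a method taken from the text.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker, not part of the source text


def train_bigram(sentences):
    """Collect context counts and word-pair counts from training sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split()
        context_counts.update(tokens[:-1])           # each word used as a context
        pair_counts.update(zip(tokens, tokens[1:]))  # each (previous, word) pair
    return context_counts, pair_counts


def bigram_prob(word, prev, context_counts, pair_counts):
    """Relative-frequency estimate of P(word | prev); 0.0 for an unseen context."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, word)] / context_counts[prev]


def sequence_prob(sentence, context_counts, pair_counts):
    """Multiply the bigram probabilities of every word in the sentence."""
    tokens = [START] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev, context_counts, pair_counts)
    return prob


# Toy training text (hypothetical).
sentences = ["the cat sat", "the cat slept", "the dog sat"]
context_counts, pair_counts = train_bigram(sentences)

p_seen = sequence_prob("the dog sat", context_counts, pair_counts)
p_unseen = sequence_prob("the dog slept", context_counts, pair_counts)
print(p_seen, p_unseen)
```

Because the estimates are unsmoothed, any sentence containing a word pair absent from the training text receives probability zero. Practical models usually adjust the estimates to avoid this.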
