Did you “backoff” from “Interpolation” to avoid “perplexity”?

Kaivalya Vanguri
Nov 22, 2023


Picture Courtesy: Kaivalya Vanguri


If you are familiar with Natural Language Processing, Information Retrieval or related domains, you have probably come across these key terms. Plenty of websites and blogs explain these concepts in much more detail; this post is meant as a brief overview to give you a head start. The terms can sound quite intriguing at first, and I will try my best to explain them in simpler words.

Let us understand by relating them to our day-to-day English words.

Backoff

To retreat from a particular idea or course of action.

Backoff in N-gram language models does a similar thing. When dealing with real-world datasets, finding the right value of N for your N-gram model is mostly trial and error, because many longer word sequences simply never show up in the training data. With backoff, we give up on the higher-order N-gram whenever it has not been observed often enough to give a reliable probability estimate, and fall back to a lower-order N-gram instead. This works well when the dataset is sufficiently large and calls for a more generalized N-gram model. Backoff belongs to the family of smoothing techniques that tackle the problems of probability estimation on sparse data.

For instance, you may start with a 4-gram and end up backing off to a 2-gram to improve the probability estimates of your language model.

4-gram -> 3-gram -> 2-gram

Hence, backoff can be summarized as a smoothing technique that tackles sparsity and improves the generalization power of NLP models by using less context for contexts the model has not seen enough of.

Here is a generalized mathematical form of backoff smoothing:

P(w_n | w_{n-1}, w_{n-2}, …, w_1) = λ · P(w_n | w_{n-1}, …, w_{n-N+1}) + (1 − λ) · P(w_n | w_{n-1}, …, w_{n-N+2})

where λ is the backoff weight, a constant between 0 and 1.
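To make the idea concrete, here is a minimal Python sketch in the spirit of “stupid backoff” (a fixed penalty α for every backoff step) rather than a fully normalized scheme such as Katz backoff. The helper names and the toy corpus are assumptions made purely for illustration, and the values returned are scores rather than true probabilities.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Count every n-gram (as a tuple) up to order max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(word, context, counts, total_tokens, alpha=0.4):
    """Score word given context, backing off to shorter contexts.

    If the full n-gram was seen, return its relative frequency;
    otherwise drop the oldest context word, apply a fixed penalty
    alpha, and retry with the shorter n-gram (4-gram -> 3-gram -> ...).
    Note: these are scores, not properly normalized probabilities.
    """
    if not context:
        # No context left: fall back to the unigram relative frequency.
        return counts[(word,)] / total_tokens
    ngram = context + (word,)
    if counts[ngram] > 0 and counts[context] > 0:
        return counts[ngram] / counts[context]
    return alpha * backoff_score(word, context[1:], counts, total_tokens, alpha)

tokens = "the cat sat on the mat the cat sat on the rug".split()
counts = ngram_counts(tokens, max_n=3)
print(backoff_score("mat", ("on", "the"), counts, len(tokens)))  # trigram was seen
print(backoff_score("rug", ("sat", "the"), counts, len(tokens)))  # unseen trigram, backs off to the bigram
```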

Interpolation

This is one of the most intriguing terms if you aren't a mathlete. Nevertheless, it is simply the process of estimating unknown values from known ones: you recognize the pattern in the data you already have and combine that existing knowledge to fill in the gaps.

This is also a smoothing technique, just like backoff, but it is better suited to smaller datasets. Rather than falling back only when a higher-order N-gram is missing, interpolation makes use of all the available knowledge by combining the probabilities from multiple N-gram orders into one weighted estimate. This works well on smaller datasets because every order contributes to every estimate, so the limited evidence you have goes further.
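Here is a rough sketch of what combining multiple N-gram orders looks like in code, mixing trigram, bigram and unigram maximum-likelihood estimates with fixed weights. The weights, helper names and toy corpus below are assumptions for illustration only.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every n-gram (as a tuple) up to order max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def mle(word, context, counts, total_tokens):
    """Maximum-likelihood estimate of P(word | context) for one order."""
    if not context:
        return counts[(word,)] / total_tokens
    if counts[context] == 0:
        return 0.0
    return counts[context + (word,)] / counts[context]

def interpolated(word, context, counts, total_tokens, lambdas=(0.6, 0.3, 0.1)):
    """Mix trigram, bigram and unigram estimates; the weights sum to 1."""
    l3, l2, l1 = lambdas
    return (l3 * mle(word, context[-2:], counts, total_tokens)
            + l2 * mle(word, context[-1:], counts, total_tokens)
            + l1 * mle(word, (), counts, total_tokens))

tokens = "the cat sat on the mat the cat sat on the rug".split()
counts = ngram_counts(tokens)
# Every order contributes, even when the trigram itself was never seen:
print(interpolated("rug", ("sat", "the"), counts, len(tokens)))
```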

Conditional Interpolation

You can modify the classic interpolation formula by letting the weights depend on the context, so that the N-gram orders which work best for a particular part of your data receive more weight. The values of the weights λ are usually computed with the Expectation-Maximization (EM) algorithm, which iterates over a held-out portion of the data until it converges on an optimal setting. Each computed weight λ multiplies the corresponding N-gram probability in the linear form of interpolation. The λ values are generally different for each probability, but they must always sum to 1.

Generalized Linear form of Interpolation (written here for a trigram model):

P̂(w_n | w_{n-2}, w_{n-1}) = λ_1 · P(w_n | w_{n-2}, w_{n-1}) + λ_2 · P(w_n | w_{n-1}) + λ_3 · P(w_n), where λ_1 + λ_2 + λ_3 = 1
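As a rough sketch of how those λ values could be learned, the function below runs a simple EM loop over a held-out text, reusing the mle helper, counts and tokens from the interpolation sketch above; the held-out sentence and the number of iterations are assumptions for illustration.

```python
def estimate_lambdas(heldout_tokens, counts, total_tokens, n_iter=20):
    """Estimate (trigram, bigram, unigram) interpolation weights by EM.

    E-step: for each held-out token, compute how much responsibility each
    order takes for generating it under the current weights.
    M-step: re-normalize the accumulated responsibilities into new weights.
    The weights always sum to 1.
    """
    lambdas = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(n_iter):
        resp_totals = [0.0, 0.0, 0.0]
        for i in range(2, len(heldout_tokens)):
            w = heldout_tokens[i]
            h = tuple(heldout_tokens[i - 2:i])
            comps = [mle(w, h, counts, total_tokens),        # trigram
                     mle(w, h[-1:], counts, total_tokens),   # bigram
                     mle(w, (), counts, total_tokens)]       # unigram
            weighted = [l * p for l, p in zip(lambdas, comps)]
            denom = sum(weighted)
            if denom == 0:
                continue  # word unseen at every order: skip it
            for k in range(3):
                resp_totals[k] += weighted[k] / denom
        total = sum(resp_totals)
        if total == 0:
            break
        lambdas = [r / total for r in resp_totals]
    return lambdas

heldout = "the cat sat on the mat".split()
print(estimate_lambdas(heldout, counts, len(tokens)))  # three weights that sum to 1
```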

Perplexity

A complicated or baffling situation or thing.

Perplexity in NLP measures how “surprised” your model is when it is given new data. It is a metric that evaluates how well your model predicts the test data: the lower the perplexity, the better the model. It can be used to check how well the model has learned the distribution of the text it was trained on. It is a positive value; for a proper probability model it ranges from 1 to infinity.

Mathematically, the perplexity (PP) of a language model on a test set is defined as the inverse probability of the test set, normalized by the number of words: PP(W) = P(w_1 w_2 … w_N)^{-1/N}. It is mostly used to compare different language models on the same test data.

Perplexity is also commonly abbreviated as PP.
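As a small illustration, here is one way to compute perplexity from per-word conditional probabilities, for example the interpolated estimator sketched earlier (reusing interpolated, counts and tokens from that sketch); the names here are illustrative assumptions, not a standard API.

```python
import math

def perplexity(test_tokens, prob_fn):
    """Perplexity = inverse probability of the test set, normalized by
    the number of words: exp(-(1/N) * sum(log P(w_i | history)))."""
    log_prob = 0.0
    for i, w in enumerate(test_tokens):
        context = tuple(test_tokens[max(0, i - 2):i])
        p = prob_fn(w, context)
        if p == 0:
            return float("inf")  # one zero-probability word makes PP infinite
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))

test = "the cat sat on the rug".split()
# prob_fn can be any conditional estimator, e.g. the interpolated one above:
print(perplexity(test, lambda w, h: interpolated(w, h, counts, len(tokens))))
```

Because the log probabilities are averaged per word, two models can be compared on the same test set regardless of its length.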

I hope this explanation made these terms a little easier to understand. Namaste, and thank you for giving this a read.
