Information Theory in Machine Learning


Information can be represented as bits. One bit of information allows us to choose between two equally probable, or equiprobable, alternatives. In other words, in a scenario with 2 equiprobable choices, if you get an instruction to choose one of the choices, that is one bit of information. Say, each instruction can be represented as a binary digit (0=‘choice 1’ and 1=‘choice 2’) then this binary digit provides you with one bit of information. In the case of ‘n’ sequential forks, with ‘m’ final choices (destinations), then m= 2n. In other words, n= log2m. Information is thus, an ordered symbol of sequences to interpret its meaning.
Please note: A binary digit is the value of a binary variable, whereas a bit is an amount of information. A binary digit (when averaged over both of its possible states) can convey between zero and one bit of information.
By now, we know what information means. It is a mathematical representation of uncertainty. Let us now understand what information theory is.
Figure 1: A binary symmetric channel showing probability of incorrect transmission of message given the transmitted symbol is x and the received symbol is y
When we transmit data, for instance, a string of bits, over a communication channel or any computational medium, there is a probability (say ‘p’) that the received message will not be identical to the transmitted message. An ideal channel is where this probability § is zero (or almost 0). However, most real world channels have a non zero probability ‘f’ of incorrect transmission of information and a probability of (1 minus f) that each bit of information will be transmitted correctly. Given this probability of error, one needs to find solutions to reduce the same in order to transmit information with minimum error. One can use the ‘physical solution’ where one needs to revamp the physical attributes of the communication channel, for instance the circuitry. However, this might increase the operational cost. An alternative solution is to update the system using ‘information theory’ and ‘coding theory’. Under these solutions, we accept the given noisy physical channel in its current form. We then add communication systems to it so that we can detect and correct the errors introduced by the channel. This system normally comprises of an encoder and decoder. The ‘system’ solutions can help you design a reliable communication channel with the only additional cost of computations of an encoder and a decoder. While coding theory helps you design the appropriate decoder and encoder, information theory helps you define the quality of information or content you can transmit. It can help you study the theoretical limitations and potentials of a system.
Information theory treats information as a physical entity, like energy or mass. It deals with theoretical analyses of how information can be transmitted over any channel: natural or man-made. Thus, it defines a few laws of information. Let us assume a basic system for information flow as follows:
Figure 2: Information Channel: A message (data) is encoded before being fed into a communication channel, which adds noise. The channel output has the encoded message with noise that is decoded by a receiver to recover the message.
There are a few laws on information that can be derived from the above channel:
  • There is a definite upper limit, the channel capacity, to the amount of information that can be communicated through that channel.
  • This limit shrinks as the amount of noise in the channel increases.
  • This limit can very nearly be reached by judicious packaging, or encoding, of data.
Essentially, information theory entails two broad techniques:
  1. Data Compression (source coding): More frequent events should have shorter encodings
  2. Error Correction (channel coding): Should be able to infer encoded event even if message is corrupted by noise
Both these methods require you to build probabilistic models of the data sources. This is why information theory is relevant to machine learning and data analytics. It not only helps you measure the accuracy of information contained in a data source, but also helps you improve results of predictive models that might be built on this data. Before we study how information theory can be applied to, let us study a few basic terms.

Few important terms of Information Theory and Machine Learning


Information theory represents data in the form of codewords. These codewords are representation of the actual data elements in the form of sequence of binary digits or bits. There are various techniques to map each symbol (data element) with the corresponding codeword.

Variable Length Codes

One traditional way of assigning codewords to data elements would be to assign codes with fixed lengths, or every data element gets assigned a code of the same code length. Let us look at a visual representation of the same. If we have 4 values of a given variable, say ‘Discount type’ with the following 4 values: ‘Promotion’, ‘Priority Customer’, ‘Repeat buy’ and ‘None’. If we assume that the 4 types of discounts are equally probable, they can safely be mapped to codewords of equal length, say 00, 01, 10 and 11, as shown in the image below:
Figure 3: Fixed Length Encoding: Every data element here is assumed to have equal likelhood of occurrence. Hence, every code word have equal and fixed length (i.e. 2)
Now, if we know that discounts under ‘Promotion’ are more frequent than the rest and have been assigned the following probability values.
Figure 4: Varying probabilities of each data element indicating the likelihood of occurrence of the event
We can then, design codes weighted according to these probability values. We design the code to reduce the total length of the message. Since one would transmit the most frequent terms (data elements) most number of times, you would want to associate the least length with the most frequent term. Hence, we design a code with the least number of bits for the most frequent data element. Let us look at an example of the same.
Figure 5: Variable Length Encoding: Every data element has been mapped to a sequence of binary codes of varying length according to the probability of occurrence
This variable-length code will now represent each data element in a unique and exclusive manner. Moreover, if you consider the vertical axis to visualize the probability of each word, p(x), and the horizontal axis to visualize the length of the corresponding codeword, L(x), let us compute the area covered by the encoding. This amounts to 1.75 bits, which is a reduction from a fixed-length encoding of 2 bits each for all the above terms. The fixed-length encoding would amount to 2 bits.

Optimal Encoding

The most optimal encoding has to strike a balance between the length of the message and the cost of the codeword. Let us look at how one affects the other. The cost of buying a codeword of length 0 is 1. This is because you are buying all possible codewords, and hence, if you want to have a codeword of length 0, you can’t have any other codeword. The cost of a codeword of length 1, like “0” or “1”, is (1/2) because half of possible codewords start with “0” or “1” respectively. The cost of a codeword of length 2, like “01”, is (1/4) because a quarter of all possible codewords start with “01”. In general, the cost of codewords decreases exponentially with the length of the codeword.
Refer to the image below:
Figure 6: Cost of a codeword as a function of the length of the codeword
We know that the average length of the total code string is the function of the probability of each word and the length of the corresponding codeword. It can be represented as . This average length contribution is related to the cost through the length of the codeword. The amount we pay decides the length of the codeword. The length of the codeword controls how much it adds to the average message length. We can picture the two of these together in the following way:
Figure 7: Relationship between Average Length Contribution and Cost of the Codeword
Thus, we can clearly establish that shorter codewords cost higher and vice versa.
Figure 8: Costs of long and short codewords

Shannon Information Content

Let us consider a set of discrete random variables ‘X’ which contains a possible set of events { } where ‘i’ represents the range of elements contained in the set ‘X’. Shannon Information Content or Information Content is the measure of information contained in the event: . Information content actually quantifies the surprisal of each outcome of any experiment. Higher the surprisal of an outcome, higher is the information content.
Thus, the information you can gain when an event occurs which had some probability value associated with it, can be represented as:
The above equation will show that as the value of probability ‘p’ increases, the information content decreases. You get a curve that looks something like this:
Figure 9: The information content function versus probability values
The information content is measured in bits. It is also known as self information.


While designing an optimal codeword for our data elements above, we see that the average length contribution of a message and the cost of the same line up well. In other words, the height of the rectangle denoting the average length contribution of messages is almost equal to the maximum height of the exponential curve denting the cost of the codeword. Thus, if we slightly increase the length of the codeword, the message length contribution will increase in proportion to its height at the boundary, while the cost will decrease in proportion to its height at the boundary.
Figure 10: Proportionality of cost with the height of the message length contribution
We can also say that the cost to make the codeword for a data element ‘a’ shorter is p(a). Now, let us generalize this to any message ‘X’. For a message ‘X’ with the likelihood of occurrence , the cost can be approximated as too. We also know that the cost of a message of length ‘L’ is . If we invert this to obtain the length of a message that costs a given amount (cost), we get: . If we substitute the value of cost by , we derive ‘L’ as . Thus, the length of a message ‘X’ can be denoted in terms of its probability in the following equation:
As we discussed earlier, there is a limit to how short an average message can get to communicate events effectively, from a particular probability distribution ‘p’ of events ‘X’. This limit, the average message length using the best possible code, is called the entropy or Shannon entropy of ‘p’, . In other words, entropy is a measure of unpredictability of the state (X), or equivalently, of its average information content. The formula to represent entropy is as follows:
can also be denoted by where ‘p’ is the vector containing probabilities of each of the discrete events. Entropy is measured in bits too.

Conditional Entropy

The average uncertainty about the output value (say ‘Y’) given an input value (‘X’) is the conditional entropy . Here, is,
‘the residual uncertainty (entropy) of ‘Y’ given that we know the value of ‘X’’. Let us assume a channel with noise and input ‘X’ and output ‘Y’. In this case, can also be represented as . It can be represented in the following form:
The above equation can further be condensed into the following form:
You could modify this equation for a given value of the input, say in the following manner:

Cross Entropy or Joint Entropy

If X and Y are discrete random variables and p(X,Y) is the value of their joint probability distribution at (, y), then the joint entropy or cross entropy of X and Y is:
Let’s say you have two discrete distributions: p(x) and q(x) with common elements. One way of defining cross entropy would be:
the average length of code for communicating an event from one distribution with the optimal code for another distribution is called the cross-entropy. It can be represented so:
For continuous distributions, you can simply replace the discrete summation with integral functions. Cross-entropy is an important measure. It gives us a way to show how different two probability distributions are. The more different the distributions ‘p’ and ‘q’ are, the more the cross-entropy of ‘p’ with respect to ‘q’ will be bigger than the entropy of ‘p’.

Mutual Information

Let us consider a channel with inputs ‘x’, output ‘y’ and noise .
Figure 11: Channel design demonstrating an ideal communication channel
In the above channel and are the marginal entropies of the inputs and outputs respectively. and are the conditional entropies of the inputs given the outputs and vice versa respectively. We can now derive mutual information from this system. The mutual information between two variables, such as a channel input ‘x’ and output ‘y’, is the average amount of information that each value of ‘x’ provides about ‘y’. It can represented so:
In terms of 2 two random variables ‘X’ and ‘Y’ , whose joint distribution is defined by , mutual information can be represented in the following form:
Another metric that can be derived from the mutual information is the variation of information. The variation of information is the information which isn’t shared between these two variables (‘x’ and ‘y’). We can define it like so:
The variation of information between two variables is zero if knowing the value of one tells you the value of the other and increases as they become more independent.

Conditional Mutual Information

The conditional mutual information is defined as the expected value of the mutual information of two random variables given the value of a third. For random variables X, Y, and Z, it can be represented as . Let us look at a representation of these three variables in the form of a Venn diagram:
Figure 12: Channel design demonstrating an ideal communication channel
Mathematically, it can be defined as:
This can be simplified to:
The above entity is greater than ‘0’ for discrete, jointly distributed random variables X, Y and Z. The above probability mass functions represented by ‘p’ can be evaluated using conventional definitions of probability.

Multivariate mutual information

Multivariate mutual information (MMI) or Multi-information is the amount of information about Z which is yielded by knowing both X and Y together is the information that is mutual to Z and the X,Y pair, written as . In the above Venn diagram, it is the central part of the diagram. Mathematically, a general form of multi-information can be represented as a sum of entropies:
Positive MMI is typical of common-cause structures. For example, clouds cause rain and also block the sun; therefore, the correlation between rain and darkness is partly accounted for by the presence of clouds, . The result is positive MMI . A condition where indicates a negative MMI .

Channel Capacity

The measure channel capacity basically helps us answer how fast can we transmit information over a communication channel. One of the most common and convenient channels used is the additive channel. This channel adds noise ( ) to the encoded input values as they pass through the channel, so that the channel output is a noisy version ‘y’ of the channel input ‘x’. It looks something like this:

The channel capacity ‘C’ is the maximum amount of information that a channel can provide at its output about the input. It is often represented as:
where is the mutual information of the inputs and the output.
The rate at which information is transmitted through a channel can be determined using the entropies of three variables:
  1. the entropy of the input,
  2. the entropy of the output,
  3. the entropy of the noise in the channel.
A output entropy is high then this provides a large potential for information transmission. It depends on the input entropy and the level of noise. If the noise is low, then the output entropy can be as close to the channel capacity. his further explains and reinforces equation [12].However, channel capacity gets progressively smaller as the noise increases. Capacity is usually expressed in bits-per-usage (bits per output), or bits-per-second (bits/s). The Shannon-Hartley theorem states that the channel capacity © is given by:
Here, is the signal-to-noise ratio of the channel. ‘B’ is the bandwidth of the channel in Hertz and C is represented in bits-per-second.

Information Divergence

Information divergence is a measure of dissimilarity of content of two probability distributions. This is a general concept that is used across various domains to compare distributions and the information content of the same. In a lot of machine learning objectives, it can be used as an approximation of the form , where is the observed data (input) and is the approximation given by the model.
There is a large variety of information divergences. Most of them are distance metrics. We are covering one of them in the next section.

Relative entropy or Kullback–Leibler divergence

KL divergence stands for Kullback-Leibler divergence. If we have two distributions and , defined for the same set of events, KL divergence helps you determine the difference between these distributions. It is closely related to relative entropy, and information divergence. It is a non-symmetric measure of the difference between two probability distributions. This means that the divergence of from , denoted by or , is a measure of the information lost when q(x) is used to approximate p(x). It implies that . For two discrete distributions, such that and , it can be represented in the following way:
Thus, KL divergence or relative entropy measures the expected number of extra bits required to code samples from when using a code based on , instead of using a code based on . Since KL divergence is a measure of dissimilarity or a distance, it is closely related to cross entropy and mutual information. If you remember from the last section, ‘variation of information’ quantifies the notion of distance, between different variables. However, it differs slightly from KL divergence.
KL divergence gives us a distance between two distributions over the same variable or set of variables. In contrast, variation of information gives us distance between two jointly distributed variables. KL divergence talks about divergence between distributions, while variation of information does the same within a distribution.
KL divergence is not a symmetric measure (i.e. ), we get two objective functions to optimize while approximating p(x) in a loss-less manner efficiently. These functions are:
  1. Minimizing the forward KL divergence:
  2. Minimizing the reverse KL divergence:
Both the above optimization functions actually cause different types of approximations. Let us look at each one of them in a little detail:

Forward KL

Let us start with the base equation of Forward KL:
Let us start expanding equation [14]. On substituting the value of KL divergence, with its expansion from equation [12]:

This can also be written as:
In the above equation, the first element is the cross entropy between P and Q (also denoted by H(p,q) while the second element is the entropy of ‘P’(also represented as ). Since the entropy of is fixed the final equation takes the following form:
The above equation looks similar to the maximum likelihood estimation objective in machine learning exercises. This objective will sample points from p(X) and try to maximize the probability of occurrence of these points under q(X). mean-seeking behaviour, because the approximate distribution Q must cover all the modes (frequent events) and regions of high probability in P.
In short, **“Wherever P has high probability, Q must also have high probability.”**Let us look at an example of an approximate distribution for the same:
Figure 13: Approximate distribution for minimization of forward KL divergence
In the above diagram, the approximate distribution Q centers itself between the two modes of P, so that it can have high coverage of both. The ‘forward KL divergence’ optimization technique does not penalize Q for having high probability mass where P does not.

Application of forward KL divergence

Now, in supervised learning techniques employing ‘empirical risk minimization’, we use a dataset of samples from some ground-truth data distribution . In supervised learning we aim to learn a model that maps data elements X to Y in the following manner: . This model should eventually minimize the empirical risk of the model, which is determined by a parameter defining loss of information, also called *loss function"" . A loss function is a mapping function that associates an event to a real number or a value with some “cost”. We will talk about the loss functions in detail later. We need to optimize the loss function (reduce the cost) over some distribution of models . This creates an optimization objective function as follows:

Optimizing this objective is equivalent to minimizing the divergence from an approximate distribution( ) to the true data distribution( ). This concept can be extended to both regression and classification in the following ways:
  1. Regression with Mean-Squared Error Loss: In regression, one of the significant loss functions is the mean-squared error. We aim to reduce the loss in order to get the most accurate predictions. If the distribution of your estimations can be represented as which is normally distributed. The negative log-likelihood of this normal distribution can be defined as: , where are the results from the regression function and are the actuals. Minimizing the negative log-likelihood of this normal distribution is hence, equivalent to the mean-squared error loss.
  2. Classification with Cross Entropy Loss: In this case, the approximate distribution (Q) is the result of the classification model. It is represented as a discrete event distribution parameterized by a probability vector (probability of an event belonging to a particular class). In classification you aim to reduce the cross entropy loss(or log loss) of the predicted results against the ground-truth.
The above logic can be applied to any loss function. This way, we can improve the accuracy of the machine learning models. This concept is widely applied to supervised learning techniques.

Reverse KL

Let us start with the base equation of Reverse KL:
The equation for reverse KL can be written as:

The above equation will sample points from Q and try to maximize the probability of these points under P. The entropy term encourages the approximate distribution to be as wide as possible. A good approximation under the reverse KL objective thus satisfies the below condition: "Wherever Q has high probability, P must also have high probability."
It is known as the mode-seeking behaviour because any sample from the approximate distribution Q must lie within a mode of P (since it’s required that samples from Q are highly probable to occur under P). Let us look at an approximate distribution to visualize the same.
Figure 14: Approximate distribution for minimization of reverse KL divergence
As we see here, the approximate distribution essentially encompasses the right mode of P. The reverse KL divergence does not penalize Q for not placing probability mass on the other mode of P.

Application of reverse KL divergence

Reverse KL divergence finds its application in reinforcement learning. The maximum-entropy reinforcement learning objective which is used in reinforcement learning models uses this principle extensively.

Differential Entropy

While applying statistics to information theory, we come across a concept of maximum entropy probability distributions. Let us begin with a binomial event of flipping a coin. If a random variable ‘X’ represents the toss of a fair coin we have, and with . If we were to compute the entropy of all possible probability values, we will obtain a distribution.

Loss functions

Loss functions are basically just objective functions that are applied to a lot of machine learning models. The functions take from the concepts of information theory. While a lot of loss functions are designed on the basis of information content or the entropy of datasets, there are a few others that use simple mathematical operations to measure the accuracy and performance of machine learning models. Let us refer to this image below:
Figure 15: Some key loss functions in classification and regression models
Let us talk about a few loss functions in detail.

Cross Entropy Loss and Log Loss

To understand the definition and application of this loss function, let us first understand that it is used in classification problems. This entropy-based loss function is widely used to measure the performance of classification algorithms that give the probability of a record belonging to a particular class as an output. Cross entropy is the more generic form of another loss function, called the logarithmic loss or log loss, when it comes to machine learning algorithms. While log loss is used for binary classification algorithms, cross-entropy serves the same purpose for multiclass classification problems. In other words, log loss is used when there are 2 possible outcomes and cross-entropy is used when there are more than 2 possible outcomes.
These loss functions quantify the price paid for the inaccuracy of predictions in classification problems by penalizing false classifications by taking into account the probability of classification. The generic equation of cross entropy loss looks like the following:
Here, ‘M’ is the number of outcomes or labels that are possible for a given situation and ‘N’ is the number of samples or instances in the dataset. ‘ ’ is the model’s probability of assigning label ‘j’ to instance ‘i’ and ‘ ’ depicts the outcome of the i-th instance with respect to the possible labels (j). Now, if you see the equations looks similar to the cross entropy equation above. Let us take an example for the same.
Let us assume a problem statement where one has to predict the range of grades a student will score in an exam given his attributes. If there are three possible outcomes: High, Medium and Low represented by [(1,0,0) (0,1,0) (0,0,1)]. Now, for a particular student, the predicted probabilities are (0.2, 0.7, 0.1). This indicates the predicted range of scores will most likely be ‘Medium’ as the probability is the highest there. Hence, the cross-entropy loss would be given as .
The equation for cross-entropy loss (equation [22]) can be converted into log loss, by accounting for only two possible outcomes. Thus the formula looks something like this:

Following the same conventions as equation [eq:1], here depicts the outcome (the class) of the i-th instance. is the probability (outcome of the classifier) of the i-th instance assuming the value ‘ ’. So if assumes the value ‘1’, would be equal to 0: the two classes of a binary classification problem.
We have already spoken about KL divergence as a loss function for both regression and classification problems. Let us look at a few other loss functions.

Focal Loss

Focal loss is an improvement of cross-entropy loss often used for object detection and neural networks. In problems like object detection, a major issue is caused by class imbalance due to the presence of large number of easily-classified background examples. Focal loss is one such loss function that tries to accommodate for this class imbalance.

Here, is defined the following way:

This basically indicates the two possible outcomes of a binary classifier.
This loss function is a dynamically scaled log loss function, where the scaling factor decays to zero as confidence in the correct class increases. This scaling factor can automatically down-weight the contribution of easy examples (examples present in large numbers) during training and rapidly focus the model on hard examples (sparse examples). Focal loss enables the user to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with hard example mining. Focal loss can be depicted used the below graph:
Figure 16: Focal loss adds a scaling factor to standard cross entropy loss
Here, we see that the focal loss function adds a factor to the standard binary loss function. The relative loss for well-classified easy examples (having ) is reduced for all . As the value of increases this loss further reduces towards the tail of the plots. In the above curve, the line corresponding to represents the cross entropy (CE) loss or more specifically, the log loss function.
There are several other functions that use probability-based concepts, not necessarily based on entropy and information content. You can explore their use and application according to your requirements and the kind of model you are trying to evaluate.

Learning Rate

Learning rate, also called step-size, is a model-hyperparameter that tells us how to adjust the weights of our model network with respect to the loss gradient function. This concept comes from the optimization functions we saw previously. When applied to machine learning, learning rate helps determine how to change the parameters of a model for it to converge the best, or have the most accuracy. In other words, if we have a hypothesis or model represented in the following manner:

We have to alter the weights