Shannon Entropy

Shannon Entropy

Shannon Entropy, a Measure of Uncertainty


In information theory entropy is a measure of uncertainty about the outcomes of a random experiment, the values of a random variable, or the dispersion or a variance of a probability distribution. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message.

The formula for information entropy was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".

$$ H_b(X) = E[I_b(X)] = - \sum_{i=1}^n p(x_i) \log_b p(x_i)$$
If $$p(x_i)$$ is replaced by its surprise $$\frac{1}{p(x_i)}$$ and the basis b is set to b=2, the entropy H(X) of random variable X can be rewritten to

$$ H_2(X) = E[I_2(X)] = \sum_{i=1}^n p(x_i) \log_2 \frac{1}{p(x_i)} $$

, where n is the number of values of X, and I2(xi) is the information content in bits of value or state X = xi

$$I_2(X=x_i) = \log_2 \frac{1}{p(x_i)} $$

In the case of uniform probability distributions the semantics of I2(xi) can be defined as the number of binary Yes/No-questions an omniscient and benevolent agent has to answer so that the questioning agent can identify the state or value xi of the random variable X.

In the case of a uniform probability distribution H2(X) simplifies to

$$ H_2(X) = E[I_2(X)] = p(x_i) \sum_{i=1}^n \log_2 \frac{1}{p(x_i)} = \frac{1}{n} \sum_{i=1}^n \log_2 n = \frac{1}{n} n \log_2 n = \log_2 n$$

We provide some examples in the following table:

$$\text{Entropy of Some Simple Uniform Probability Distributions}$$

$$\text{experimental device,}$$

$$\text{random variable X, and sample space }\Omega$$

$$n=\frac{1}{p(x_i)}$$ $$p(x_i)=\frac{1}{n}$$ $$H_2(X)=\log_2 n$$

$$\text{fair coin }X:\Omega \rightarrow\left\{\frac{1}{2},\frac{1}{2}\right\}$$


$$\text{2}$$ $$\frac{1}{2}$$


$$\text{two fair i.i.d. coins or two flips of one fair coin}$$

$$\Omega = \Omega_1 \times \Omega_2=\{(h,h),(h,t),(t,h),(t,t)\}$$

$$\text{4}$$ $$\frac{1}{2}\frac{1}{2}=\frac{1}{4}$$


$$\text{fair 8-sided dice}$$

$$\text{fair Octahedron}$$


$$\text{8}$$ $$\frac{1}{8}$$


$$\text{fair 6-sided dice}$$

$$\text{fair Cube}$$


$$\text{6}$$ $$\frac{1}{6}$$



The last example 'fair 6-side dice' needs a more detailled explanation. H(X) = ld 6 = 2.585 means that after a random experiment (throw of the dice) the agent has to answer at least in the mean 2.585 yes/no-questions to guess the upper side (=state) of the dice. Because this is an expectation it can be a real number. To guess each possible state the total number of questions needed is at least 2.585 * n = 2.585 * 6 = 15.51 questions.

If we ask the agent only 15 questions we make at least one error in the identification task. So we have to ask at least 16 appropriate questions. Even the knowledge about the constraint 'values of opposite sides sum to seven' does not reduce the number of necessary questions. Entropy determines an upper bound of the expected information a probability distribution can provide or determines a lower bound on the necessary number of yes/no questions to identify all values of a random experiment controlled by the distribution. This relation between entropy and shortest coding length is a general one. The 'noiseless coding theorem' (Shannon, 1948) states that the entry is a lower bound on the number of bits needed to transmit the state of a random variable (Bishop, 8/e, 2009, p.50).

The sequence of binary yes/no questions can be mapped to a binary 1/0 code word w. The expected length of code words E(length(W)) is greater than or equal to the entropy of the underlying distribution. 

We give three examples : a sequential binary strategy (Fig. 1) and two strategies controlled by binary trees (Figs. 2, 3) generated by the Huffman algorithm (Abelson, Sussman, and Sussman, 1996, ch. 2.3.4). Take for example strategy 1 (Fig. 1). To identify X = 1 we need 1 question. Whereas the identification of X = 6 requires 5 questions. The expectation is 3.333... bits. This is also the expected word length E(Wl) of the variable length prefix code generated by that tree. These are codes with varying length of code words. No shorter word is the prefix of a longer word.

Following the two other strategies (Fig. 2, Fig. 3) the expected number of questions is with 2.666... bits shorter than the expectation of strategy 1. Of course this value is greater than the entropy with 2.585 bits.

References and Further Reading

Shannon, C. E., A Mathematical Theory of Communication. Bell System Technical Journal, 1948, 27: 379-423, 623-656. - This paper was the birth of information theory -

Bishop, Christopher M., Pattern Recognition and Machine Learning, Heidelberg&New York: Springer, 8/e, 2009 - a standard reference for machine learning -

Dümbgen, Lutz, Stochastik für Informatiker, Heidelberg-New York: Springer, 2003 - Stochastic theory written by a mathematical statistician (unfortunately only in German) with some examples from CS (eg. entropy, coding theory, Hoeffding bounds, ...) -

MacKay, David J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, (visited, 2018/08/22) - The view of a well respected and authorative (see e.g. Pearl, 2018, The Book of Why, p.126) physicist on early Bayesian machine learning, coding theory, and neural nets; interesting how a physicist treats stochastics -

Abelson, H., Sussman, G.J., & Sussman, Julie, Structure and Interpretation of Computer Programs, Cambrige, MA: MIT Press, 1996, 2/e - the 'bible' of software engineers, lays the foundation for functional probabilistic programming languages like CHURCH, WebCHURCH, and WebPPL ; provides a chapter on Huffman codes -




WebChurch Code

(Changed: 2020-01-23)