Shannon Entropy
Shannon Entropy, a Measure of Uncertainty
$$\text{experimental device,}$$ $$\text{random variable }X\text{, and sample space }\Omega$$ | $$n=\frac{1}{p(x_i)}$$ | $$p(x_i)=\frac{1}{n}$$ | $$H_2(X)=\log_2 n$$ |
---|---|---|---|
$$\text{fair coin with }p(head)=p(tail)=\frac{1}{2}$$ $$\Omega=\{head,tail\}$$ | $$2$$ | $$\frac{1}{2}$$ | $$1$$ |
$$\text{two fair i.i.d. coins or two flips of one fair coin}$$ $$\Omega = \Omega_1 \times \Omega_2=\{(h,h),(h,t),(t,h),(t,t)\}$$ | $$4$$ | $$\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}$$ | $$2$$ |
$$\text{fair 8-sided die (octahedron)}$$ $$\Omega=\{1,2,3,4,5,6,7,8\}$$ | $$8$$ | $$\frac{1}{8}$$ | $$3$$ |
$$\text{fair 6-sided die (cube)}$$ $$\Omega=\{\cdot,\cdot\cdot,\cdot\cdot\cdot,::,:\cdot:,:::\}$$ | $$6$$ | $$\frac{1}{6}$$ | $$2.585$$ |
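
The values in the last column follow directly from the definition of Shannon entropy, H_2(X) = -sum_i p(x_i) log_2 p(x_i); the general definition is assumed here, since the table only uses its uniform special case. For p(x_i) = 1/n it reduces to:

$$H_2(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i) = -\sum_{i=1}^{n} \frac{1}{n}\,\log_2 \frac{1}{n} = \log_2 n$$

For the fair 6-sided die this gives log_2 6 ≈ 2.585 bits, the value explained in more detail below.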
The last example, the fair 6-sided die, needs a more detailed explanation. H(X) = log_2 6 = 2.585 means that after a random experiment (a throw of the die) the agent has to answer on average at least 2.585 yes/no questions to guess the upper side (= state) of the die. Because this is an expectation, it can be a real number. To guess each possible state, the total number of questions needed is at least 2.585 * n = 2.585 * 6 = 15.51 questions. If we ask the agent only 15 questions, we make at least one error in the identification task, so we have to ask at least 16 appropriate questions. Even the knowledge of the constraint 'values of opposite sides sum to seven' does not reduce the number of necessary questions.

Entropy determines an upper bound on the expected information a probability distribution can provide, or equivalently a lower bound on the number of yes/no questions necessary to identify the values of a random experiment controlled by that distribution. This relation between entropy and shortest coding length is a general one: the 'noiseless coding theorem' (Shannon, 1948) states that the entropy is a lower bound on the number of bits needed to transmit the state of a random variable (Bishop, 2009, p. 50).

The sequence of binary yes/no questions can be mapped to a binary 1/0 code word w. The expected length of the code words, E(length(W)), is greater than or equal to the entropy of the underlying distribution. We give three examples: a sequential binary strategy (Fig. 1) and two strategies controlled by binary trees (Figs. 2, 3) generated by the Huffman algorithm (Abelson, Sussman, and Sussman, 1996, ch. 2.3.4). Take for example strategy 1 (Fig. 1). To identify X = 1 we need 1 question, whereas the identification of X = 6 requires 5 questions; with question counts 1, 2, 3, 4, 5, 5 the expectation is (1+2+3+4+5+5)/6 = 3.333... bits. This is also the expected word length E(Wl) of the variable-length prefix code generated by that tree: a code with code words of varying length in which no shorter word is the prefix of a longer word. Following the two other strategies (Fig. 2, Fig. 3), with question counts 2, 2, 3, 3, 3, 3, the expected number of questions is (2+2+3+3+3+3)/6 = 2.666... bits, shorter than the expectation of strategy 1. Of course this value is still greater than the entropy of 2.585 bits.

References and Further Reading

- Shannon, C. E., A Mathematical Theory of Communication. Bell System Technical Journal, 1948, 27: 379-423, 623-656. (This paper was the birth of information theory.)
- Bishop, Christopher M., Pattern Recognition and Machine Learning, Heidelberg & New York: Springer, 8/e, 2009. (A standard reference for machine learning.)
- Dümbgen, Lutz, Stochastik für Informatiker, Heidelberg & New York: Springer, 2003. (Stochastic theory written by a mathematical statistician, unfortunately only in German, with some examples from CS, e.g. entropy, coding theory, Hoeffding bounds.)
- MacKay, David J. C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, http://www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22). (The view of a well-respected and authoritative (see e.g. Pearl, 2018, The Book of Why, p. 126) physicist on early Bayesian machine learning, coding theory, and neural nets; interesting how a physicist treats stochastics.)
- Abelson, H., Sussman, G. J., & Sussman, Julie, Structure and Interpretation of Computer Programs, Cambridge, MA: MIT Press, 2/e, 1996. (The 'bible' of software engineers; lays the foundation for functional probabilistic programming languages like CHURCH, WebCHURCH, and WebPPL; provides a chapter on Huffman codes.)



WebPPL Code
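
A minimal WebPPL sketch of the computations discussed above: the entropy of the fair 6-sided die and the expected number of yes/no questions for the sequential strategy (Fig. 1) and the Huffman strategies (Figs. 2, 3). The names die, entropy, and expectedLength and the explicit question-count lists are assumptions chosen for illustration, not code from the figures.

```
// Fair 6-sided die as a categorical distribution
var die = Categorical({vs: [1, 2, 3, 4, 5, 6],
                       ps: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]});

// Base-2 logarithm
var log2 = function(x) { return Math.log(x) / Math.log(2); };

// Shannon entropy: H_2(X) = -sum_i p(x_i) * log_2 p(x_i)
var entropy = function(dist) {
  return sum(map(function(x) {
    var p = Math.exp(dist.score(x));   // dist.score(x) returns log p(x)
    return -p * log2(p);
  }, dist.support()));
};

// Expected number of yes/no questions of a strategy under the uniform
// distribution, given the question count (code word length) per outcome
var expectedLength = function(lengths) {
  return sum(lengths) / lengths.length;
};

// Question counts read off the trees (assumed): Fig. 1 and Figs. 2/3
var sequentialLengths = [1, 2, 3, 4, 5, 5];  // strategy 1
var huffmanLengths    = [2, 2, 3, 3, 3, 3];  // Huffman strategies

display("H_2(die)             = " + entropy(die));                      // ~2.585 bits
display("E[length], Fig. 1    = " + expectedLength(sequentialLengths)); // 3.333...
display("E[length], Figs. 2/3 = " + expectedLength(huffmanLengths));    // 2.666...
```

Run in a WebPPL environment, this should print values close to 2.585, 3.333, and 2.667 bits, matching the numbers derived in the text.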

