# Shannon Entropy

## Shannon Entropy, a Measure of Uncertainty

For a random variable $$X$$ uniformly distributed over $$n$$ outcomes, each $$p(x_i)=\frac{1}{n}$$, and the Shannon entropy $$H_2(X)=-\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$ reduces to $$\log_2 n$$:

| $$\text{experimental device: random variable }X\text{ with sample space }\Omega$$ | $$n=\frac{1}{p(x_i)}$$ | $$p(x_i)=\frac{1}{n}$$ | $$H_2(X)=\log_2 n$$ |
|---|---|---|---|
| $$\text{fair coin, }X:\Omega \rightarrow\left\{\frac{1}{2},\frac{1}{2}\right\},\ \Omega=\{head,tail\}$$ | $$2$$ | $$\frac{1}{2}$$ | $$1$$ |
| $$\text{two fair i.i.d. coins (or two flips of one fair coin), }\Omega = \Omega_1 \times \Omega_2=\{(h,h),(h,t),(t,h),(t,t)\}$$ | $$4$$ | $$\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}$$ | $$2$$ |
| $$\text{fair 8-sided die (octahedron), }\Omega=\{1,2,3,4,5,6,7,8\}$$ | $$8$$ | $$\frac{1}{8}$$ | $$3$$ |
| $$\text{fair 6-sided die (cube), }\Omega=\{1,2,3,4,5,6\}$$ | $$6$$ | $$\frac{1}{6}$$ | $$\log_2 6 \approx 2.585$$ |
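The values in the last column can be checked numerically. Here is a minimal sketch in plain JavaScript (the helper names `entropy` and `uniform` are illustrative, not from any library):

```javascript
// Shannon entropy (base 2) of a discrete distribution given as an
// array of probabilities; outcomes with p = 0 contribute nothing.
function entropy(probs) {
  return probs.reduce(function (h, p) {
    return p > 0 ? h - p * Math.log2(p) : h;
  }, 0);
}

// Uniform distribution over n outcomes: p(x_i) = 1/n.
function uniform(n) {
  return Array.from({ length: n }, function () { return 1 / n; });
}

console.log(entropy(uniform(2))); // fair coin: 1 bit
console.log(entropy(uniform(4))); // two fair coins: 2 bits
console.log(entropy(uniform(8))); // fair octahedron: 3 bits
console.log(entropy(uniform(6))); // fair cube: ~2.585 bits
```

For the uniform case `entropy` just recomputes $$\log_2 n$$; the general formula also covers non-uniform distributions.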

The last example, the fair 6-sided die, needs a more detailed explanation. H(X) = ld 6 = 2.585 (where ld denotes the base-2 logarithm) means that after a random experiment (a throw of the die) the agent has to answer on average at least 2.585 yes/no questions to identify the upper side (= state) of the die. Because this expectation is not an integer, it only becomes sharp for sequences of throws: a sequence of six throws has entropy 6 · ld 6 ≈ 15.51 bits, so if we ask the agent only 15 questions we make at least one error in the identification task. So we have to ask at least 16 appropriate questions.

The sequence of binary yes/no questions can be mapped to a binary 1/0 code word w. The expected length of code words E(length(W)) is greater than or equal to the entropy of the underlying distribution. We give three examples: a sequential binary strategy (Fig. 1) and two strategies controlled by binary trees (Figs. 2, 3) generated by the Huffman algorithm (Abelson, Sussman, and Sussman, 1996, ch. 2.3.4). Take for example strategy 1 (Fig. 1): to identify X = 1 we need one question, whereas the identification of X = 6 requires five questions. The expected number of questions is (1+2+3+4+5+5)/6 = 3.333... bits; this is also the expected word length E(length(W)) of the corresponding code. Following the two other strategies (Figs. 2, 3) the expected number of questions is, with 2.666... bits, shorter than the expectation of strategy 1. Of course this value is still greater than the entropy of 2.585 bits.
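The expected question counts of the strategies above can be reproduced with a small sketch, assuming plain JavaScript; `expectedLength` and `huffmanLengths` are illustrative helpers of my own (a minimal Huffman construction that returns only the code-word lengths, not the Abelson/Sussman implementation):

```javascript
// Expected number of yes/no questions: sum of question counts
// (code-word lengths) weighted by the outcome probabilities.
function expectedLength(lengths, probs) {
  return lengths.reduce(function (e, l, i) { return e + l * probs[i]; }, 0);
}

var p = Array(6).fill(1 / 6); // fair die

// Strategy 1 (Fig. 1): ask "Is it 1?", "Is it 2?", ... sequentially.
// Outcomes 1..5 need 1..5 questions; outcome 6 is also identified
// after the fifth question (all five answers were "no").
var sequential = [1, 2, 3, 4, 5, 5];

// Minimal Huffman construction: repeatedly merge the two least
// probable nodes; every merge adds one bit to each contained leaf.
function huffmanLengths(probs) {
  var nodes = probs.map(function (pr, i) { return { p: pr, ids: [i] }; });
  var lengths = probs.map(function () { return 0; });
  while (nodes.length > 1) {
    nodes.sort(function (a, b) { return a.p - b.p; });
    var a = nodes.shift(), b = nodes.shift();
    var ids = a.ids.concat(b.ids);
    ids.forEach(function (i) { lengths[i] += 1; });
    nodes.push({ p: a.p + b.p, ids: ids });
  }
  return lengths;
}

console.log(expectedLength(sequential, p));      // ≈ 3.333 bits
console.log(expectedLength(huffmanLengths(p), p)); // ≈ 2.667 bits
```

For six equiprobable outcomes the Huffman code lengths come out as two words of length 2 and four of length 3, giving 16/6 ≈ 2.667 expected bits, above the entropy of 2.585 bits as the source-coding bound requires.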
## References

- Shannon, C. E. (1948). A Mathematical Theory of Communication. *Bell System Technical Journal*, 27: 379-423, 623-656. This paper was the birth of information theory.
- Bishop, C. M. (2009). *Pattern Recognition and Machine Learning* (8/e). Heidelberg & New York: Springer. A standard reference for machine learning.
- Dümbgen, L. (2003). *Stochastik für Informatiker*. Heidelberg & New York: Springer. Stochastic theory written by a mathematical statistician (unfortunately only in German), with some examples from CS (e.g. entropy, coding theory, Hoeffding bounds).
- MacKay, D. J. C. (2003). *Information Theory, Inference, and Learning Algorithms*. Cambridge University Press. http://www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22). The view of a well-respected and authoritative (see e.g. Pearl, 2018, *The Book of Why*, p. 126) physicist on early Bayesian machine learning, coding theory, and neural nets; interesting how a physicist treats stochastics.
- Abelson, H., Sussman, G. J., & Sussman, J. (1996). *Structure and Interpretation of Computer Programs* (2/e). Cambridge, MA: MIT Press. The 'bible' of software engineers; lays the foundation for functional probabilistic programming languages like CHURCH, WebCHURCH, and WebPPL; provides a chapter on Huffman codes.

## WebPPL Code