# Cross Entropy

## Cross-entropy and the Cross-entropy Error Function

##### Cross-entropy

Cross entropy "is the average number of bits needed to encode data coming from a source with distribution $$p$$ when we use model $$q$$" (Murphy, 2012, p.58).

In contrast to the formula for the Shannon entropy, the formula for cross entropy involves two distributions $$p(X)$$ and $$q(Y)$$ with the same support or set of events ("alphabet") $$x, y \in \{1, 2, \ldots, m\}$$:

$$H_b(X,Y) = H_b(p,q) := - \sum_{j=1}^m p(x_j) \log_b q(y_j).$$

The identifiers "alphabet" and "ensemble" are used by McKay (2003, p.22). "Ensemble" is what is usually (Pollard, 2010, p.18) known as a probability space  $$(\Omega, \mathcal{F}, \mathbb{P})$$. $$\Omega$$ is the set of outcomes, $$\mathcal{F}$$ the sigma algebra or sigma field of events, $$\mathbb{P}$$ the probability measure. The "alphabet" is the set of basic events and therefor a subset of $$\mathcal{F}$$.

The derivation of $$H_b(p,q)$$ parallels that of the entropy. If the probability $$q(y_j)$$ of the outcome or realization $$y_j$$ is replaced by its reciprocal $$\frac{1}{q(y_j)}$$ and the base $$b$$ is set to $$b = 2$$, the information content ("surprisal") $$I_2(y_j)$$ of an outcome $$y_j$$ is defined to be

$$h_2(y_j) = I_2(Y=y_j) := \log_2 \frac{1}{q(y_j)}$$

bits.
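
For instance (a hedged illustration, not from the sources), an outcome with model probability $$q(y_j) = 1/8$$ carries $$\log_2 8 = 3$$ bits of information:

```python
import math

def information_content(prob, base=2):
    """Surprisal I_b(y_j) = log_b(1 / q(y_j)) of an outcome with model probability prob."""
    return math.log(1.0 / prob, base)

print(information_content(0.125))  # 3.0 bits for an outcome with probability 1/8
```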

The cross entropy $$H_2(X, Y)$$ is then the expected value of this information content, taken with respect to the probability mass function (PMF) $$p$$:

$$H_2(X,Y) = E_p[I_2(X, Y)] := \sum_{j=1}^m p(x_j) \log_2 \frac{1}{q(y_j)} = \sum_{j=1}^m p(x_j) (\log_2 1 - \log_2 q(y_j))$$

$$= \sum_{j=1}^m p(x_j) (0 - \log_2 q(y_j)) = -\sum_{j=1}^m p(x_j) \log_2 q(y_j),$$

where $$m$$ is the number of categories in the "alphabet" (the set of events, values or realizations in $$X$$ or $$Y$$). Note that the expectation is taken over $$p(X)$$, not over $$q(Y)$$.

If the model fit is perfect, i.e. $$q = p$$, the cross entropy is identical to the (self-)entropy:

$$H_b(p,q) = H_b(p, p) = H_b(p).$$
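
A quick numerical check of this identity (a sketch with illustrative distributions, not from the sources): with $$q = p$$ the sum collapses to the self entropy, while an imperfect model $$q \neq p$$ yields a larger value.

```python
import math

def H2(p, q):
    """Cross entropy H_2(p, q) in bits; H_2(p, p) is the self entropy H_2(p)."""
    return -sum(p_j * math.log2(q_j) for p_j, q_j in zip(p, q) if p_j > 0)

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(H2(p, p))  # 1.5 bits: perfect fit, cross entropy equals the entropy
print(H2(p, q))  # 1.75 bits: imperfect model, cross entropy exceeds the entropy
```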

##### Cross-entropy Error Function

Cross-entropy can be used as a loss (error) function, e.g. in logistic regression or in neural networks used as classification models.

In the binary case the cross-entropy error function is based on Bernoulli variables and Bernoulli experiments. We begin with a single binary variable, usually understood as the sample space of a coin-flipping experiment. Here we think of a binary $$\textit{target}$$ variable $$t \in \{0, 1\}$$ designating class membership: $$t = 0$$ designates samples from class $$0$$ and $$t = 1$$ samples from class $$1$$.

The derivation of the $$\textit{cross entropy error function}$$ from the negative log-likelihood of the linear logistic regression model can be found in Bishop (2009, p.206).
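
For binary targets $$t_n \in \{0, 1\}$$ and model outputs $$y_n = \sigma(w^\top x_n + b)$$, that error function takes the form $$E(w) = -\sum_n \left[ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right]$$. The following sketch (the variable names, the toy data and the clipping constant are assumptions for illustration, not Bishop's notation) evaluates it for a handful of predictions:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def binary_cross_entropy(t, y, eps=1e-12):
    """Cross-entropy error E = -sum_n [ t_n*ln(y_n) + (1 - t_n)*ln(1 - y_n) ]
    for binary targets t in {0, 1} and predicted class-1 probabilities y."""
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Toy example: predicted probabilities from a hypothetical logistic regression model.
t = np.array([1.0, 0.0, 1.0, 1.0])             # binary targets
y = sigmoid(np.array([2.0, -1.5, 0.3, 1.0]))   # model outputs sigma(w^T x + b)
print(binary_cross_entropy(t, y))
```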

##### References

Bishop, Chr.M., Pattern Recognition and Machine Learning, Springer, 2009

MacKay, David J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, http://www.inference.org.uk/itprnn/book.pdf (visited, 2018/08/22)

Murphy, K.P., Machine Learning - A Probabilistic Perspective, MIT Press, 2012

Pollard, D., A User's Guide to Measure Theoretic Probability, Cambridge University Press, 2010

------------------------------------------------------------------------------