# Cross Entropy

## Cross-entropy and the Cross-entropy Error Function

**Cross-entropy**

Cross entropy "is the average number of bits needed to encode data coming from a source with distribution \(p\) when we use model \(q\)" (Murphy, 2012, p. 58).

In contrast to the formula for the Shannon entropy, the formula for the cross entropy involves two distributions \(p(X)\) and \(q(Y)\) with the same support or set of events ("*alphabet*") \(x, y \in \{1, 2, ..., m\}\):

$$ H_b(X,Y) = H_b(p,q) := - \sum_{j=1}^m p(x_j) \log_b q(y_j).$$
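To make the definition concrete, here is a minimal sketch in Python (the distributions `p` and `q` are made up for illustration; terms with \(p(x_j) = 0\) are skipped, following the convention \(0 \log 0 = 0\)):

```python
import math

def cross_entropy(p, q):
    """H_2(p, q) = -sum_j p_j * log2(q_j): the average number of bits
    needed to encode data from source p using a code built for model q."""
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.25, 0.25]   # source distribution (illustrative)
q = [0.25, 0.5, 0.25]   # model distribution (illustrative)
print(cross_entropy(p, q))  # → 1.75 bits
```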

The identifiers "*alphabet*" and "*ensemble*" are used by MacKay (2003, p. 22). An "*ensemble*" is what is usually (Pollard, 2010, p. 18) known as a *probability space* \( (\Omega, \mathcal{F}, \mathbb{P})\): \(\Omega\) is the *set of outcomes*, \(\mathcal{F}\) the *sigma algebra* or *sigma field of events*, and \(\mathbb{P}\) the *probability measure*. The "*alphabet*" is the *set of basic events* and therefore a subset of \(\mathcal{F}\).

The derivation of \(H_b(p,q)\) parallels that of the entropy. If the probability \(q(y_j)\) of the *outcome* or *realization* \(y_j\) is replaced by its reciprocal \(\frac{1}{q(y_j)}\) and the base is set to \(b=2\), the **information content** (or **surprisal**) \(I_2(y_j)\) of the outcome \(y_j\) is defined to be

$$ h_2(y_j) = I_2(Y=y_j) := \log_2 \frac{1}{q(y_j)} $$

bits.

The cross entropy \(H_2(X, Y)\) of the probability mass functions (PMFs) \(p\) and \(q\) is then the expected value

$$ H_2(X,Y) = E_p[I_2(Y)] := \sum_{j=1}^m p(x_j) \log_2 \frac{1}{q(y_j)} = \sum_{j=1}^m p(x_j) (\log_2 1 - \log_2 q(y_j)) = -\sum_{j=1}^m p(x_j) \log_2 q(y_j), $$

where \(m\) is the number of categories in the "alphabet" (the set of events, values, or realizations of \(X\) or \(Y\)). Note that the expectation is taken not with respect to \(q\) but with respect to \(p\).
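Because the weighting is done by \(p\), swapping the two distributions changes the result; a quick numerical check (the distributions are made up for illustration):

```python
import math

def cross_entropy(p, q):
    # the expectation is taken under p, the source distribution
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
# cross entropy is not symmetric: H(p, q) != H(q, p) in general
print(cross_entropy(p, q))
print(cross_entropy(q, p))
```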

If the model fit is perfect, i.e. \(q = p\), the cross entropy is identical to the (self-)entropy:

$$ H_b(p,q) = H_b(p, p) = H_b(p). $$
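This identity is easy to check numerically; a small sketch (the distribution is made up for illustration):

```python
import math

def entropy(p):
    """Shannon entropy H_2(p) in bits."""
    return -sum(pj * math.log2(pj) for pj in p if pj > 0)

def cross_entropy(p, q):
    """Cross entropy H_2(p, q) in bits."""
    return -sum(pj * math.log2(qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.25, 0.25]  # illustrative distribution
# with a perfect model q = p, the cross entropy reduces to the entropy
print(cross_entropy(p, p), entropy(p))  # → 1.5 1.5
```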

**Cross-entropy Error Function**

Cross-entropy can be used as a loss or error function, e.g. in logistic regression or in neural networks used as classification models.

The cross-entropy error function is based on Bernoulli variables and Bernoulli experiments. We begin with a single binary variable; usually this is understood as the sample space of a coin-flipping experiment. Here we think of a binary *target* variable \(t \in \{0, 1\}\) designating class membership: \(t = 0\) designates samples from class \(0\) and \(t = 1\) samples from class \(1\).

The derivation of the *cross-entropy error function* from the negative log-likelihood of linear logistic regression can be found in Bishop (2009, p. 206).
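For the binary case described above, the error over \(N\) samples takes the form \(E = -\sum_{n=1}^N \left[ t_n \ln y_n + (1-t_n) \ln (1-y_n) \right]\), where \(y_n\) is the model's predicted probability of class \(1\). A minimal sketch (the labels and predicted probabilities are made up for illustration):

```python
import math

def binary_cross_entropy(targets, probs):
    """E = -sum_n [t_n*ln(y_n) + (1 - t_n)*ln(1 - y_n)], natural log."""
    return -sum(t * math.log(y) + (1 - t) * math.log(1 - y)
                for t, y in zip(targets, probs))

targets = [1, 0, 1]          # class labels t_n (illustrative)
probs   = [0.9, 0.2, 0.6]    # model outputs y_n = P(t = 1 | x_n) (illustrative)
print(binary_cross_entropy(targets, probs))
```

Confident predictions that match the label (e.g. \(y = 0.9\) for \(t = 1\)) contribute little to the error, while confident mistakes are penalized heavily.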

**References**

Bishop, C.M., Pattern Recognition and Machine Learning, Springer, 2009

MacKay, David J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, http://www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22)

Murphy, K.P., Machine Learning - A Probabilistic Perspective, MIT Press, 2012

Pollard, D., A User's Guide to Measure Theoretic Probability, Cambridge University Press, 2010

------------------------------------------------------------------------------

This is a draft. Any feedback or bug report is welcome. Please contact:

------------------------------------------------------------------------------