# Cross Entropy

## Cross-entropy and the Cross-entropy Error Function

**Cross-entropy**

Cross entropy "is the average number of bits needed to encode data coming from a source with distribution \(p\) when we use model \(q\)" (Murphy, 2012, p.58).

In contrast to the formula for the (Shannon) entropy, the formula for cross entropy involves two distributions \(p\) and \(q\) with the same support or set of events ("*alphabet*") \(\{x_1, x_2, \ldots, x_m\}\):

$$ H_b(X,Y) = H_b(p,q) := - \sum_{j=1}^m p(x_j) \log_b q(x_j), $$

where \(X \sim p\) and \(Y \sim q\).

The terms "*alphabet*" and "*ensemble*" are used by MacKay (2003, p.22). "*Ensemble*" is what is usually (Pollard, 2010, p.18) known as a *probability space* \( (\Omega, \mathcal{F}, \mathbb{P})\): \(\Omega\) is the *set of outcomes*, \(\mathcal{F}\) the *sigma algebra* or *sigma field of events*, and \(\mathbb{P}\) the *probability measure*. The "*alphabet*" is the *set of basic events* and therefore a subset of \(\mathcal{F}\).
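To make the definition above concrete, here is a minimal sketch in Python; the function name `cross_entropy` and the example distributions are illustrative and not taken from the cited sources.

```python
import math

def cross_entropy(p, q, base=2.0):
    """Cross entropy H_b(p, q) = -sum_j p(x_j) * log_b q(x_j) for two
    discrete distributions given as probability lists over the same alphabet."""
    return -sum(pj * math.log(qj, base) for pj, qj in zip(p, q) if pj > 0.0)

# Data distribution p and model distribution q over a three-letter alphabet.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(cross_entropy(p, q))  # 1.75 bits; the entropy H_2(p) would be 1.5 bits
```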

The derivation of \(H_b(p,q)\) is similar to that of the entropy. If the probability \(q(x_j)\) of the *outcome* or *realization* \(x_j\) is replaced by its reciprocal \(\frac{1}{q(x_j)}\) and the base is set to \(b = 2\), the *information content* (or **surprisal**) of the outcome \(x_j\) under the model \(q\) is defined to be

$$ h_2(x_j) = I_2(Y = x_j) := \log_2 \frac{1}{q(x_j)} $$

bits.

The cross entropy \(H_2(X, Y)\) is then the expected information content of the model \(q\) under the data distribution \(p\):

$$ H_2(X,Y) = E_p[I_2(Y)] := \sum_{j=1}^m p(x_j) \log_2 \frac{1}{q(x_j)} = \sum_{j=1}^m p(x_j) \left(\log_2 1 - \log_2 q(x_j)\right) $$

$$ = \sum_{j=1}^m p(x_j) \left(0 - \log_2 q(x_j)\right) = -\sum_{j=1}^m p(x_j) \log_2 q(x_j), $$

where \(m\) is the number of categories in the "alphabet" (the set of events, values or realizations of \(X\) and \(Y\)). Note that the expectation is taken with respect to \(p\), not \(q\).
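As a small worked example (the numbers are chosen only for illustration), take \(m = 2\) with \(p = (\tfrac{1}{2}, \tfrac{1}{2})\) and \(q = (\tfrac{1}{4}, \tfrac{3}{4})\):

$$ H_2(p,q) = \tfrac{1}{2}\log_2\frac{1}{1/4} + \tfrac{1}{2}\log_2\frac{1}{3/4} = \tfrac{1}{2}\cdot 2 + \tfrac{1}{2}\cdot 0.415 \approx 1.208 \text{ bits}, $$

compared with the entropy \(H_2(p) = 1\) bit; the extra \(\approx 0.208\) bits are the average cost of encoding data from \(p\) with the mismatched model \(q\).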

If the model fit is perfect, i.e. \(q = p\), the cross entropy is identical to the (self-)entropy

$$ H_b(p,q) = H_b(p, p) = H_b(p). $$
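A quick numerical check of this identity, and of the fact that the cross entropy never falls below the entropy (Gibbs' inequality), might look as follows; the distributions are again made up for illustration.

```python
import math

p = [0.7, 0.2, 0.1]   # data distribution
q = [0.5, 0.3, 0.2]   # imperfect model distribution

# Cross entropy H_2(p, q) and entropy H_2(p) = H_2(p, p), both in bits.
cross = -sum(pj * math.log2(qj) for pj, qj in zip(p, q))
entropy = -sum(pj * math.log2(pj) for pj in p)

# Gibbs' inequality: cross entropy >= entropy, with equality exactly
# when the model matches the data distribution (q = p).
assert cross >= entropy
print(entropy, cross)
```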

**Cross-entropy Error Function**

Cross-entropy can be used as a loss or error function, e.g. in logistic regression or in neural networks used as classification models.

The binary cross-entropy error function is based on Bernoulli variables and Bernoulli experiments. We begin with a single binary variable, usually understood as the sample space of a coin-flipping experiment. Here we think of a binary *target* variable \(t \in \{0, 1\}\) denoting class membership: \(t = 0\) designates samples from class \(0\) and \(t = 1\) samples from class \(1\).

The derivation of the *cross-entropy error function* from the negative log-likelihood of linear logistic regression can be found in Bishop (2009, p.206).
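A minimal sketch of this error function for \(N\) samples, \(E = -\sum_n \left[ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right]\), might look as follows; the natural logarithm is used, and the names `targets` and `predictions` as well as the `eps` clipping are illustrative choices, not part of the cited derivation.

```python
import math

def binary_cross_entropy_error(targets, predictions, eps=1e-12):
    """Cross-entropy error E = -sum_n [t_n * ln(y_n) + (1 - t_n) * ln(1 - y_n)]
    for binary targets t_n in {0, 1} and predicted probabilities y_n in (0, 1)."""
    total = 0.0
    for t, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)  # clip to avoid log(0) for saturated predictions
        total += t * math.log(y) + (1.0 - t) * math.log(1.0 - y)
    return -total

# Example: three samples with class labels and predicted class-1 probabilities.
t = [1, 0, 1]
y = [0.9, 0.2, 0.6]
print(binary_cross_entropy_error(t, y))
```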

**References**

Bishop, C.M., Pattern Recognition and Machine Learning, Springer, 2009

MacKay, D.J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22)

Murphy, K.P., Machine Learning - A Probabilistic Perspective, MIT Press, 2012

Pollard, D., A User's Guide to Measure Theoretic Probability, Cambridge University Press, 2010

------------------------------------------------------------------------------

This is a draft. Any feedback or bug report is welcome. Please contact:

------------------------------------------------------------------------------