Cross-entropy and the Cross-entropy Error Function
Cross-entropy
Cross entropy "is the average number of bits needed to encode data coming from a source with distribution \(p\) when we use model \(q\)" (Murphy, 2012, p.58).
In contrast to the formula for the Shannon entropy, the formula for the cross entropy involves two distributions \(p(X)\) and \(q(Y)\) with the same support or set of events ("alphabet") \(x, y \in \{1, 2, ..., m\}\):
$$ H_b(X,Y) = H_b(p,q) := - \sum_{j=1}^m p(x_j) \log_b q(y_j).$$
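As a minimal sketch of this sum, the following Python snippet computes \(H_b(p,q)\) for two illustrative PMFs over the same three-symbol alphabet; the function name and the numerical values are chosen for illustration only, not taken from the references.

```python
import math

def cross_entropy(p, q, b=2):
    """H_b(p, q) = -sum_j p(x_j) * log_b(q(y_j)) over a shared alphabet."""
    return -sum(p_j * math.log(q_j, b) for p_j, q_j in zip(p, q) if p_j > 0)

# Two PMFs over the same three-symbol alphabet (illustrative values).
p = [0.5, 0.25, 0.25]   # source distribution p
q = [0.25, 0.5, 0.25]   # model distribution q
print(cross_entropy(p, q))   # 1.75 bits
```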
The identifiers "alphabet" and "ensemble" are used by MacKay (2003, p.22). "Ensemble" is what is usually (Pollard, 2010, p.18) known as a probability space \( (\Omega, \mathcal{F}, \mathbb{P})\). \(\Omega\) is the set of outcomes, \(\mathcal{F}\) the sigma-algebra (sigma-field) of events, \(\mathbb{P}\) the probability measure. The "alphabet" is the set of basic events and therefore a subset of \(\mathcal{F}\).
The derivation of \(H_b(p,q)\) parallels that of the entropy. If the probability \(q(y_j)\) of the outcome or realization \(y_j\) is replaced by its reciprocal \(\frac{1}{q(y_j)}\)
and the base \(b\) is set to \(b=2\), the information content ("surprisal") \(I_2(y_j)\) of an outcome \(y_j\) is defined to be
$$h_2(y_j) = I_2(Y=y_j) := \log_2 \frac{1}{q(y_j)} $$
bits.
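A tiny sketch of this surprisal in bits; the probabilities below are illustrative.

```python
import math

def information_content(prob, b=2):
    """I_b(y) = log_b(1 / q(y)), the surprisal of an outcome with probability q(y)."""
    return math.log(1.0 / prob, b)

print(information_content(0.25))   # ≈ 2.0 bits: a probability-1/4 outcome carries 2 bits
print(information_content(0.5))    # ≈ 1.0 bit
```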
The cross entropy \(H_2(X, Y)\) of the probability mass functions (PMFs) \(p(X)\) and \(q(Y)\) is then the expected information content
$$ H_2(X,Y) = E_p[I_2(X, Y)] := \sum_{j=1}^m p(x_j) \log_2 \frac{1}{q(y_j)} = \sum_{j=1}^m p(x_j) (\log_2 1 - \log_2 q(y_j)) $$
$$= \sum_{j=1}^m p(x_j) (0 - \log_2 q(y_j)) = \sum_{j=1}^m - p(x_j) \log_2 q(y_j) = -\sum_{j=1}^m p(x_j) \log_2 q(y_j), $$
where \(m\) is the number of categories in the "alphabet" (the set of events, values or realizations of \(X\) or \(Y\)). Note that the expectation is taken with respect to \(p\), not \(q\).
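A small NumPy check, again with illustrative PMFs, that the expected surprisal under \(p\) equals the cross-entropy sum above:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # p(x_j), illustrative
q = np.array([0.25, 0.5, 0.25])   # q(y_j), same alphabet

expected_surprisal = np.sum(p * np.log2(1.0 / q))   # E_p[I_2]
cross_entropy = -np.sum(p * np.log2(q))             # -sum_j p(x_j) log2 q(y_j)

assert np.isclose(expected_surprisal, cross_entropy)
print(expected_surprisal, cross_entropy)            # both ≈ 1.75 bits
```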
If the model fit is perfect, i.e. \(q = p\), the cross entropy is identical to the (self-)entropy
$$ H_b(p,q) = H_b(p, p) = H_b(p). $$
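A small numerical check of this identity, using the same illustrative PMFs as above: plugging \(q = p\) into the cross-entropy sum recovers the entropy of \(p\), while a mismatched \(q\) does not.

```python
import numpy as np

def cross_entropy(p, q):
    """H_2(p, q) = -sum_j p_j log2 q_j."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # illustrative PMF
q = np.array([0.25, 0.5, 0.25])   # an imperfect model of p

entropy_p = -np.sum(p * np.log2(p))      # H_2(p) = 1.5 bits
print(cross_entropy(p, p), entropy_p)    # both 1.5: perfect fit recovers the entropy
print(cross_entropy(p, q))               # 1.75: the mismatched model needs extra bits
```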
Cross-entropy Error Function
Cross-entropy can be used as a loss or error function, e.g. in logistic regression or in neural networks used as classification models.
The cross-entropy error function is based on Bernoulli variables and Bernoulli experiments. We begin with a single binary variable; usually this is understood as the sample space of a coin-flipping experiment. Here we think of a binary \(\textit{target}\) variable \(t \in \{0, 1\}\) denoting class membership: \(t = 0\) designates samples from class \(0\) and \(t = 1\) samples from class \(1\).
The derivation of the \(\textit{cross-entropy error function}\) as the negative log-likelihood of linear logistic regression can be found in Bishop (2009, p.206).
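For a single binary target this error function takes the form \(E(\mathbf{w}) = -\sum_n \{t_n \ln y_n + (1-t_n) \ln(1-y_n)\}\) with predicted probabilities \(y_n = \sigma(\mathbf{w}^T \mathbf{x}_n)\). Below is a minimal NumPy sketch of this negative log-likelihood; the toy data, the weight vector and the small \(\epsilon\) guard are illustrative choices, not taken from Bishop.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_error(w, X, t):
    """E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ],  y_n = sigmoid(w^T x_n)."""
    y = sigmoid(X @ w)              # predicted probabilities for class 1
    eps = 1e-12                     # guard against log(0)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

# Toy data: four samples with two features each and binary targets (illustrative values).
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.5]])
t = np.array([1, 0, 1, 0])
w = np.zeros(2)                      # with w = 0 every y_n = 0.5
print(cross_entropy_error(w, X, t))  # 4 * ln 2 ≈ 2.7726
```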
References
Bishop, C.M., Pattern Recognition and Machine Learning, Springer, 2009
MacKay, D.J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22)
Murphy, K.P., Machine Learning - A Probabilistic Perspective, MIT Press, 2012
Pollard, D., A User's Guide to Measure Theoretic Probability, Cambridge University Press, 2010
------------------------------------------------------------------------------
This is a draft. Any feedback or bug report is welcome. Please contact:
------------------------------------------------------------------------------