Cross-entropy and the Cross-entropy Error Function
Cross-entropy
Cross entropy "is the average number of bits needed to encode data coming from a source with distribution p when we use model q" (Murphy, 2012, p.58).
In contrast to the formula for the Shannon entropy, which involves a single distribution, the formula for the cross entropy involves two distributions p(X) and q(Y) with the same support or set of events ("alphabet"), x, y ∈ {1, 2, ..., m}:
$$H_b(X,Y) = H_b(p,q) := -\sum_{j=1}^{m} p(x_j)\,\log_b q(y_j).$$
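A minimal sketch of this formula in Python (the function name cross_entropy and the example PMFs are illustrative assumptions, not taken from the references):

    import numpy as np

    def cross_entropy(p, q, base=2):
        # H_b(p, q) = -sum_j p(x_j) * log_b q(y_j)
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        mask = p > 0                      # categories with p(x_j) = 0 contribute nothing
        return -np.sum(p[mask] * np.log(q[mask])) / np.log(base)

    p = [0.5, 0.25, 0.25]                 # source distribution p
    q = [0.4, 0.3, 0.3]                   # model distribution q over the same alphabet
    print(cross_entropy(p, q))            # H_2(p, q) in bits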
The identifiers "alphabet" and "ensemble" are used by MacKay (2003, p.22). "Ensemble" is what is usually (Pollard, 2010, p.18) known as a probability space (Ω, F, P): Ω is the set of outcomes, F the sigma algebra (sigma field) of events, and P the probability measure. The "alphabet" is the set of basic events and therefore a subset of F.
The derivation of H_b(p,q) parallels that of the entropy. If the probability q(y_j) of the outcome or realization y_j is replaced by its surprisal 1/q(y_j) and the base b is set to b = 2, the information content I_2(y_j) of an outcome y_j is defined to be

$$h_2(y_j) = I_2(Y = y_j) := \log_2 \frac{1}{q(y_j)}$$

bits.
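For instance, a rare outcome is more surprising and therefore carries more information than a common one; a small Python sketch (the function name information_content is an illustrative choice):

    import numpy as np

    def information_content(prob, base=2):
        # h_b = log_b(1 / prob) = -log_b(prob)
        return -np.log(prob) / np.log(base)

    print(information_content(0.5))       # a fair coin flip carries 1 bit
    print(information_content(0.125))     # a 1-in-8 outcome carries 3 bits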
The cross entropy H_2(X,Y) of the probability mass functions (PMFs) p and q is then the expected value

$$H_2(X,Y) = E_p[I_2(X,Y)] := \sum_{j=1}^{m} p(x_j) \log_2 \frac{1}{q(y_j)} = \sum_{j=1}^{m} p(x_j) \left( \log_2 1 - \log_2 q(y_j) \right)$$

$$= \sum_{j=1}^{m} p(x_j) \left( 0 - \log_2 q(y_j) \right) = \sum_{j=1}^{m} -p(x_j) \log_2 q(y_j) = -\sum_{j=1}^{m} p(x_j) \log_2 q(y_j),$$

where m is the number of categories in the "alphabet" (the set of events, values or realizations of X or Y). Note that the expectation is taken with respect to p, not q.
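A quick numeric check of this identity, using hypothetical example PMFs p and q over the same three-element alphabet:

    import numpy as np

    p = np.array([0.5, 0.25, 0.25])       # source distribution p
    q = np.array([0.4, 0.3, 0.3])         # model distribution q
    surprisal_q = -np.log2(q)             # -log2 q(y_j) for each category
    print(np.sum(p * surprisal_q))        # expectation of the surprisal under p
    print(-np.sum(p * np.log2(q)))        # closed-form sum; the same value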
If the model fit is perfect, i.e. q = p, the cross entropy reduces to the (self-)entropy:

$$H_b(p,q) = H_b(p,p) = H_b(p).$$
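For example, with the hypothetical PMF p = (1/2, 1/4, 1/4) and q = p,

$$H_2(p,p) = -\left( \tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4} \right) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 1.5 \text{ bits},$$

which is exactly the entropy H_2(p).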
Cross-entropy Error Function
Cross-entropy can be used as a loss or error function in, e.g., logistic regression or neural networks when these are used as classification models.
In this role cross-entropy is based on Bernoulli variables and Bernoulli experiments. We begin with a single binary variable. Usually this is understood as the sample space of a coin-flipping experiment; here we think of it as a binary target variable t ∈ {0, 1} designating class membership: t = 0 designates samples from class 0 and t = 1 samples from class 1.
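A minimal Python sketch of the resulting binary cross-entropy error function (the negative log-likelihood discussed next, here averaged over the samples, a common convention); the function name, the clipping constant eps and the example arrays are illustrative assumptions:

    import numpy as np

    def binary_cross_entropy(t, y, eps=1e-12):
        # mean of -[t * ln(y) + (1 - t) * ln(1 - y)] over the samples
        t = np.asarray(t, dtype=float)
        y = np.clip(np.asarray(y, dtype=float), eps, 1.0 - eps)   # avoid log(0)
        return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    t = np.array([1, 0, 1, 1])            # binary class labels
    y = np.array([0.9, 0.2, 0.7, 0.4])    # model's predicted P(t = 1)
    print(binary_cross_entropy(t, y))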
The derivation of the cross-entropy error function from the negative log-likelihood of the logistic regression model can be found in Bishop (2009, p.206).
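In outline: assuming N independent observations with targets t_n ∈ {0, 1} and model outputs y_n = p(t_n = 1 | x_n), the Bernoulli likelihood is

$$p(\mathbf{t}) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},$$

and taking the negative logarithm gives the cross-entropy error function

$$E = -\sum_{n=1}^{N} \left( t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right).$$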
References
Bishop, C.M., Pattern Recognition and Machine Learning, Springer, 2009
MacKay, D.J.C., Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, www.inference.org.uk/itprnn/book.pdf (visited 2018/08/22)
Murphy, K.P., Machine Learning: A Probabilistic Perspective, MIT Press, 2012
Pollard, D., A User's Guide to Measure Theoretic Probability, Cambridge University Press, 2010
------------------------------------------------------------------------------
This is a draft. Any feedback or bug report is welcome. Please contact:
------------------------------------------------------------------------------