Cross Entropy

Cross entropy is closely related to Shannon entropy:

Shannon entropy is defined for a given discrete probability distribution; it measures how much information is required, on average, to identify random samples from that distribution.

Shannon entropy does not define any coding scheme, but it does define how much information an optimal coding scheme for a given distribution would use, on average, to identify random samples from that distribution.
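
For reference, the Shannon entropy of a discrete distribution `D` is usually written as below. The notation is an assumption chosen to match the rest of this note: `P(D_i)` stands for the probability of state `i` under `D`, and the base of the logarithm sets the unit (base 2 gives bits, base 3 gives trits).

$$ H(D) = - \sum_i{P(D_i) \log{P(D_i)}} \tag{Shannon entropy} $$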

Consider two distributions, `D` and `\bar D`, each over the same set of states (or symbols), but with distinct probabilities assigned to those states (the letter 'D' is re-used to emphasise that these are distributions over the same set of states).

Cross entropy is defined for a pair of discrete probability distributions `D` and `\bar D`; it measures how much information is required, on average, to identify random samples from `\bar D` when using an optimal coding scheme constructed for `D`. To represent this mathematically we substitute `D` and `\bar D` into the Shannon entropy equation:

$$ H(\bar D, D) = - \sum_i{P(\bar D_i) \log{P(D_i)}} \tag{Cross entropy} $$

The log expression in the Shannon entropy equation gives (after negation) the optimal code length for a state with a given probability, so `D` is substituted there, i.e. the optimal coding scheme for `D` is in use. The left-hand probability expression gives the probability (or relative frequency) with which each state occurs, so `\bar D` is substituted there.
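
As a minimal sketch of the cross entropy sum, the following Python computes it for two small distributions given as lists of probabilities. The function name `cross_entropy`, the `base` parameter and the example numbers are illustrative assumptions, not part of any particular library.

```python
import math

def cross_entropy(p_bar, p, base=2.0):
    """Average information per sample from p_bar, using a code optimal for p."""
    if len(p_bar) != len(p):
        raise ValueError("distributions must cover the same set of states")
    total = 0.0
    for pb, q in zip(p_bar, p):
        if pb == 0.0:
            continue  # a state that never occurs contributes nothing
        if q == 0.0:
            return math.inf  # the code built for p cannot represent this state
        total -= pb * math.log(q, base)
    return total

# Hypothetical example distributions over three states.
d = [0.5, 0.25, 0.25]      # distribution the code was built for (D)
d_bar = [0.25, 0.25, 0.5]  # distribution the samples come from (D-bar)

h_d = cross_entropy(d, d)          # equals the Shannon entropy of D
h_cross = cross_entropy(d_bar, d)  # cross entropy H(D-bar, D)

print(h_d)      # about 1.5 bits
print(h_cross)  # about 1.75 bits
```

Passing the same distribution for both arguments reduces the sum to the Shannon entropy, which is why no separate entropy function is needed in this sketch.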

Cross Entropy as a Distance or Error Metric

If `\bar D` is identical to `D` then samples from `\bar D` are encoded optimally. If, however, `\bar D` differs from `D` then the average amount of information per sample rises above the minimum level defined by the Shannon entropy of `\bar D`. Hence the difference between the cross entropy and the Shannon entropy of `\bar D` can be treated as a measure of distance between two discrete distributions over the same set of states, although it is not a metric in the strict mathematical sense, because swapping `D` and `\bar D` generally changes the value. The unit of distance is 'information per sample', i.e. the amount of additional information (in whatever units are preferred, e.g. bits, trits, etc.) required, on average, to identify a sample from `\bar D`. As such, cross entropy is the definitive measure of quality for a generative model.
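
Written out with the same notation as the cross entropy equation, and taking `H(\bar D)` to denote the Shannon entropy of `\bar D`, the extra information per sample works out as follows. This quantity is commonly known as the Kullback-Leibler divergence; it is never negative and is zero exactly when the two distributions agree.

$$ H(\bar D, D) - H(\bar D) = - \sum_i{P(\bar D_i) \log{P(D_i)}} + \sum_i{P(\bar D_i) \log{P(\bar D_i)}} = \sum_i{P(\bar D_i) \log{\frac{P(\bar D_i)}{P(D_i)}}} $$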

That is, if a model can reproduce a desired target distribution exactly then it has encoded everything about the system being modelled that is needed to recreate the target distribution. If, however, the cross entropy is higher than the target distribution's Shannon entropy, that indicates the model is wrong to some degree, and the difference describes the scale of that error in the fundamental units of how much information is required to correct it.
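
To make the model-error reading concrete, here is a small Python sketch that scores two hypothetical models against a target distribution by subtracting the target's Shannon entropy from the cross entropy. The helper names and the probabilities are assumptions made up for illustration, and the sketch assumes the model assigns non-zero probability to every state the target can produce.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    return -sum(x * math.log(x, base) for x in p if x > 0.0)

def cross_entropy(p_bar, p, base=2.0):
    """Cross entropy of samples from p_bar coded with a scheme optimal for p."""
    # Assumes p > 0 wherever p_bar > 0, otherwise the result would be infinite.
    return -sum(pb * math.log(q, base) for pb, q in zip(p_bar, p) if pb > 0.0)

def excess_information(target, model, base=2.0):
    """Extra information per sample when target samples are coded using the model."""
    return cross_entropy(target, model, base) - entropy(target, base)

# Hypothetical target distribution and two candidate models.
target = [0.7, 0.2, 0.1]
exact_model = [0.7, 0.2, 0.1]
rough_model = [0.5, 0.3, 0.2]

exact_error = excess_information(target, exact_model)
rough_error = excess_information(target, rough_model)

print(exact_error)  # zero, up to floating point rounding
print(rough_error)  # positive: the rough model wastes information on every sample
```

Under these assumptions a lower score means a better model, and a score of zero means the model reproduces the target distribution exactly.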

Colin,
October 20th, 2016