Gradient of Auto-Normalized Probability Distributions (Annotated)

This is an annotated version of Gradient of Auto-Normalized Probability Distributions by Brandyn Webb.

Content from the original source post is shown italicised within grey blocks. The original equation tag numbers are maintained; additional supplementary equations have been given dotted decimal numbers.

Consider a model which is an auto-normalized probability distribution over the states of a discrete variable D:

$$\mathrm{P}(D_k) \equiv \frac{F_k}{\sum_m{F_m}}\tag{1}$$

where each `F_k` is a scalar valued function of some (implicit) parameters.

$\mathrm{P}(D_k)$ describes a discrete probability distribution consisting of `k` elements. `F_k` gives the relative probability of the `k`th element, i.e. `F` is a non-normalised discrete probability distribution and equation (1) normalizes the `F_k`s such that the elements of $\mathrm{P}(D_k)$ sum to one.
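As a minimal sketch of equation (1), assuming some made-up relative weights for a three-state variable (the name `normalize` and the values are illustrative and do not come from the source post):

```python
def normalize(F):
    """Equation (1): divide each relative weight F_k by the sum over all states."""
    total = sum(F)
    return [f / total for f in F]

# Hypothetical relative weights for a three-state variable D.
F = [2.0, 1.0, 1.0]
P = normalize(F)
print(P)  # [0.5, 0.25, 0.25]
```

Only the ratios between the `F_k` matter: scaling every `F_k` by the same positive constant leaves the normalised distribution unchanged.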

E.g. consider a generative model whereby the nodes of a graph are arranged into groups, each group representing a discrete variable (see [1] [2] [3]). Each node within a group represents one of the variable's possible `k` states, and the incoming signals to a node inform the probability of the state represented by that node.

Each `F_k` is a scalar value that is the result of combining all signals arriving at node `k`; the maths presented is independent of the signal combining scheme; one possible scheme is explored later.

Figure 1.
Each `D_k` is a node in a group (bounding rectangle).
`F_k`, `D_k` are node `k`'s pre and post-normalised probability respectively.

If `del_i` is the partial derivative with respect to some variable that only affects `F_i` then:

$$ \partial_i \log \mathrm{P}(D_k) = \frac{\partial_i F_k}{F_k} - \frac{\partial_i F_i}{\sum_m F_m} \tag{2}$$

where `del_i F_k = 0` when `i != k`.

Equation (2) derivation:

\begin{align} \log{\mathrm{P}(D_k)} & = \log{\frac{F_k}{\sum_m F_m}} \tag{1.1} \\\\ & = \log{F_k} - \log{\sum_m{F_m}} \tag{1.2} \\\\ \partial_i \log{\mathrm{P}(D_k)} & = \partial_i \log{F_k} - \partial_i \log{\sum_m{F_m}} \tag{1.3} \\\\ & = \frac{\partial_i F_k}{F_k} - \frac{\partial_i \sum_m{F_m}}{\sum_m{F_m}} \tag{1.4} \end{align}

Equation (1.4) is obtained by applying $\partial_x \log{x} = 1/x$ and the chain rule.

NB. `\partial_i F_m = 0` when `m \ne i`. I.e. above it was stated: `del_i` is the partial derivative with respect to some variable (`i`) that only affects `F_i`. Therefore the rightmost term of equation (1.4) can be simplified to give:

$$ \partial{_i} \log{\mathrm{P}(D_k)} = \frac{\partial_i F_k}{F_k} - \frac{\partial_i F_i}{\sum_m F_m} \tag{2}$$
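Equation (2) can be sanity-checked numerically. The sketch below assumes a hypothetical parameterisation `F_m = theta_m ** 2`, chosen only so that each parameter affects exactly one `F_m`; the function names are illustrative. It compares equation (2) against a central finite difference of $\log \mathrm{P}(D_k)$.

```python
import math

def F_of(theta):
    # Hypothetical choice: F_m = theta_m ** 2, so theta_i only affects F_i.
    return [t * t for t in theta]

def log_P(theta, k):
    F = F_of(theta)
    return math.log(F[k] / sum(F))

def eq2(theta, i, k):
    """Right-hand side of equation (2) for the parameterisation above."""
    F = F_of(theta)
    dFi = 2.0 * theta[i]            # d F_i / d theta_i
    dFk = dFi if k == i else 0.0    # del_i F_k = 0 when i != k
    return dFk / F[k] - dFi / sum(F)

def finite_diff(theta, i, k, h=1e-6):
    """Central difference of log P(D_k) with respect to theta_i."""
    up = list(theta)
    up[i] += h
    dn = list(theta)
    dn[i] -= h
    return (log_P(up, k) - log_P(dn, k)) / (2 * h)

theta = [1.0, 2.0, 3.0]
max_err = max(
    abs(eq2(theta, i, k) - finite_diff(theta, i, k))
    for i in range(3)
    for k in range(3)
)
print(max_err)  # should be tiny if equation (2) holds
```

Both the `i = k` case (where the first term survives) and the `i != k` case (where only the normaliser term remains) are covered by the loop.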

Separately, if we have some true data `\bar{D}` (a large set of observed states of `D`, like throws of a die), expressed as a normalized distribution by $\mathrm{\bar P}(\bar{D})$, then we can compute our model's probability of generating that observed data (per-sample, so independent of the size of `\bar{D}`) as:

$$ \mathrm{P}(\bar{D}) = \prod_k \mathrm{P}(D_k)^{ \mathrm{\bar P}(\bar{D}_k)} \tag{3}$$

Notation:

$M$ A generative model that can be sampled to produce instances of `D`, with the distribution defined by $\mathrm{P}(D_k)$

$\mathrm{P}(D_k)$ The probability that variable `D` takes state `k`. Shorthand for $\mathrm{P}(D_k | M)$

$\bar D$ A set of observations.

$\bar D_k$ The subset of observations that are in state `k`

$|\bar D_k|$ The number of observations that are in state `k`. The cardinality of `\bar D_k`

$\mathrm{\bar P}(\bar D_k)$ The proportion of observations in `\bar D` that are in state `k`.

$\mathrm{P}(\bar D)$ The normalised probability (see next section) of the model `M` generating the data set $\bar D$. Shorthand for $\mathrm{P}(\bar D | M)$

$\mathrm{\hat P}(\bar D)$ Alternative notation for $\mathrm{P}(\bar D)$, used briefly in the below derivation.

Normalised Probability

The probability of model `M` generating the observations `\bar D` is the product of each state's probability ($\mathrm{P}(D_k)$) raised to the power of the number of observations in that state:

$$ \mathrm{P}(\bar{D}) = \prod_k \mathrm{P}(D_k)^{|\bar{D}_k|} \tag{2.9}$$

Equation (2.9) is effectively equation (1) from [4] extended to `k` states rather than the two possible states of a coin flip. Note that $\mathrm{P}(\bar D)$ depends on the number of observations in `\bar D` and will generally diminish rapidly as the total number of observations increases. However, taking the `n`th root of $\mathrm{P}(\bar D)$, where `n` is the total number of observations in `\bar D`, gives the geometric mean of the per-observation probabilities, which is effectively a normalised probability:

$$ \mathrm{\hat P}(\bar D) = \mathrm{P}(\bar D)^{1/n} $$

Or equivalently, the probability of model `M` generating a given data set with `n` observations can be obtained with:

$$ \mathrm{P}(\bar D) = \mathrm{\hat P}(\bar D)^n $$

Taking the geometric mean of (2.9):

\begin{align} \mathrm{\hat P}(\bar{D}) &= \left(\prod_k \mathrm{P}(D_k)^{|\bar{D}_k|}\right)^{1/n} \\\\ \mathrm{\hat P}(\bar{D}) &= \prod_k \mathrm{P}(D_k)^{\frac{|\bar{D}_k|}{n}} \\\\ \mathrm{\hat P}(\bar{D}) &= \prod_k \mathrm{P}(D_k)^{\mathrm{\bar P}(\bar D_k)} \tag{3} \end{align}

NB. In the source document $\mathrm{P}(\bar D)$ refers to the normalised probability; that notation is adopted from here on instead of the $\mathrm{\hat P}(\bar D)$ notation.
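The relationship between equations (2.9) and (3) can be illustrated with a small sketch. The model distribution and observation counts below are made up for illustration, as are the function names.

```python
import math

P = [0.4, 0.2, 0.1, 0.3]   # hypothetical model distribution P(D_k)
counts = [3, 1, 0, 2]      # hypothetical observation counts |D_bar_k|

def raw_probability(P, counts):
    """Equation (2.9): product of P(D_k) ** |D_bar_k|."""
    return math.prod(p ** c for p, c in zip(P, counts))

def normalised_probability(P, counts):
    """Equation (3): product of P(D_k) ** (|D_bar_k| / n)."""
    n = sum(counts)
    return math.prod(p ** (c / n) for p, c in zip(P, counts))

n = sum(counts)
raw = raw_probability(P, counts)
norm = normalised_probability(P, counts)

# Doubling every count squares the raw probability but leaves the
# normalised (per-sample) probability unchanged.
doubled = [2 * c for c in counts]
raw_doubled = raw_probability(P, doubled)
norm_doubled = normalised_probability(P, doubled)

print(raw, raw_doubled)
print(norm, norm_doubled)
```

This is the sense in which the normalised form is independent of the size of the data set: it depends only on the proportions of observations in each state.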

The log of which is the cross entropy:

$$ \log \mathrm{P}(\bar{D}) = \sum_k \left[\mathrm{\bar P}(\bar{D}_k) \log \mathrm{P}(D_k) \right] \tag{4}$$

Equation (4) is obtained by applying the following two log laws to equation (3):

\begin{align} \log a^b &= b \log a \\ \log ab &= \log a + \log b \end{align}
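Spelled out step by step, applying the product law first and then the power law to equation (3):

\begin{align} \log \mathrm{P}(\bar{D}) &= \log \prod_k \mathrm{P}(D_k)^{\mathrm{\bar P}(\bar{D}_k)} \\\\ &= \sum_k \log \left[ \mathrm{P}(D_k)^{\mathrm{\bar P}(\bar{D}_k)} \right] \\\\ &= \sum_k \left[ \mathrm{\bar P}(\bar{D}_k) \log \mathrm{P}(D_k) \right] \end{align}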

NB. The cross entropy in equation (4) has its sign flipped in comparison to the usually stated form.

Taking the partial derivative and pulling in $\partial_i \log \mathrm{P}(D_k)$ from above:

\begin{align} \partial_i \log \mathrm{P}(\bar D) & = \sum_k{ \left[ \mathrm{\bar P}(\bar{D}_k) \partial_i \log \mathrm{P}(D_k) \right]} \tag{4.5} \\\\ & = \sum_k{\left[ \mathrm{\bar P}(\bar{D}_k) \left(\frac{\partial_i F_k}{F_k} - \frac{\partial_i F_i}{\sum_m F_m}\right) \right]} \tag{4.6} \\\\ & = \sum_k{\left[ \mathrm{\bar P}(\bar{D}_k) \frac{\partial_i F_k}{F_k} \right]} - \sum_k{\left[\mathrm{\bar P}(\bar{D}_k) \frac{\partial_i F_i}{\sum_m F_m} \right]} \tag{4.7} \\\\ & = \mathrm{\bar P}(\bar{D}_i) \frac{\partial_i F_i}{F_i} - \frac{\partial_i F_i}{\sum_m F_m} \tag{4.8} \\\\ & = \frac{\partial_i F_i}{F_i} \left[\mathrm{\bar P}(\bar{D}_i) - \frac{F_i}{\sum_m F_m}\right] \tag{4.9} \\\\ & = \frac{\partial_i F_i}{F_i} \left[\mathrm{\bar P}(\bar{D}_i) - \mathrm{P}(D_i) \right] \tag{5} \end{align}

At (4.5) we take the derivative w.r.t. `i`; $\mathrm{\bar P}(\bar D_k)$ is a constant factor w.r.t. `i` therefore it can be moved outside of the partial derivative.

At (4.6) equation (2) is substituted in.

At (4.8) the first sum term has simplified because $\partial_i F_k$ is zero for all `k` except when `k=i`. In the right hand term we note that in (4.7) there is no `k` in $\frac{\partial_i F_i}{\sum_m{F_m}}$, therefore $\mathrm{\bar P}(\bar D_k)$ sums to one and the entire term simplifies to $\frac{\partial_i F_i}{\sum_m{F_m}}$.

At (4.9) $\partial_i F_i / F_i$ is factored out from both terms.

At (5) the right hand term inside the brackets is replaced by the LHS of equation (1).
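The result in equation (5) can be checked against a finite difference of the log probability in equation (4). As a sketch only, this again assumes the hypothetical parameterisation `F_m = theta_m ** 2` and a made-up observed distribution `P_bar`; neither comes from the source post.

```python
import math

def F_of(theta):
    # Hypothetical choice: F_m = theta_m ** 2, so theta_i only affects F_i.
    return [t * t for t in theta]

def log_likelihood(theta, P_bar):
    """Equation (4): sum_k P_bar_k * log P(D_k)."""
    F = F_of(theta)
    total = sum(F)
    return sum(q * math.log(f / total) for f, q in zip(F, P_bar))

def eq5(theta, P_bar, i):
    """Equation (5): (del_i F_i / F_i) * (P_bar_i - P(D_i))."""
    F = F_of(theta)
    dFi = 2.0 * theta[i]
    return dFi / F[i] * (P_bar[i] - F[i] / sum(F))

theta = [1.0, 2.0, 3.0]
P_bar = [0.5, 0.3, 0.2]   # hypothetical observed proportions, summing to one
h = 1e-6

max_err = 0.0
for i in range(len(theta)):
    up = list(theta)
    up[i] += h
    dn = list(theta)
    dn[i] -= h
    numeric = (log_likelihood(up, P_bar) - log_likelihood(dn, P_bar)) / (2 * h)
    max_err = max(max_err, abs(numeric - eq5(theta, P_bar, i)))

print(max_err)  # should be tiny if equation (5) holds
```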

In sum:

\begin{align} \mathrm{P}(D_k) &\equiv \frac{F_k}{\sum_m{F_m}} \tag{1} \\\\ \mathrm{P}(\bar D) &= \prod_k{\mathrm{P}(D_k)^{\mathrm{\bar P}(\bar D_k)}} \tag{3} \\\\ \partial_i \log \mathrm{P}(\bar{D}) & = \frac{\partial_i F_i}{F_i} \left[\mathrm{\bar P}(\bar{D}_i) - \mathrm{P}(D_i) \right] \tag{5} \end{align}

So, what's it mean and why is it interesting?

Equation (1) is a fairly common scenario when you have some ability to model the relative likelihood or frequency of events. The denominator normalizes those relative votes into an actionable probability distribution.

Equation (3) in most cases is the definitive measure of the quality of a model, and is thus usually the quantity you want to maximize.

Equation (5) puts them together with the interesting result that the optimizing gradient is proportional to the residual error we get by subtracting our predicted distribution from the observed one. Note because these are probabilities, the components of that residual are nicely bounded from -1 to 1, so it has the makings of a very well-behaved gradient. It is intuitively appealing because it is zero when our prediction matches observation. And it's just surprisingly clean and simple.
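To see the residual acting as a gradient, here is a minimal sketch that assumes the particular choice `F_i = exp(theta_i)` (the familiar softmax parameterisation, which the source post does not prescribe). With that choice $\partial_i F_i / F_i = 1$, so equation (5) reduces to the bare residual $\mathrm{\bar P}(\bar{D}_i) - \mathrm{P}(D_i)$. The learning rate, step count and observed distribution are arbitrary illustrative values.

```python
import math

def predict(theta):
    """Equation (1) with the assumed parameterisation F_i = exp(theta_i)."""
    F = [math.exp(t) for t in theta]
    total = sum(F)
    return [f / total for f in F]

P_bar = [0.5, 0.3, 0.2]    # hypothetical observed proportions
theta = [0.0, 0.0, 0.0]    # start from a uniform prediction
lr = 0.5                   # arbitrary learning rate

for step in range(2000):
    P = predict(theta)
    # Equation (5) with del_i F_i / F_i = 1: the gradient is the residual.
    residual = [q - p for q, p in zip(P_bar, P)]
    # Gradient ascent on log P(D_bar).
    theta = [t + lr * r for t, r in zip(theta, residual)]

P = predict(theta)
print(P)  # approaches P_bar as the residual shrinks towards zero
```

Each update is bounded in size because every residual component lies between -1 and 1, and the updates vanish once the prediction matches the observed distribution.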

Where it goes from here depends on the nature of the underlying F functions.

One case to consider is where the F distribution is a product of distributions (where each $G_j$ is a non-normalized pseudo probability distribution over i):

$$ F_i = \prod_j{G_{j,i}} \tag{7}$$

Continuing with the generative model example from figure 1; figure 2 extends the model upwards to show a parent layer of groups above group `D`:

The two adjacent layers are fully connected, resulting in four non-normalised vectors arriving at `D`; two each from groups `B` and `C` respectively. In generative mode only one node in a group is selected per sample (per top-down stochastic generation), therefore the average top-down 'implication' from one parent group is given by multiplying each node's connection weights by its probability and summing over all nodes; i.e. for group B consisting of `k` nodes, connecting to group `D` with `i` nodes:

$$ G_{B,i} = \sum_k B_k w_{k,i} \tag{6.9} $$

Hence, the top-down implication from the `j`th parent group is vector `G_{j}`, consisting of `i` elements.

Equation (7) further combines the `G_{j}` vectors by taking their pointwise product, resulting in the non-normalised distribution `F`. Note that if the parent groups are considered to represent independent variables, taking their product gives their joint probability. One consequence of this scheme is that a single parent group can 'veto' all other groups, i.e. if a group's top-down implication for some node is a zero or near zero probability, this eliminates any high probability implications from one or more other groups.
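The combining scheme of equations (6.9) and (7), including the veto effect, can be sketched as follows. The parent probabilities and weights are invented for illustration, and the function names are hypothetical.

```python
def implication(parent_probs, weights):
    """Equation (6.9): G_i = sum_k parent_k * w[k][i]."""
    n_out = len(weights[0])
    return [
        sum(p * row[i] for p, row in zip(parent_probs, weights))
        for i in range(n_out)
    ]

def combine(G_list):
    """Equation (7): F_i = product over parent groups j of G_{j,i}."""
    F = [1.0] * len(G_list[0])
    for G in G_list:
        F = [f * g for f, g in zip(F, G)]
    return F

def normalize(F):
    """Equation (1)."""
    total = sum(F)
    return [f / total for f in F]

# Hypothetical parent groups B and C, each with two nodes, feeding a
# three-node group D. weights[k][i] connects parent node k to D node i.
B = [0.5, 0.5]
C = [1.0, 0.0]
w_B = [[0.75, 0.25, 0.0], [0.25, 0.25, 0.5]]
w_C = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0]]

G_B = implication(B, w_B)
G_C = implication(C, w_C)
F = combine([G_B, G_C])
P = normalize(F)

print(G_B)  # [0.5, 0.25, 0.25]
print(G_C)  # [0.0, 0.5, 0.5]
print(F)    # [0.0, 0.125, 0.125]
print(P)    # [0.0, 0.5, 0.5]
```

State 0 is the one group B favours most, yet group C's zero implication for it removes it from the combined distribution entirely.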

Figure 3 shows the two stages of combining for the network from figure 2. The squares in dotted groups represent the intermediate combining stages; the bold lines are the weighted connections between groups B-D and C-D. All other connections show only the flow of signals in the combining calculations.

Figure 3.
The two stages of combining of vectors.

Then:

$$ \partial_{j,i}F_i = \frac{F_i}{G_{j,i}} \partial_{j,i}{G_{j,i}} \tag{8} $$

The `j,i` subscript in `\partial_{j,i}` is shorthand for some model variable that affects `G_{j,i}`, e.g. in the current model under consideration this could be one of the weights connecting into `G_{j,i}`.

Equation (8) derivation:

\begin{align} F_i &= \prod_j{G_{j,i}} \tag{7} \\[0.7em] \partial_{j,i} F_i &= \partial_{j,i} \prod_j{G_{j,i}} \tag{7.8} \\[0.7em] \partial_{j,i}F_i &= \left(\prod_{k \ne j}G_{k,i} \right) \partial_{j,i}G_{j,i} \tag{7.9} \\[0.7em] \partial_{j,i}F_i &= \frac{F_i}{G_{j,i}} \partial_{j,i}G_{j,i} \tag{8} \\[0.7em] \end{align}

I.e. every `G_{k,i}` with `k \ne j` is a constant w.r.t. the variable being differentiated, hence their product is moved outside of the partial derivative. Equation (8) then follows because that product equals $F_i / G_{j,i}$.

Applying (5), note $F_i$ cancels out, so curiously we end up with essentially (5) again:

$$ \partial_{j,i} \log{\mathrm{P}(\bar{D})} = \frac{\partial_{j,i}G_{j,i}}{G_{j,i}} \left[\mathrm{\bar P}(\bar{D}_i) - \mathrm{P}(D_i)\right] \tag{9} $$

Equation (8) has been substituted into the $\frac{\partial_i{F_i}}{F_i}$ term of (5). $F_i$ then cancels out to give equation (9).
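Equation (9) can be checked numerically for the weighted-sum scheme of equation (6.9), where the derivative of `G_{j,i}` with respect to the weight from parent node `k` is simply that parent node's probability. The sketch below uses invented parent probabilities, weights and observed proportions; all names are illustrative.

```python
import copy
import math

def forward(parents, weights):
    """Return every G_j (equation 6.9) and the normalised P (equations 7 and 1)."""
    n_out = len(weights[0][0])
    F = [1.0] * n_out
    G_all = []
    for probs, w in zip(parents, weights):
        G = [sum(p * row[i] for p, row in zip(probs, w)) for i in range(n_out)]
        G_all.append(G)
        F = [f * g for f, g in zip(F, G)]
    total = sum(F)
    return G_all, [f / total for f in F]

def log_likelihood(parents, weights, P_bar):
    """Equation (4)."""
    _, P = forward(parents, weights)
    return sum(q * math.log(p) for q, p in zip(P_bar, P))

# Hypothetical values: two parent groups of two nodes each, three states in D.
parents = [[0.5, 0.5], [0.7, 0.3]]
weights = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
]
P_bar = [0.5, 0.3, 0.2]

# Differentiate with respect to weights[j][k][i], which only affects G_{j,i}.
j, k, i = 1, 0, 2
G_all, P = forward(parents, weights)

# Equation (9), using d G_{j,i} / d w = parents[j][k] from equation (6.9).
analytic = parents[j][k] / G_all[j][i] * (P_bar[i] - P[i])

h = 1e-6
w_up = copy.deepcopy(weights)
w_up[j][k][i] += h
w_dn = copy.deepcopy(weights)
w_dn[j][k][i] -= h
numeric = (
    log_likelihood(parents, w_up, P_bar) - log_likelihood(parents, w_dn, P_bar)
) / (2 * h)

print(analytic, numeric)  # the two values should agree closely
```

Apart from the shared residual, the analytic gradient only needs quantities belonging to group `j` and node `i`.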

The interesting point here being that even though many $G_j$s are being combined into the final predicted distribution, the "explaining away" competition amongst them is coordinated entirely via the consolidated $\mathrm{P}(D_i)$ predictions (as expressed in the error residual). That is, other than the common error residual, the gradient with respect to a parameter of some $G_{j,i}$ is local.

Note this is analogous to the process that causes singular vectors to become orthogonal in the incremental SVD approach (when training them all at once as opposed to one by one). There we are removing the net projection of the current vectors from the input before training. Here we are removing the net prediction. This reverse inhibition is a recurring theme in various learning algorithms, including some brain-inspired ones.

Colin,
November 8th, 2016

References

Resources
