Notes on BackPropagation, Levenberg–Marquardt & Bayesian Regularization Algorithms

Preface

These notes record the derivations of the BackPropagation, Levenberg–Marquardt, and Bayesian Regularization algorithms, in order to build a deeper understanding of neural networks.

BackPropagation

Differentiation in Computational Graph

In fact, there are two modes of differentiation used in neural-network computation: forward-mode differentiation and reverse-mode differentiation (the form of automatic differentiation used in backpropagation). Consider the computational graph below; we need to compute the gradient of the output with respect to the input, namely \(\frac{\partial L}{\partial x}\), where \(x, y, z\) can be multi-dimensional vectors. (Figure: backpropagation0)

Why do we choose backpropagation? Comparing the two methods: forward-mode differentiation propagates derivatives from the inputs toward the output, so obtaining the full gradient requires one sweep per input variable, whereas reverse-mode differentiation propagates derivatives from the output back toward the inputs, so a single backward sweep yields \(\frac{\partial L}{\partial x}\) for every input at once.

Since a neural network has a very large number of parameters and a scalar loss, it is clear that reverse-mode automatic differentiation is the method of choice for training neural networks efficiently.
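As a concrete illustration (a sketch added here, not part of the original notes), the snippet below contrasts the two modes on the toy graph \(L = (x \cdot y) + z\): forward mode must be re-run with a different seed for each input, while reverse mode recovers the whole gradient in one backward sweep.

```python
# A minimal sketch contrasting forward- and reverse-mode differentiation
# on the tiny graph L = (x * y) + z (a made-up example, not the graph in
# the figure above).

def forward_mode(x, y, z, dx=0.0, dy=0.0, dz=0.0):
    """Propagate one tangent (dx, dy, dz) forward through the graph."""
    a = x * y
    da = dx * y + x * dy          # product rule
    L = a + z
    dL = da + dz                  # sum rule
    return L, dL

def reverse_mode(x, y, z):
    """One backward sweep returns dL/dx, dL/dy, dL/dz simultaneously."""
    a = x * y                     # forward pass
    L = a + z
    dL_dL = 1.0                   # backward pass (backpropagation)
    dL_da, dL_dz = dL_dL, dL_dL   # L = a + z
    dL_dx, dL_dy = dL_da * y, dL_da * x   # a = x * y
    return L, (dL_dx, dL_dy, dL_dz)

# Forward mode: three sweeps, one per input variable
for seed in [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]:
    print(forward_mode(2.0, 3.0, 4.0, *seed))   # dL/dx=3, dL/dy=2, dL/dz=1

# Reverse mode: a single sweep gives the whole gradient (3.0, 2.0, 1.0)
print(reverse_mode(2.0, 3.0, 4.0))
```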

BackPropagation Process

The BackPropagation algorithm can be divided into four steps:

  1. Forward-propagate the input signal. To illustrate the backpropagation process, we use a three-layer neural network with two inputs and one output, as shown below. The difference between the network output and the target is called the error signal \(\delta\) of the output-layer neuron. (Figure: backpropagation1)
  2. Back-propagate the error signal. Propagate the error signal \(\delta\) (computed in a single teaching step) back to all neurons whose output signals were inputs to the neuron in question. (Some intermediate computations are omitted here.) (Figures: backpropagation2, backpropagation3)
  3. Calculate the parameter gradients.
  4. Update the parameters. Once the error signal of every neuron has been computed, the weight coefficients of each neuron's input connections can be modified. In the formulas below, \(\frac{d f(e)}{d e}\) denotes the derivative of the activation function of the neuron whose weights are being modified, and the coefficient \(\eta\) controls the network's teaching (learning) speed. (Some intermediate computations are omitted here.) (Figures: backpropagation4, backpropagation5, backpropagation6) A minimal code sketch of these four steps follows this list.
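The sketch below walks through the four steps for a tiny network (assumptions added here, not taken from the figures: sigmoid activations, three hidden neurons, and the \(\delta = \text{target} - \text{output}\) teaching rule with learning speed \(\eta\)).

```python
import numpy as np

# A minimal, illustrative forward/backward pass for a network with two
# inputs, one hidden layer of three sigmoid neurons, and one output.

rng = np.random.default_rng(0)

def f(e):                          # neuron activation function
    return 1.0 / (1.0 + np.exp(-e))

def df(e):                         # its derivative df(e)/de
    s = f(e)
    return s * (1.0 - s)

x = np.array([0.5, -0.3])          # two input signals
t = 0.8                            # target output
eta = 0.1                          # teaching (learning) speed

W1 = rng.normal(size=(3, 2))       # input  -> hidden weights
W2 = rng.normal(size=(1, 3))       # hidden -> output weights

# 1. Forward-propagate the input signal
e1 = W1 @ x;  y1 = f(e1)           # hidden layer
e2 = W2 @ y1; y2 = f(e2)           # output layer

# 2. Back-propagate the error signal delta
delta2 = t - y2                    # output-layer error signal
delta1 = W2.T @ delta2             # error signal propagated to hidden layer

# 3. Calculate the parameter gradients (delta * df(e)/de * input signal)
grad_W2 = np.outer(delta2 * df(e2), y1)
grad_W1 = np.outer(delta1 * df(e1), x)

# 4. Update the parameters
W2 += eta * grad_W2
W1 += eta * grad_W1
```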

Levenberg–Marquardt

The Levenberg–Marquardt algorithm, which was independently developed by Kenneth Levenberg and Donald Marquardt, provides a numerical solution to the problem of minimizing a nonlinear function. It is fast and has stable convergence. In the artificial neural-networks field, this algorithm is suitable for training small- and medium-sized problems.

Jacobian matrix & Hessian matrix

Jacobian matrix. Suppose \(F : \mathbb{R}^{n} \rightarrow \mathbb{R}^{m}\) is a function of \(\mathbf{x}=\left(x_{1}, \cdots, x_{n}\right)^{T}\) with the form \(F(\mathbf{x})=\left[ \begin{array}{c}{f_{1}\left(x_{1}, \cdots, x_{n}\right)} \\ {\vdots} \\ {f_{m}\left(x_{1}, \cdots, x_{n}\right)}\end{array}\right]\) Then the Jacobian matrix \(\mathbf{J}\) of \(F\) is the \(m \times n\) matrix, usually defined and arranged as follows: \(\mathbf{J}=\left[ \begin{array}{ccc}{\frac{\partial F}{\partial x_{1}}} & {\cdots} & {\frac{\partial F}{\partial x_{n}}}\end{array}\right]=\left[ \begin{array}{ccc}{\frac{\partial f_{1}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{1}}{\partial x_{n}}} \\ {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial f_{m}}{\partial x_{1}}} & {\cdots} & {\frac{\partial f_{m}}{\partial x_{n}}}\end{array}\right]\) or, component-wise, \(\mathbf{J}_{i j}=\frac{\partial f_{i}}{\partial x_{j}}\).

Jacobian determinant. Assume \(x=X(u, v)\) and \(y=Y(u, v)\); then the change-of-variables formula for integrals reads \(\int_{F(R)} f(x, y)\, d x\, d y=\int_{R} f(X(u, v), Y(u, v))\,|\operatorname{det} J(u, v)|\, d u\, d v\)

Hessian matrix. Suppose \(f : \mathbb{R}^{n} \rightarrow \mathbb{R}\) is a twice-differentiable function and \(\mathbf{x}=\left(x_{1}, \cdots, x_{n}\right)^{T} \in \mathbb{R}^{n}\). The \(n \times n\) real symmetric matrix \(H(\mathbf{x})=\left[h_{i j}(\mathbf{x})\right]\) defined below is called the Hessian matrix: \(H(\mathbf{x})=\left[ \begin{array}{cccc}{\frac{\partial^{2} f}{\partial x_{1} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{1} \partial x_{n}}} \\ {\frac{\partial^{2} f}{\partial x_{2} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{2} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{2} \partial x_{n}}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {\frac{\partial^{2} f}{\partial x_{n} \partial x_{1}}} & {\frac{\partial^{2} f}{\partial x_{n} \partial x_{2}}} & {\cdots} & {\frac{\partial^{2} f}{\partial x_{n} \partial x_{n}}}\end{array}\right]\) In other words, the Jacobian of \(\nabla f\) is the Hessian.
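As a quick numerical sanity check (a sketch added here; the function \(f(\mathbf{x})=x_{0}^{2} x_{1}+3 x_{1}^{2}\) is an arbitrary example), the snippet below builds the gradient by central differences and then takes the Jacobian of that gradient, recovering the analytic Hessian.

```python
import numpy as np

# Numerically verify that the Jacobian of the gradient is the Hessian,
# for the example f(x) = x0**2 * x1 + 3 * x1**2.

def f(x):
    return x[0] ** 2 * x[1] + 3.0 * x[1] ** 2

def grad(func, x, h=1e-5):
    """Central-difference gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

def jacobian(vec_func, x, h=1e-5):
    """Central-difference Jacobian of a vector-valued function."""
    m = vec_func(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (vec_func(x + e) - vec_func(x - e)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0])
H_numeric = jacobian(lambda x: grad(f, x), x0)   # Jacobian of the gradient
H_exact = np.array([[2 * x0[1], 2 * x0[0]],
                    [2 * x0[0], 6.0]])           # analytic Hessian
print(np.allclose(H_numeric, H_exact, atol=1e-4))  # True
```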

Steepest Descent Algorithm

The sum of squared errors (SSE) is used to evaluate the training process. Over all training patterns and network outputs it is calculated as \(E(\boldsymbol{x}, \boldsymbol{w})=\frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} \boldsymbol{e}_{p, m}^{2}\) where \(\boldsymbol{x}\) is the input vector, \(\boldsymbol{w}\) is the weight vector, and \(e_{p, m}\) is the training error of output \(m\) for pattern \(p\). The gradient \(\boldsymbol{g}\) is defined as the first-order derivative of the total error function: \(g=\frac{\partial E(\boldsymbol{x}, \boldsymbol{w})}{\partial \boldsymbol{w}}=\left[ \frac{\partial E}{\partial w_{1}} \quad \frac{\partial E}{\partial w_{2}} \quad \dots \quad \frac{\partial E}{\partial w_{N}} \right]^{T}\) Then the update rule of the steepest descent algorithm can be written as \(\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\alpha \boldsymbol{g}_{k}\) where \(\alpha\) is the learning coefficient.
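For concreteness, here is a minimal sketch of the steepest descent update on a hypothetical linear least-squares error surface (the matrix \(A\), the target \(b\), and the learning coefficient are made up for illustration).

```python
import numpy as np

# Steepest descent on E(w) = 1/2 * ||A w - b||^2, a toy error surface.

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def sse_and_grad(w):
    e = A @ w - b                  # residuals
    return 0.5 * e @ e, A.T @ e    # E and its gradient g = dE/dw

w = np.zeros(2)
alpha = 0.01                       # learning coefficient
for k in range(5000):
    E, g = sse_and_grad(w)
    w = w - alpha * g              # w_{k+1} = w_k - alpha * g_k
print(w)                           # ≈ [0.0, 0.5], the least-squares solution
```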

Newton’s Method

Newton’s method assumes that all the gradient components \(g_{1}, g_{2}, \dots, g_{N}\) are functions of the weights and that all weights are linearly independent: \(\left\{\begin{array}{c}{g_{1}=F_{1}\left(w_{1}, w_{2} \cdots w_{N}\right)} \\ {g_{2}=F_{2}\left(w_{1}, w_{2} \cdots w_{N}\right)} \\ {\cdots} \\ {g_{N}=F_{N}\left(w_{1}, w_{2} \cdots w_{N}\right)}\end{array}\right.\) Expand each \(g_{i}\) in a Taylor series and keep only the first-order terms: \(\left\{\begin{array}{l}{g_{1} \approx g_{1,0}+\frac{\partial g_{1}}{\partial w_{1}} \Delta w_{1}+\frac{\partial g_{1}}{\partial w_{2}} \Delta w_{2}+\cdots+\frac{\partial g_{1}}{\partial w_{N}} \Delta w_{N}} \\ {g_{2} \approx g_{2,0}+\frac{\partial g_{2}}{\partial w_{1}} \Delta w_{1}+\frac{\partial g_{2}}{\partial w_{2}} \Delta w_{2}+\cdots+\frac{\partial g_{2}}{\partial w_{N}} \Delta w_{N}} \\ {\cdots} \\ {g_{N} \approx g_{N, 0}+\frac{\partial g_{N}}{\partial w_{1}} \Delta w_{1}+\frac{\partial g_{N}}{\partial w_{2}} \Delta w_{2}+\cdots+\frac{\partial g_{N}}{\partial w_{N}} \Delta w_{N}}\end{array}\right.\) where \(\frac{\partial g_{i}}{\partial w_{j}}=\frac{\partial\left(\frac{\partial E}{\partial w_{i}}\right)}{\partial w_{j}}=\frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}\). At a minimum the gradient must vanish, so requiring \(g_{i} \approx 0\) and moving the current gradient values \(g_{i, 0}\) to the left-hand side gives \(\left[ \begin{array}{c}{-g_{1}} \\ {-g_{2}} \\ {\cdots} \\ {-g_{N}}\end{array}\right]=\left[ \begin{array}{c}{-\frac{\partial E}{\partial w_{1}}} \\ {-\frac{\partial E}{\partial w_{2}}} \\ {\cdots} \\ {-\frac{\partial E}{\partial w_{N}}}\end{array}\right]=\left[ \begin{array}{cccc}{\frac{\partial^{2} E}{\partial w_{1}^{2}}} & {\frac{\partial^{2} E}{\partial w_{1} \partial w_{2}}} & {\cdots} & {\frac{\partial^{2} E}{\partial w_{1} \partial w_{N}}} \\ {\frac{\partial^{2} E}{\partial w_{2} \partial w_{1}}} & {\frac{\partial^{2} E}{\partial w_{2}^{2}}} & {\cdots} & {\frac{\partial^{2} E}{\partial w_{2} \partial w_{N}}} \\ {\cdots} & {\cdots} & {\cdots} & {\cdots} \\ {\frac{\partial^{2} E}{\partial w_{N} \partial w_{1}}} & {\frac{\partial^{2} E}{\partial w_{N} \partial w_{2}}} & {\cdots} & {\frac{\partial^{2} E}{\partial w_{N}^{2}}}\end{array}\right] \times \left[ \begin{array}{c}{\Delta w_{1}} \\ {\Delta w_{2}} \\ {\cdots} \\ {\Delta w_{N}}\end{array}\right]\) that is, \(-\boldsymbol{g}=\boldsymbol{H} \Delta \boldsymbol{w}\), so \(\Delta \boldsymbol{w}=-\boldsymbol{H}^{-1} \boldsymbol{g}\). Therefore, the update rule for Newton’s method is \(\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\boldsymbol{H}_{k}^{-1} \boldsymbol{g}_{k}\)
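The corresponding sketch for Newton’s update on a toy quadratic error (again a made-up example): because \(E\) is quadratic in \(\boldsymbol{w}\), its Hessian is constant and a single Newton step lands on the minimizer.

```python
import numpy as np

# Newton's method, w_{k+1} = w_k - H^{-1} g, on E(w) = 1/2 * ||A w - b||^2.

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

w = np.zeros(2)
e = A @ w - b
g = A.T @ e                      # gradient g = dE/dw
H = A.T @ A                      # Hessian of a quadratic E is constant
w = w - np.linalg.solve(H, g)    # Newton step (solve rather than invert H)
print(w)                         # ≈ [0.0, 0.5] after a single step
```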

Gauss–Newton Algorithm

Since the second-order derivatives of the total error function have to be calculated and this can be very complicated, the Jacobian matrix \(\mathbf{J}\) is introduced to simplify the computation: \(\mathbf{J}=\left[ \begin{array}{cccc}{\frac{\partial e_{1,1}}{\partial w_{1}}} & {\frac{\partial e_{1,1}}{\partial w_{2}}} & {\cdots} & {\frac{\partial e_{1,1}}{\partial w_{N}}} \\ {\frac{\partial e_{1,2}}{\partial w_{1}}} & {\frac{\partial e_{1,2}}{\partial w_{2}}} & {\cdots} & {\frac{\partial e_{1,2}}{\partial w_{N}}} \\ {\cdots} & {\cdots} & {\cdots} & {\cdots} \\ {\frac{\partial e_{1, M}}{\partial w_{1}}} & {\frac{\partial e_{1, M}}{\partial w_{2}}} & {\cdots} & {\frac{\partial e_{1, M}}{\partial w_{N}}}\\{\cdots} & {\cdots} & {\cdots} & {\cdots} \\ {\frac{\partial e_{P, M}}{\partial w_{1}}} & {\frac{\partial e_{P, M}}{\partial w_{2}}} & {\cdots} & {\frac{\partial e_{P, M}}{\partial w_{N}}}\end{array}\right]\) The elements of the gradient vector can be calculated as \(g_{i}=\frac{\partial E}{\partial w_{i}}=\frac{\partial\left(\frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p, m}^{2}\right)}{\partial w_{i}}=\sum_{p=1}^{P} \sum_{m=1}^{M}\left(\frac{\partial e_{p, m}}{\partial w_{i}} e_{p, m}\right)\) Thus, the relationship between the Jacobian matrix \(\mathbf{J}\) and the gradient vector \(\boldsymbol{g}\) is \(\boldsymbol{g}=\boldsymbol{J}^{T} \boldsymbol{e}\), where \(\boldsymbol{e}=\left[ \begin{array}{c}{e_{1,1}} \\ {\ldots} \\ {e_{1, M}} \\ {\cdots} \\ {e_{P, 1}} \\ {\cdots} \\ {e_{P, M}}\end{array}\right]\)

The element in the \(i\)-th row and \(j\)-th column of the Hessian matrix can be calculated as \(h_{i, j}=\frac{\partial^{2} E}{\partial w_{i} \partial w_{j}}=\frac{\partial^{2}\left(\frac{1}{2} \sum_{p=1}^{P} \sum_{m=1}^{M} e_{p, m}^{2}\right)}{\partial w_{i} \partial w_{j}}=\sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial e_{p, m}}{\partial w_{i}} \frac{\partial e_{p, m}}{\partial w_{j}}+S_{i, j}\) where \(S_{i, j}\) is equal to \(S_{i, j}=\sum_{p=1}^{P} \sum_{m=1}^{M} \frac{\partial^{2} e_{p, m}}{\partial w_{i} \partial w_{j}} e_{p, m}\) The basic assumption of the Gauss–Newton method is that \(S_{i, j}\) is close to zero, so that \(\boldsymbol{H} \approx \boldsymbol{J}^{T} \boldsymbol{J}\) and the update rule becomes \(\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\left(\boldsymbol{J}_{k}^{T} \boldsymbol{J}_{k}\right)^{-1} \boldsymbol{J}_{k}^{T} \boldsymbol{e}_{k}\)
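A sketch of the Gauss–Newton iteration on a hypothetical nonlinear least-squares problem (the exponential model and the data are made up; in network training, \(\boldsymbol{e}\) and \(\mathbf{J}\) would instead come from the forward pass and backpropagation).

```python
import numpy as np

# Gauss-Newton on a toy model y = w0 * exp(w1 * x) fitted to four points.

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])

def residuals(w):
    return w[0] * np.exp(w[1] * x) - y          # e_p

def jacobian(w):
    # de_p/dw0 and de_p/dw1, stacked as a (P x N) matrix
    return np.column_stack([np.exp(w[1] * x),
                            w[0] * x * np.exp(w[1] * x)])

w = np.array([1.0, 0.5])
for _ in range(10):
    e, J = residuals(w), jacobian(w)
    g = J.T @ e                                 # gradient  g = J^T e
    H = J.T @ J                                 # Gauss-Newton Hessian approximation
    w = w - np.linalg.solve(H, g)               # update rule
print(w)                                        # ≈ [2.0, 0.3]
```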

Levenberg–Marquardt Algorithm

In order to make sure that the approximated Hessian matrix \(\boldsymbol{J}^{T} \boldsymbol{J}\) is invertible, the Levenberg–Marquardt algorithm introduces another approximation to the Hessian matrix: \(\boldsymbol{H} \approx \boldsymbol{J}^{T} \boldsymbol{J}+\mu \boldsymbol{I}\) where \(\mu\) is always positive and is called the combination coefficient, and \(\boldsymbol{I}\) is the identity matrix. Therefore, the update rule of the Levenberg–Marquardt algorithm can be written as \(\boldsymbol{w}_{k+1}=\boldsymbol{w}_{k}-\left(\boldsymbol{J}_{k}^{T} \boldsymbol{J}_{k}+\mu \boldsymbol{I}\right)^{-1} \boldsymbol{J}_{k}^{T} \boldsymbol{e}_{k}\) When the combination coefficient \(\mu\) is very small (near zero), the L–M algorithm approaches the Gauss–Newton algorithm; when \(\mu\) is very large, the update behaves like the steepest descent method with learning coefficient \(\alpha=\frac{1}{\mu}\).
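A sketch of the Levenberg–Marquardt update with a simple rule for adapting \(\mu\) (decrease it after a successful step, increase it after a failed one); the toy problem and the factor-of-10 schedule are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# Levenberg-Marquardt on the same toy model y = w0 * exp(w1 * x).

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.7, 3.6, 4.9])

def residuals(w):
    return w[0] * np.exp(w[1] * x) - y

def jacobian(w):
    return np.column_stack([np.exp(w[1] * x),
                            w[0] * x * np.exp(w[1] * x)])

w, mu = np.array([1.0, 0.5]), 1e-2
E = 0.5 * residuals(w) @ residuals(w)
for _ in range(50):
    e, J = residuals(w), jacobian(w)
    step = np.linalg.solve(J.T @ J + mu * np.eye(2), J.T @ e)
    w_new = w - step
    E_new = 0.5 * residuals(w_new) @ residuals(w_new)
    if E_new < E:                 # success: move toward Gauss-Newton
        w, E, mu = w_new, E_new, mu * 0.1
    else:                         # failure: move toward steepest descent
        mu *= 10.0
print(w, E)
```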

Bayesian Regularization

Bayesian regularized artificial neural networks (BRANNs) are more robust than standard back-propagation nets and can reduce or eliminate the need for lengthy cross-validation. They are difficult to overtrain, since evidence procedures provide an objective Bayesian criterion for stopping training. They are also difficult to overfit, because the BRANN calculates and trains on a number of effective network parameters or weights, effectively turning off those that are not relevant.

Inference

Typically, training aims to minimize the sum of squared errors \(F=E_{D}\). Regularization adds an additional term, so the objective function becomes \(F=\alpha E_{W}+\beta E_{D}\) where the simplest choice of regularizer is the weight-decay regularizer \(E_{W}(\mathbf{w})=\frac{1}{2} \sum_{i} w_{i}^{2}\)

In the Bayesian framework the weights of the network are treated as random variables. After the data are observed, the density function for the weights is updated according to Bayes’ rule: \(P(\mathbf{w} | D, \alpha, \beta, M)=\frac{P(D | \mathbf{w}, \beta, M) P(\mathbf{w} | \alpha, M)}{P(D | \alpha, \beta, M)}\) In words, \(\text { Posterior}=\frac{\text { Likelihood } \times \text { Prior }}{\text { Evidence }}\) where \(D\) is the data set and \(M\) is the particular neural-network model used. Assuming that the noise in the training data is Gaussian and that the prior distribution of the weights is Gaussian, the probability densities can be written as \(P(D | \mathbf{w}, \beta, M)=\frac{1}{Z_{D}(\beta)} \exp \left(-\beta E_{D}\right)\) and \(P(\mathbf{w} | \alpha, M)=\frac{1}{Z_{W}(\alpha)} \exp \left(-\alpha E_{W}\right)\) with normalization constants \(Z_{D}(\beta)=\left(\frac{\pi}{\beta}\right)^{\frac{n}{2}}\) and \(Z_{W}(\alpha)=\left(\frac{\pi}{\alpha}\right)^{\frac{N}{2}}\) where \(n\) is the number of training errors and \(N\) is the number of network parameters. The posterior probability of the parameters \(\mathbf{w}\) is then \(P(\mathbf{w} | D, \alpha, \beta, M)=\frac{\frac{1}{Z_{W}(\alpha)} \frac{1}{Z_{D}(\beta)} \exp \left(-\left(\beta E_{D}+\alpha E_{W}\right)\right)}{\text { Normalization Factor }}=\frac{1}{Z_{F}(\alpha, \beta)} \exp (-F(\mathbf{w}))\)
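A tiny sketch (with made-up residuals and weights) of evaluating the regularized objective \(F=\alpha E_{W}+\beta E_{D}\) and the corresponding unnormalized log-posterior \(\log P(\mathbf{w} | D, \alpha, \beta, M)=-F(\mathbf{w})+\text{const}\), using the \(\tfrac{1}{2}\)-sum-of-squares definitions of \(E_{D}\) and \(E_{W}\) given above.

```python
import numpy as np

# Regularized objective F = alpha*E_W + beta*E_D and unnormalized
# log-posterior -F(w); the numbers below are placeholders.

def objective(errors, weights, alpha, beta):
    E_D = 0.5 * np.sum(errors ** 2)      # data term (sum of squared errors)
    E_W = 0.5 * np.sum(weights ** 2)     # weight-decay regularizer
    return alpha * E_W + beta * E_D

errors = np.array([0.1, -0.2, 0.05])
weights = np.array([0.5, -1.2, 0.3, 0.8])
F = objective(errors, weights, alpha=0.01, beta=1.0)
log_posterior_unnormalized = -F          # posterior ∝ exp(-F(w))
print(F, log_posterior_unnormalized)
```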

Optimizing \(\alpha \text { and } \beta\)

Applying Bayes’ rule to optimize the objective-function parameters \(\alpha\) and \(\beta\), we have \(P(\alpha, \beta | D, M)=\frac{P(D | \alpha, \beta, M) P(\alpha, \beta | M)}{P(D | M)}\) Assuming a flat (uninformative) prior \(P(\alpha, \beta | M)\), maximizing the posterior is achieved by maximizing the likelihood function, which can be obtained from the previous Bayesian formula: \(P(D | \alpha, \beta, M)=\frac{P(D | \mathbf{w}, \beta, M) P(\mathbf{w} | \alpha, M)}{P(\mathbf{w} | D, \alpha, \beta, M)}=\frac{Z_{F}(\alpha, \beta)}{Z_{D}(\beta) Z_{W}(\alpha)}\) We can evaluate \(Z_{F}(\alpha, \beta)\) by a Taylor series expansion of \(F\) around the minimum point \(\mathbf{w}_{\mathrm{MP}}\) of the posterior density (the expansion is exact when \(E_{D}\) and \(E_{W}\) are quadratic). Let \(\mathbf{H}=\beta \nabla^{2} E_{D}+\alpha \nabla^{2} E_{W}\) be the Hessian matrix of the objective function. Then \(F(\mathbf{w}) \approx F\left(\mathbf{w}_{\mathrm{MP}}\right)+\frac{1}{2}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)^{T} \mathbf{H}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)\) Since \(\mathbf{H}\) is symmetric and positive definite at the minimum, it can be factored as \(\mathbf{H}=\left(\mathbf{H}^{\frac{1}{2}}\right)^{T}\mathbf{H}^{\frac{1}{2}}\). Introducing the variable \(\mathbf{u}=\mathbf{H}^{\frac{1}{2}}\left(\mathbf{w}-\mathbf{w}_{\mathrm{MP}}\right)\), substituting it into \(\frac{1}{Z_{F}(\alpha, \beta)} \exp (-F(\mathbf{w}))\), and integrating both sides, we obtain \(1=\frac{1}{Z_{F}(\alpha, \beta)} e^{-F\left(\mathbf{w}_{\mathrm{MP}}\right)} \int_{\mathbb{R}^{N}} e^{-\mathbf{u}^{T} \mathbf{u} / 2}\left[\operatorname{det}\left(\mathbf{H}\left(\mathbf{w}_{\mathrm{MP}}\right)^{-1}\right)\right]^{\frac{1}{2}} d \mathbf{u}\) Therefore \(Z_{F}(\alpha, \beta)=(2 \pi)^{\frac{N}{2}} e^{-F\left(\mathbf{w}_{\mathrm{MP}}\right)}\left[\operatorname{det}\left(\mathbf{H}\left(\mathbf{w}_{\mathrm{MP}}\right)^{-1}\right)\right]^{\frac{1}{2}}\) Substituting this into \(\frac{Z_{F}(\alpha, \beta)}{Z_{D}(\beta) Z_{W}(\alpha)}\) and taking the log of the evidence gives \(\log P(D | \alpha, \beta, M)=-\alpha E_{W}^{M P}-\beta E_{D}^{M P}-\frac{1}{2} \log \operatorname{det} \mathbf{H}+\frac{n}{2} \log \beta+\frac{N}{2} \log \alpha+\text{const}\) where the constant collects terms independent of \(\alpha\) and \(\beta\). We can solve for the optimal values of \(\alpha\) and \(\beta\) by taking the derivatives of the log evidence with respect to \(\alpha\) and \(\beta\) and setting them to zero. Differentiating with respect to \(\alpha\), we first need to evaluate \(\frac{1}{2} \log \operatorname{det} \mathbf{H}\); since \(\nabla^{2} E_{W}=\mathbf{I}\) for the weight-decay regularizer, \(\frac{d}{d \alpha} \log \operatorname{det} \mathbf{H}=\operatorname{Trace}\left(\mathbf{H}^{-1} \frac{d \mathbf{H}}{d \alpha}\right)=\operatorname{Trace}\left(\mathbf{H}^{-1} \mathbf{I}\right)=\operatorname{Trace}\left(\mathbf{H}^{-1}\right)\) Then we obtain the following condition for the most probable value of \(\alpha\): \(2 \alpha E_{W}^{M P}=N-\alpha \operatorname{Trace}\left(\mathbf{H}^{-1}\right)\) The quantity on the right-hand side is called the number of good parameter measurements, \(\gamma\).
Similarly, differentiating the log evidence with respect to \(\beta\) gives \(2 \beta E_{D}^{M P}=n-N+\alpha \operatorname{Trace}\left(\mathbf{H}^{-1}\right)=n-\gamma\) Thus, the optimal values of \(\alpha\) and \(\beta\) are \(\alpha^{\mathrm{MP}}=\frac{\gamma}{2 E_{W}\left(\mathbf{w}^{\mathrm{MP}}\right)} \text { and } \beta^{\mathrm{MP}}=\frac{n-\gamma}{2 E_{D}\left(\mathbf{w}^{\mathrm{MP}}\right)}\) where \(\gamma=N-2 \alpha^{\mathrm{MP}} \operatorname{Trace}\left(\mathbf{H}\right)^{-1}\) is called the effective number of parameters and \(N\) is the total number of parameters in the network. (Here \(\mathbf{H}\) denotes the Gauss–Newton approximation \(2 \beta \mathbf{J}^{T} \mathbf{J}+2 \alpha \mathbf{I}_{N}\) used in the training steps below; it is twice the Hessian written above, which is why the factor \(2 \alpha^{\mathrm{MP}}\) appears here instead of \(\alpha\).)
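A small sketch of these re-estimation formulas (assuming, as in the training steps below, the conventions \(E_{D}=\sum e^{2}\), \(E_{W}=\sum w^{2}\), and \(\mathbf{H}=2 \beta \mathbf{J}^{T} \mathbf{J}+2 \alpha \mathbf{I}_{N}\); the Jacobian, residuals, and weights are randomly generated placeholders).

```python
import numpy as np

# Re-estimate alpha and beta from the Gauss-Newton Hessian approximation.

def reestimate(J, e, w, alpha, beta):
    n, N = J.shape                          # n errors, N parameters
    E_D = np.sum(e ** 2)                    # sum of squared errors
    E_W = np.sum(w ** 2)                    # sum of squared weights
    H = 2.0 * beta * J.T @ J + 2.0 * alpha * np.eye(N)
    gamma = N - 2.0 * alpha * np.trace(np.linalg.inv(H))
    alpha_new = gamma / (2.0 * E_W)
    beta_new = (n - gamma) / (2.0 * E_D)
    return alpha_new, beta_new, gamma

rng = np.random.default_rng(1)
J = rng.normal(size=(20, 5))                # Jacobian of the errors w.r.t. weights
e = rng.normal(scale=0.1, size=20)          # residual errors
w = rng.normal(scale=0.5, size=5)           # network weights
print(reestimate(J, e, w, alpha=0.01, beta=1.0))
```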

Training Steps

  1. Initialize \(\alpha, \beta\) and the weights. We choose to set \(\alpha=0, \beta=1\) and use the Nguyen–Widrow method to initialize the weights. After the first training step, the objective-function parameters recover from this initial setting.
  2. Take one step of the Levenberg-Marquardt algorithm to minimize the objective function \(F=\alpha E_{W}+\beta E_{D}\)
  3. Compute the effective number of parameters \(\gamma=N-2 \alpha^{\mathrm{MP}} \operatorname{Trace}\left(\mathbf{H}\right)^{-1}\), making use of the Gauss-Newton approximation to the Hessian available in the Levenberg-Marquardt training algorithm: \(\mathbf{H}=\nabla^2 F(\mathbf{w}) \approx 2 \beta \mathbf{J}^{\mathrm{T}} \mathbf{J}+2 \alpha \mathbf{I}_{N}\)
  4. Compute new estimates for the objective function parameters \(\alpha^{\mathrm{MP}}=\frac{\gamma}{2 E_{W}\left(\mathbf{w}^{\mathrm{MP}}\right)} \text { and } \beta^{\mathrm{MP}}=\frac{n-\gamma}{2 E_{D}\left(\mathbf{w}^{\mathrm{MP}}\right)}\)
  5. Iterate steps 2 through 4 until convergence. A compact code sketch of this loop is given below.
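Putting the five steps together, here is a compact sketch of the full training loop on a toy curve-fitting problem (the \(\tanh\) model standing in for a tiny network, the data, the small random initialization in place of Nguyen–Widrow, and the \(\mu\) schedule are all illustrative assumptions; conventions: \(E_{D}=\sum e^{2}\), \(E_{W}=\sum w^{2}\)).

```python
import numpy as np

# Bayesian-regularized Levenberg-Marquardt training, steps 1-5 above,
# for the toy model y = w0*tanh(w1*x + w2) + w3.

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 30)
y = 1.5 * np.tanh(1.0 * x - 0.2) + 0.3 + 0.05 * rng.normal(size=x.size)

def model(w):
    return w[0] * np.tanh(w[1] * x + w[2]) + w[3]

def jac(w):                                   # d(error)/dw, one row per sample
    t = np.tanh(w[1] * x + w[2])
    dt = 1.0 - t ** 2
    return np.column_stack([t, w[0] * dt * x, w[0] * dt, np.ones_like(x)])

# Step 1: initialize alpha, beta and the weights
alpha, beta = 0.0, 1.0
w = 0.1 * rng.normal(size=4)
mu, n, N = 1e-2, x.size, 4

for it in range(200):
    e, J = model(w) - y, jac(w)
    F = beta * (e @ e) + alpha * (w @ w)

    # Step 2: one Levenberg-Marquardt step on F = beta*E_D + alpha*E_W
    g = 2.0 * beta * J.T @ e + 2.0 * alpha * w
    H = 2.0 * beta * J.T @ J + 2.0 * alpha * np.eye(N)
    w_new = w - np.linalg.solve(H + mu * np.eye(N), g)
    e_new = model(w_new) - y
    F_new = beta * (e_new @ e_new) + alpha * (w_new @ w_new)
    if F_new < F:
        w, mu = w_new, mu * 0.1               # step accepted
    else:
        mu *= 10.0                            # step rejected, damp more
        continue

    # Step 3: effective number of parameters from the Gauss-Newton Hessian
    e, J = model(w) - y, jac(w)
    H = 2.0 * beta * J.T @ J + 2.0 * alpha * np.eye(N)
    gamma = N - 2.0 * alpha * np.trace(np.linalg.inv(H))

    # Step 4: new estimates of the objective-function parameters
    alpha = gamma / (2.0 * (w @ w))
    beta = (n - gamma) / (2.0 * (e @ e))
    # Step 5: iterate (here, for a fixed budget of 200 iterations)

print(w, alpha, beta, gamma)
```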

Reference

  1. Principles of training multi-layer neural network using backpropagation
  2. Automatic Differentiation
  3. Levenberg–Marquardt Training
  4. Artificial Neural Networks Methods and Applications
  5. Bayesian Interpolation
  6. Information Theory, Inference and Learning Algorithms
  7. Gauss–Newton Approximation to Bayesian Learning