2.4 Nonlinear Least Squares Approximation
A Linearizing Technique and Gradient Methods
Let $\,\{(t_i,y_i)\}_{i=1}^N\,$ be a given data set. Consider the following minimization problem:
Find a sinusoid that best approximates the data, in the least-squares sense. That is, find real constants $\,A\ne 0\,,$ $\,\phi\,,$ and $\,\omega\ne 0\,$ so that the function $\,f\,$ given by
$$ f(t) = A\sin(\omega t + \phi) $$minimizes the sum of the squared errors between $\,f(t_i)\,$ and $\,y_i\,$:
$$ \begin{align} SSE &:= \sum_{i=1}^N \bigl( y_i - f(t_i) \bigr)^2\cr &= \sum_{i=1}^N \bigl( y_i - A\sin(\omega t_i + \phi) \bigr)^2 \end{align} $$This problem differs from the least-squares problem considered in Section 2.2, in that the parameters to be ‘adjusted’ ($\,A\,,$ $\,\phi\,,$ and $\,\omega\,$) do not all appear linearly in the formula for the approximating function $\,f\,.$ (This notion of linear dependence on a parameter is made precise below.)
Consequently, the resulting problem is called nonlinear least-squares approximation, and it is far more difficult to solve, because it does not lend itself to exact solution by matrix methods.
Indeed, nonlinear least-squares problems must generally be solved by iterative techniques. Nonlinear least-squares approximation is the subject of this section.
First, a definition:
Suppose $\,f\,$ is a function that depends on time, a fixed parameter $\,b\,,$ and possibly other parameters. This dependence is denoted by $\,f = f(t,b,x)\,,$ where $\,x\,$ represents all other parameters.
The function $\,f\,$ is said to be linear with respect to the parameter $\,b\,$ if and only if the function $\,f\,$ has the form
$$ f(t,b,x) = b\cdot f_1(t,x) + f_2(t,x) $$for functions $\,f_1\,$ and $\,f_2\,$ that may depend on $\,t\,$ and $\,x\,,$ but not on $\,b\,.$
This definition is generalized in the obvious way. For example, a function $\,f = f(t,b_1,b_2,x)\,$ is linear with respect to the parameters $\,b_1\,$ and $\,b_2\,$ if and only if $\,f\,$ has the form
$$ f(t,b_1,b_2,x) = b_1\cdot f_1(t,x) + b_2\cdot f_2(t,x) + f_3(t,x) $$for functions $\,f_1\,,$ $\,f_2\,,$ and $\,f_3\,$ that may depend on $\,t\,$ and $\,x\,,$ but not on $\,b_1\,$ and $\,b_2\,.$
Example
The function $\,f(t,A,\phi,\omega) = A\sin(\omega t + \phi)\,$ is linear with respect to the parameter $\,A\,,$ but not linear with respect to $\,\phi\,$ or $\,\omega\,.$
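Indeed, writing $\,x = (\phi,\omega)\,$ for the remaining parameters, $\,f\,$ has exactly the form required by the definition:
$$ f(t,A,x) = A\cdot\underbrace{\sin(\omega t + \phi)}_{f_1(t,x)} + \underbrace{0}_{f_2(t,x)} $$No such decomposition is possible with respect to $\,\phi\,$ or $\,\omega\,,$ since each of these parameters appears inside the sine.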
If $\,f\,$ depends linearly on a parameter $\,b\,,$ then $\,\frac{\partial SSE}{\partial b}$ also depends linearly on $\,b\,,$ as follows:
$$ \begin{align} SSE &= \sum_{i=1}^N \bigl( y_i - b\cdot f_1(t_i,x) - f_2(t_i,x) \bigr)^2\,;\cr\cr \frac{\partial SSE}{\partial b} &= \sum_{i=1}^N\ 2 \bigl( y_i - b\cdot f_1(t_i,x)\cr &\qquad\qquad - f_2(t_i,x) \bigr) \cdot \bigl( -f_1(t_i,x) \bigr)\cr\cr &= b\cdot\sum_{i=1}^N\,2\bigl(f_1(t_i,x)\bigr)^2\cr &\qquad - 2\sum_{i=1}^N \bigl( y_if_1(t_i,x) - f_1(t_i,x)f_2(t_i,x) \bigr) \end{align} $$Similarly, if $\,f\,$ depends linearly on parameters $\,b_1,\ldots,b_m\,,$ then each partial $\,\frac{\partial SSE}{\partial b_i}\,,$ $\,i = 1,\ldots,m\,,$ depends linearly on these parameters. Consequently, as seen in Section 2.2, matrix methods can be used to find parameters that make all the partials equal to zero.
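For instance, setting the partial above equal to zero and solving for $\,b\,$ gives the explicit formula (assuming $\,\sum_{i=1}^N \bigl(f_1(t_i,x)\bigr)^2 \ne 0\,$):
$$ b = \frac{\sum_{i=1}^N \bigl( y_i - f_2(t_i,x) \bigr)\,f_1(t_i,x)}{\sum_{i=1}^N \bigl( f_1(t_i,x) \bigr)^2} $$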
Unfortunately, when $\,f\,$ does NOT depend linearly on a parameter $\,b\,,$ the situation is considerably more difficult, as illustrated next:
Example
Let $\,f(t) = A\sin(\omega t + \phi)\,,$ so that, in particular, $\,f\,$ does not depend linearly on $\,\omega\,.$ In this case, differentiation of
$$ SSE = \sum_{i=1}^N \bigl( y_i - A\sin(\omega t_i + \phi) \bigr)^2 $$with respect to $\,\omega\,$ yields:
$$ \begin{align} \frac{\partial SSE}{\partial\omega} &= \sum_{i=1}^N\ 2 \bigl( y_i - A\sin(\omega t_i + \phi) \bigr)\cr &\qquad\qquad \cdot \bigl( -At_i\cos(\omega t_i + \phi) \bigr) \end{align} $$Even if $\,A\,$ and $\,\phi\,$ are held constant, the equation $\,\frac{\partial SSE}{\partial\omega} = 0\,$ is not easy to solve for $\,\omega\,.$
Since a linear dependence on parameters is desirable, it is reasonable to try to replace the class of approximating functions,
$$ \begin{align} {\cal F} &= \{ f\ |\ f(t) = A\sin(\omega t + \phi)\,,\cr &\qquad A\ne 0\,, \phi\in\Bbb R\,, \omega \ne 0 \} \end{align} \tag{1} $$with another class $\,\overset{\sim }{\cal F}\,,$ that satisfies the following properties:
- $\cal F = \overset{\sim }{\cal F}\,,$ and
- The functions in the new class $\,\overset{\sim }{\cal F}\,$ depend linearly on more of the approximating parameters than the functions in the original class.
If it is possible to find such a class $\,\overset{\sim }{\cal F}\,,$ then the process of replacing $\,\cal F\,$ by $\,\overset{\sim }{\cal F}\,$ is called a linearizing technique.
Example
The next lemma shows that the set
$$ \begin{align} \overset{\sim}{\cal F} &:= \{ f\ |\ f(t) = C\sin\omega t + D\cos\omega t\,,\cr &\qquad \omega\ne 0\,, C\ \text{ and } D \text{ not both zero} \} \end{align} $$is equal to the set $\,\cal F\,$ given in ($1$). Thus, any function representable in the form $\,A\sin(\omega t + \phi)\,$ is also representable in the form $\,C\sin\omega t + D\cos\omega t\,,$ and vice versa.
Both classes $\,\cal F\,$ and $\,\overset{\sim}{\cal F}\,$ involve three parameters: $\,\cal F\,$ involves $\,A\,,$ $\,\phi\,$ and $\,\omega\,$; and $\,\overset{\sim}{\cal F}\,$ involves $\,C\,,$ $\,D\,$ and $\,\omega\,.$
However, a function $\,f(t) = C\sin\omega t + D\cos\omega t\,$ from $\,\overset{\sim}{\cal F}\,$ depends linearly on both $\,C\,$ and $\,D\,,$ whereas $\,f(t) = A\sin(\omega t + \phi)\,$ only depends linearly on $\,A\,.$ Thus, the class $\,\overset{\sim}{\cal F}\,$ is more desirable.
Let $\,\cal F\,$ and $\,\overset{\sim }{\cal F}\,$ be classes of functions of one variable, defined by:
$$ \begin{align} {\cal F} &= \{ f\ |\ f(t) = A\sin(\omega t + \phi)\,,\cr &\qquad A\ne 0\,, \phi\in\Bbb R\,, \omega \ne 0 \} \end{align} $$ $$ \text{and} $$ $$ \begin{align} \overset{\sim}{\cal F} &:= \{ f\ |\ f(t) = C\sin\omega t + D\cos\omega t\,,\cr &\qquad \omega\ne 0\,, C\ \text{ and } D \text{ not both zero} \} \end{align} $$Then, $\,\cal F = \overset{\sim}{\cal F}\,.$
Proof
Let $\,f\in \cal F\,.$ For all $\,t\,,$ $\,A\,,$ $\,\phi\,$ and $\,\omega\,$:
$$ \begin{align} f(t) &= A\sin(\omega t + \phi)\cr\cr &= A(\sin\omega t\cos\phi + \cos\omega t\sin\phi)\cr\cr &= \overbrace{(A\cos\phi)}^{C}\sin\omega t + \overbrace{(A\sin\phi)}^{D}\cos\omega t \tag{*} \end{align} $$Define $\, C := A\cos\phi\,$ and $\,D := A\sin\phi\,.$ Since $\,f\in\cal F\,,$ one has $\,A\ne 0\,.$ Therefore, the only way that both $\,C\,$ and $\,D\,$ can equal zero is if $\,\cos\phi = \sin\phi = 0\,,$ which is impossible. Thus, $\,f\in \overset{\sim}{\cal F}\,.$
Now, let $\,f\in\overset{\sim}{\cal F}\,,$ so that $\,f(t) = C\sin\omega t + D\cos \omega t\,$ for $\,\omega\ne 0\,,$ for numbers $\,C\,$ and $\,D\,$ that are not both zero. It may be useful to refer to the flow chart below while reading the remainder of the proof.
[Flow chart: the case $\,C = 0\,$; the case $\,C \ne 0\,,$ subdivided by the sign of $\,C\,$]
If $\,C = 0\,,$ then $\,D\ne 0\,.$ In this case:
$$ \begin{align} &C\sin\omega t + D\cos\omega t\cr &\quad = D\cos\omega t\cr &\quad = D\sin(\omega t + \frac\pi 2) \end{align} $$Take $\,A := D\,$ and $\,\phi := \frac\pi 2\,$ to see that $\,f\in\cal F\,.$
Now, let $\,C\ne 0\,.$ If there exist constants $\,A\ne 0\,$ and $\,\phi\,$ for which $\,C = A\cos\phi\,$ and $\,D = A\sin\phi\,,$ then $\,A\,$ and $\,\phi\,$ must satisfy certain requirements, as follows:
Since $\,C\ne 0\,,$ it follows that $\,\cos\phi \ne 0\,.$ Thus:
$$ \frac DC = \frac{A\sin\phi}{A\cos\phi} = \tan\phi $$There are infinitely many choices of $\,\phi\,$ that satisfy this requirement. Choose $\,\phi := \tan^{-1}\frac DC\,,$ where $\,\tan^{-1}\,$ denotes the arctangent function, defined by:
$$ \begin{gather} y = \tan^{-1}x\cr \iff\cr \bigl( \tan y = x\ \text{ and }\ y\in (-\frac\pi 2,\frac\pi 2)\, \bigr) \end{gather} $$
Also note that
$$ \begin{align} &C^2 + D^2\cr &\quad = (A\cos\phi)^2 + (A\sin\phi)^2\cr &\quad = A^2\,, \end{align} $$so that $\,A\,$ must equal either $\,\sqrt{C^2 + D^2}\,$ or $\,-\sqrt{C^2 + D^2}\,.$
The table below gives the correct formulas for
$$ \cos\bigl(\tan^{-1}\frac DC\bigr) \ \ \text{and}\ \ \sin\bigl(\tan^{-1}\frac DC\bigr)\ , $$for all possible sign choices of C and D. In this table:
$$ K := \sqrt{C^2 + D^2} $$

| $D$ | $C$ | $\sin\bigl(\tan^{-1}\frac DC\bigr)$ | $\cos\bigl(\tan^{-1}\frac DC\bigr)$ |
|:---:|:---:|:---:|:---:|
| $+$ | $+$ | $\frac DK$ | $\frac CK$ |
| $+$ | $-$ | $-\frac DK$ | $-\frac CK$ |
| $-$ | $+$ | $\frac DK$ | $\frac CK$ |
| $-$ | $-$ | $-\frac DK$ | $-\frac CK$ |
If $\,C \gt 0\,,$ then take $\,A := \sqrt{C^2 + D^2}\,.$ Using (*):
$$ \begin{align} &A\sin(\omega t + \phi)\cr\ &\quad = \sqrt{C^2 + D^2}\,\cos\bigl(\tan^{-1}\frac DC\bigr)\sin\omega t\cr &\qquad + \sqrt{C^2+D^2}\,\sin\bigl(\tan^{-1}\frac DC\bigr)\cos\omega t\cr &= \sqrt{C^2 + D^2}\,\frac{C}{\sqrt{C^2+D^2}}\,\sin\omega t\cr &\qquad + \sqrt{C^2+D^2}\,\frac{D}{\sqrt{C^2+D^2}}\,\cos\omega t\cr &\quad = C\sin\omega t + D\cos\omega t \end{align} $$If $\,C \lt 0\,,$ take $\,A := -\sqrt{C^2 + D^2}\,.$ Then:
$$ \begin{align} &A\sin(\omega t + \phi)\cr\ &\quad = -\sqrt{C^2 + D^2}\,\cos\bigl(\tan^{-1}\frac DC\bigr)\sin\omega t\cr &\qquad - \sqrt{C^2+D^2}\,\sin\bigl(\tan^{-1}\frac DC\bigr)\cos\omega t\cr &= -\sqrt{C^2 + D^2}\,\frac{-C}{\sqrt{C^2+D^2}}\,\sin\omega t\cr &\qquad - \sqrt{C^2+D^2}\,\frac{-D}{\sqrt{C^2+D^2}}\,\cos\omega t\cr &\quad = C\sin\omega t + D\cos\omega t \end{align} $$In both cases, it has been shown that $\,f\in\cal F\,.$ $\blacksquare$
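The recipe in this proof is easy to carry out numerically. Here is a short MATLAB sketch (the particular values of $\,C\,,$ $\,D\,$ and $\,\omega\,$ are illustrative) that computes $\,A\,$ and $\,\phi\,$ from $\,C\,$ and $\,D\,,$ then checks the identity:

```matlab
% Convert (C,D) to (A,phi) following the cases in the proof,
% then verify numerically that A*sin(w*t + phi) = C*sin(w*t) + D*cos(w*t).
C = -3;  D = 2;  w = 5;                 % sample values; C and D not both zero

if C == 0
    A = D;    phi = pi/2;               % pure cosine case
elseif C > 0
    A =  sqrt(C^2 + D^2);  phi = atan(D/C);
else
    A = -sqrt(C^2 + D^2);  phi = atan(D/C);
end

t = linspace(0, 2, 200);
max(abs( A*sin(w*t + phi) - (C*sin(w*t) + D*cos(w*t)) ))   % essentially zero
```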
The following iterative approach can be used to find a choice of parameters $\,C\,,$ $\,D\,$ and $\,\omega\,$ so that the function $\,f(t) = C\sin\omega t + D \cos\omega t\,$ best approximates the data, in the least-squares sense. Observe that, once a value of $\,\omega\,$ is fixed, the remaining problem is just linear least-squares approximation, treated in Section $2.2\,.$
- Estimate a value of $\,\omega\,,$ based perhaps on initial data analysis.
- Use the methods from Section $2.2\,$ to find optimal parameters $\,C\,$ and $\,D\,$ corresponding to $\,\omega\,$ (a short code sketch of this step appears after this list).
- Compute the sum of squared errors, $\,SSE(C,D,\omega)\,,$ corresponding to $\,C\,,$ $\,D\,$ and $\,\omega\,.$
- Try to decrease $\,SSE\,$ by adjusting $\,\omega\,,$ and repeating the process.
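For a fixed $\,\omega\,,$ the second step is an ordinary linear least-squares problem, which MATLAB's backslash operator solves directly. A minimal sketch, assuming column vectors t and y hold the data and w holds the trial value of $\,\omega\,$:

```matlab
% Best C and D for a FIXED frequency w (linear least squares).
X   = [sin(w*t), cos(w*t)];      % N-by-2 design matrix
cd  = X \ y;                     % least-squares solution [C; D]
C   = cd(1);   D = cd(2);
SSE = sum( (y - X*cd).^2 );      % sum of squared errors for this w
```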
One algorithm for ‘adjusting’ $\,\omega\,$ is based on the following result from multivariable calculus:
Let $\,h\,$ be a function from $\,\Bbb R^m\,$ into $\,\Bbb R\,,$ and let $\,\boldsymbol{\rm x} = (x_1,\ldots,x_m)\,$ be an element of $\,\Bbb R^m\,.$
If $\,h\,$ has continuous first partials in a neighborhood of $\,\boldsymbol{\rm x}\,,$ then the function $\,\triangledown h\ :\ \Bbb R^m\rightarrow\Bbb R^m\,$ defined by
$$ \triangledown h(x_1,\ldots,x_m) := \bigl( \frac{\partial h}{\partial x_1}(\boldsymbol{\rm x}),\ldots, \frac{\partial h}{\partial x_m}(\boldsymbol{\rm x}) \bigr) $$has the following property:
At $\,\boldsymbol{\rm x}\,,$ the function $\,h\,$ increases most rapidly in the direction of the vector $\,\triangledown h(\boldsymbol{\rm x})\,$ (the rate of change is then $\,\|\triangledown h(\boldsymbol{\rm x})\|\,$), and decreases most rapidly in the direction of $\,-\triangledown h(\boldsymbol{\rm x})\,$ (the rate of change is then $\,-\|\triangledown h(\boldsymbol{\rm x})\|\,$).
The function $\,\triangledown h\,$ is called the gradient of $\,h\,.$
This theorem will be used to ‘adjust’ parameter(s) so that the function $\,SSE\,$ will decrease most rapidly.
- Let $\,\omega\,$ be a current parameter value.
-
Compute $\,\triangledown SSE(\omega)\,.$ One must move in the direction of $\,-\triangledown SSE(\omega)\,$ to decrease $\,SSE\,$ most rapidly. Thus, let $\,\epsilon\,$ be a small positive number, and compute:
$$ \omega_{\text{new}} = \omega - \epsilon\cdot\frac{\triangledown SSE(\omega)}{\|\triangledown SSE(\omega)\|} $$
When $\,\|\triangledown SSE(\omega)\|\,$ is large, the function $\,SSE\,$ is changing rapidly at $\,\omega\,$; when $\,\|\triangledown SSE(\omega)\|\,$ is small, the function $\,SSE\,$ is changing slowly at $\,\omega\,.$
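With a single nonlinear parameter $\,\omega\,,$ the update above amounts to moving a distance $\,\epsilon\,$ opposite the sign of $\,\frac{dSSE}{d\omega}\,.$ A sketch of one such step, assuming a function handle sse(w) that returns the sum of squared errors for a given $\,\omega\,$ (the handle is hypothetical; it could, for instance, re-fit $\,C\,$ and $\,D\,$ before computing the errors):

```matlab
% One gradient step on omega, using a finite-difference derivative of SSE.
h     = 1e-6;
grad  = ( sse(w + h) - sse(w) ) / h;     % approximate dSSE/dw
epsil = 0.05;                            % step size chosen by the analyst
if grad ~= 0
    w = w - epsil * grad / abs(grad);    % move opposite the gradient
end
```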
The MATLAB implementation given in this section allows the user to choose the value of $\,\epsilon\,.$ ‘Optimal’ values of $\,\epsilon\,$ will depend on both the function $\,SSE\,,$ and the current value of $\,\omega\,,$ as suggested by the sketches below (where, for convenience, $\,SSE\,$ is a function of one variable only).
These sketches also illustrate how ‘bouncing’ can occur, and how $\,\omega\,$ can converge to a non-optimal minimum. In general, experience and experimentation by the analyst will play a role in the success of this technique.

Far away from minimum; large $\,\epsilon\,$ will speed process initially

Convergence to ‘wrong’ minimum

‘Bouncing’
The previous theorem is now applied to a general nonlinear least-squares problem. MATLAB implementation, for a special case of the situation discussed here, will follow.
Let $\,f_1,f_2,\ldots,f_m\,$ be functions that depend on time, and on parameters $\,\boldsymbol{\rm c} := (c_1,\ldots,c_p)\,.$ That is, $\,f_i = f_i(t,\boldsymbol{\rm c})\,$ for $\,i = 1,\ldots,m\,.$ The dependence on any of the inputs may be nonlinear.
Define a function $\,f\,$ that depends on time, the parameters $\,\boldsymbol{\rm c}\,,$ and parameters $\,\boldsymbol{\rm b} = (b_1,\ldots,b_m)\,,$ via:
$$ f(t,\boldsymbol{\rm c},\boldsymbol{\rm b}) = b_1f_1(t,\boldsymbol{\rm c}) + \cdots + b_mf_m(t,\boldsymbol{\rm c}) $$Thus, $\,f\,$ is linear with respect to the parameters in $\,\boldsymbol{\rm b}\,,$ but not, in general, with respect to the parameters in $\,\boldsymbol{\rm c}\,.$
Given a data set $\,\{(t_i,y_i)\}_{i=1}^N\,,$ let $\,\boldsymbol{\rm t} = (t_1,\ldots,t_N)\,$ and $\,\boldsymbol{\rm y} = (y_1,\ldots,y_N)\,.$ The sum of the squared errors between $\,f(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b})\,$ and $\,y_i\,$ is
$$ SSE = \sum_{i=1}^N \bigl( y_i - f(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b}) \bigr)^2\,, $$which has partials
$$ \frac{\partial SSE}{\partial c_j} = \sum_{i=1}^N 2\bigl( y_i - f(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b}) \bigr) \bigl( -\frac{\partial f}{\partial c_j}(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b}) \bigr)\,, $$for $\,j = 1,\ldots,p\,.$ Note that:
$$ \begin{align} &\frac{\partial f}{\partial c_j}(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b})\cr &\quad = b_1\frac{\partial f_1}{\partial c_j}(t_i,\boldsymbol{\rm c}) + \cdots + b_m\frac{\partial f_m}{\partial c_j}(t_i,\boldsymbol{\rm c}) \end{align} $$Then, $\,\triangledown SSE = (\frac{\partial SSE}{\partial c_1},\ldots, \frac{\partial SSE}{\partial c_p})\,. $
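These formulas can also be collected in matrix form. If $\,\boldsymbol{\rm r}\in\Bbb R^N\,$ denotes the residual vector with entries $\,r_i = y_i - f(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b})\,,$ and $\,J\in\Bbb R^{N\times p}\,$ the matrix of partials with entries $\,J_{ij} = \frac{\partial f}{\partial c_j}(t_i,\boldsymbol{\rm c},\boldsymbol{\rm b})\,$ (notation introduced here for convenience), then
$$ \triangledown SSE = -2\,J^{\rm T}\boldsymbol{\rm r}\ , $$which is a convenient form for implementation.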
The algorithm:
- Choose an initial value of $\,(c_1,\ldots,c_p)\,.$
- Use the methods in Section $2.2$ to find corresponding optimal values of $\,(b_1,\ldots,b_m)\,.$
-
Adjust:
$$ (c_1,\ldots,c_p)_{\text{new}} = (c_1,\ldots,c_p) - \epsilon\cdot \frac{\triangledown SSE}{\|\triangledown SSE\|} $$The analyst may need to adjust $\,\epsilon\,$ based on an analysis of the values of $\,SSE\,.$ If $\,SSE\,$ is large, but decreasing slowly, make $\,\epsilon\,$ larger. If $\,SSE\,$ is ‘bouncing’, make $\,\epsilon\,$ smaller.
- Repeat with the new values in $\,\boldsymbol{\rm c}\,.$
MATLAB IMPLEMENTATION: Nonlinear Least-Squares, Gradient Method
The following MATLAB function uses the gradient method discussed in this section to fit data with a sum of the form:
$$ f(t) = A + Bt + \sum_{i=1}^m \bigl( C_i\sin\omega_i t + D_i\cos\omega_i t \bigr) $$The fundamental period of $\,\sin\omega_i t\,$ is $\,P_i := \frac{2\pi}{\omega_i}\,.$ To use the function, type:
[E,f] = nonlin(P,D,epsil,tol);
The required inputs are:
- P $\,= [P_1\cdots P_m]\,$ is a matrix containing the initial estimates for the unknown periods. P may be a row or a column matrix.
- D is an $\,N\times 2\,$ matrix containing the data set $\,\{(t_i,y_i)\}_{i=1}^N\,.$ The time values are in the first column; the corresponding data values are in the second column.
-
epsil is used to control the search in parameter space. Indeed:
$$ (P_1,\ldots,P_m)_{\text{new}} = (P_1,\ldots,P_m) - \epsilon \frac{\triangledown SSE}{\|\triangledown SSE\|} $$
- tol is used to help determine when the algorithm stops. The algorithm will stop when either $\,25\,$ iterations have been made, or when $\,SSE \lt\,$ tol.
The function returns the following information:
-
E is a matrix containing each iteration of ‘best’ parameters and the associated error. Each row of E is of the form:
$$ \begin{gather} A\ \ B\ \ P_1\ \ C_1\ \ D_1\ \ P_2\ \ C_2\ \ D_2\cr \cdots\ \ P_m\ \ C_m\ \ D_m\ \ SSE \end{gather} $$
- f contains the values of the approximation to the data, computed with the parameter values in the last row of the matrix E at the time values in the first column of D. With t = D(:,1) and y = D(:,2), the command plot(t,y,'x',t,f,'.') can then be used to compare the actual data set with the approximation.
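A typical call might look like the following sketch (the synthetic data, period guess, step size, and tolerance are purely illustrative):

```matlab
% Sketch: fit synthetic data containing one sinusoid plus a linear trend.
t = (0:0.1:10)';
y = 2 + 0.3*t + 1.5*sin(2*pi*t/3.1) - 0.8*cos(2*pi*t/3.1) + 0.1*randn(size(t));
D = [t y];                          % N-by-2 data matrix
P = 3;                              % initial guess for the single period
[E, f] = nonlin(P, D, 0.05, 1e-3);  % epsil = 0.05, tol = 1e-3
plot(t, y, 'x', t, f, '.')          % compare data with fitted values
```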
The MATLAB function nonlin is given next.
[Listing of the MATLAB function nonlin]
For Octave, the line
X(:,1) = ones(t);
must be changed to:
X(:,1) = ones(size(t));
Also, endfunction must be added as the last line.
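For orientation, here is a minimal sketch of a function with the interface described above; it is not the original listing. The gradient of $\,SSE\,$ with respect to the periods is approximated by finite differences (with the linear coefficients held fixed, as in the formulas of this section), and the subfunction design is introduced here purely for illustration.

```matlab
function [E, f] = nonlin(P, D, epsil, tol)
% Sketch (not the original listing) of a gradient-method fit of
%   f(t) = A + B*t + sum_i ( C_i*sin(w_i*t) + D_i*cos(w_i*t) ),  w_i = 2*pi/P_i .
% P     : initial period estimates (row or column)
% D     : N-by-2 data matrix [t y]
% epsil : step size for the gradient adjustment of the periods
% tol   : stop when SSE < tol (at most 25 iterations)

P = P(:)';                          % row of periods
t = D(:,1);   y = D(:,2);
m = length(P);
E = [];

for iter = 1:25
    X   = design(t, P);             % columns: 1, t, sin(w_i t), cos(w_i t)
    b   = X \ y;                    % linear least squares: A, B, C_i, D_i
    SSE = sum( (y - X*b).^2 );

    row = [b(1), b(2)];             % A, B
    for i = 1:m
        row = [row, P(i), b(2*i+1), b(2*i+2)];   % P_i, C_i, D_i
    end
    E = [E; row, SSE];

    if SSE < tol, break, end

    g = zeros(1, m);                % finite-difference estimate of grad SSE
    h = 1e-6;                       % (linear coefficients held fixed)
    for i = 1:m
        Ph = P;  Ph(i) = Ph(i) + h;
        Xh = design(t, Ph);
        g(i) = ( sum((y - Xh*b).^2) - SSE ) / h;
    end

    if norm(g) > 0                  % gradient step on the periods
        P = P - epsil * g / norm(g);
    end
end

f = X*b;                            % fitted values for the last row of E
end

function X = design(t, P)
% Design matrix for fixed periods P (helper introduced for this sketch).
m = length(P);
X = zeros(length(t), 2 + 2*m);
X(:,1) = ones(size(t));
X(:,2) = t;
for i = 1:m
    w = 2*pi / P(i);
    X(:, 2*i+1) = sin(w*t);
    X(:, 2*i+2) = cos(w*t);
end
end
```

Solving the linear step with the backslash operator avoids forming the normal equations explicitly, and the finite-difference gradient keeps the sketch short; an analytic gradient, as derived earlier in this section, could be used instead.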