Last time^{[1]} we took a look at linear regression, which finds the linear function that minimises the differences between its results and the values at a set of points that are presumed, possibly after applying some specified transformation, to be random deviations from a straight line or, in multiple dimensions, a flat plane. The purpose was to reveal the underlying relationship between the independent variable represented by the points and the dependent variable represented by the values at them.
This time we shall see how we can approximate the function that defines the relationship between them without actually revealing what it is.
Great Expectations
Given a pair of independent random variables \(X\) and \(Y\), we have the expectations
\[
E[X+Y] = E[X] + E[Y]
\]
and
\[
\begin{align*}
E\left[(X+Y - E[X+Y])^2\right]
&= E\left[((X-E[X])+(Y-E[Y]))^2\right]\\
&= E\left[(X-E[X])^2 + 2 \times (X-E[X]) \times (Y-E[Y]) + (Y-E[Y])^2\right]\\
&= E\left[(X-E[X])^2\right] + 2 \times E[X-E[X]] \times E[Y-E[Y]] + E\left[(Y-E[Y])^2\right]\\
&= E\left[(X-E[X])^2\right] + E\left[(Y-E[Y])^2\right]
\end{align*}
\]
meaning that means and variances of sums of independent random variables accumulate additively. The sum of \(n\) independent observations of a random variable with a mean of \(\mu\) and a variance of \(\sigma^2\) must therefore have a mean of \(n \times \mu\) and a variance of \(n \times \sigma^2\). We also have
\[
\begin{align*}
E\left[\frac{X}{n}\right] &= \frac{E[X]}{n}\\
E\left[\left(\frac{X}{n}-E\left[\frac{X}{n}\right]\right)^2\right] &= \frac{E\left[(X-E[X])^2\right]}{n^2}
\end{align*}
\]
so that their average has a mean of \(\mu\) and a variance of \(\frac1n \sigma^2\). This reduction of the variance together with the preservation of the mean suggests that averaging might be an effective way of reducing the size of the random errors in the values at the points. Specifically, to estimate the value at a point we can use the average of the values at points that are close to it, with the assumption that the noise has the same distribution everywhere with a mean of zero and that the average of the underlying function at those nearby points is close to the value at the point of interest.
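By way of a concrete check, which is mine rather than anything from the post, we can enumerate the joint outcomes of a pair of small independent discrete distributions, made up purely for the purpose, and confirm that the means and variances of their sum add, and that the average of two independent observations keeps the mean and halves the variance:

```javascript
// Not from the post: enumerate joint outcomes of two independent discrete
// distributions and confirm that means and variances accumulate additively.
function moments(outcomes) {
 // outcomes is an array of {x, p} pairs whose probabilities sum to one
 var mean = 0, variance = 0;
 outcomes.forEach(function(o) {mean += o.p * o.x;});
 outcomes.forEach(function(o) {variance += o.p * (o.x - mean) * (o.x - mean);});
 return {mean: mean, variance: variance};
}

var X = [{x: 1, p: 0.5}, {x: 3, p: 0.5}];
var Y = [{x: 0, p: 0.25}, {x: 2, p: 0.75}];

// the distribution of X+Y under independence
var XplusY = [];
X.forEach(function(xo) {
 Y.forEach(function(yo) {XplusY.push({x: xo.x + yo.x, p: xo.p * yo.p});});
});

var mX = moments(X), mY = moments(Y), mXY = moments(XplusY);
console.log(mXY.mean, mX.mean + mY.mean);             // equal
console.log(mXY.variance, mX.variance + mY.variance); // equal

// the average of two independent copies of X has the same mean as X
// and half its variance
var avgXX = [];
X.forEach(function(a) {
 X.forEach(function(b) {avgXX.push({x: (a.x + b.x) / 2, p: a.p * b.p});});
});
var mAvg = moments(avgXX);
console.log(mAvg.mean, mX.mean);             // equal
console.log(mAvg.variance, mX.variance / 2); // equal
```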
For example, if the underlying function is \(f(x)\) and the error is \(\epsilon_x\) then the value at a point \(x\) will be
\[
\hat{f}(x) = f(x) + \epsilon_x
\]
and the average of the values at \(x_0\) and \(x_1\) is
\[
\frac{\hat{f}(x_0)+\hat{f}(x_1)}{2} = f\left(\frac{x_0+x_1}{2}\right) + \frac{\epsilon_{x_0}+\epsilon_{x_1}}{2} + \sum_{i=1}^\infty \frac{1}{(2i)!} \times \left(\frac{x_1-x_0}{2}\right)^{2i} \times f^{(2i)}\left(\frac{x_0+x_1}{2}\right)
\]
where \(\sum\) is the summation sign, the exclamation mark stands for the factorial and \(f^{(n)}(x)\) is the \(n^\mathrm{th}\) derivative of \(f\) at \(x\), as proven in derivation 1.
Derivation 1: The Average Value
First, define
\[
\begin{align*}
\bar{x} &= \frac{x_0+x_1}{2} &
\delta &= \frac{x_1-x_0}{2}
\end{align*}
\]
so that
\[
\begin{align*}
\bar{x} - \delta &= x_0 &
\bar{x} + \delta &= x_1
\end{align*}
\]
Next, from Taylor's theorem we have
\[
f(\bar{x}+\delta) = \sum_{i=0}^\infty \frac{1}{i!} \times \delta^i \times f^{(i)}(\bar{x})
\]
and so
\[
\begin{align*}
f(\bar{x}-\delta) + f(\bar{x}+\delta)
&= \sum_{i=0}^\infty \frac{1}{i!} \times (-\delta)^i \times f^{(i)}(\bar{x}) + \sum_{i=0}^\infty \frac{1}{i!} \times \delta^i \times f^{(i)}(\bar{x})\\
&= 2 \times \sum_{i=0}^\infty \frac{1}{(2i)!} \times \delta^{2i} \times f^{(2i)}(\bar{x})
\end{align*}
\]
Finally, the average of the values at the points is therefore
\[
\begin{align*}
\frac{\hat{f}(\bar{x}-\delta)+\hat{f}(\bar{x}+\delta)}{2}
&= \frac{f(\bar{x}-\delta)+\epsilon_{\bar{x}-\delta}+f(\bar{x}+\delta)+\epsilon_{\bar{x}+\delta}}{2}\\
&= \frac{\epsilon_{\bar{x}-\delta}+\epsilon_{\bar{x}+\delta}}{2} + \frac{f(\bar{x}-\delta)+f(\bar{x}+\delta)}{2}\\
&= \frac{\epsilon_{\bar{x}-\delta}+\epsilon_{\bar{x}+\delta}}{2} + \sum_{i=0}^\infty \frac{1}{(2i)!} \times \delta^{2i} \times f^{(2i)}(\bar{x})\\
&= \frac{\epsilon_{\bar{x}-\delta}+\epsilon_{\bar{x}+\delta}}{2} + f(\bar{x}) + \sum_{i=1}^\infty \frac{1}{(2i)!} \times \delta^{2i} \times f^{(2i)}(\bar{x})
\end{align*}
\]
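As a quick numerical sanity check of derivation 1, which is not part of the original exposition, we can compare the noise-free average of \(\cos\) at a pair of points with its value at their midpoint plus the first couple of even-order Taylor terms; the choices of \(\bar{x}\) and \(\delta\) are arbitrary.

```javascript
// Not part of the derivation: a noise-free check that the average of cos at
// two points equals its value at their midpoint plus even-order corrections.
var f = Math.cos;
var xbar = 0.7, delta = 0.1; // arbitrary choices

var average = (f(xbar - delta) + f(xbar + delta)) / 2;

// the first two correction terms, using f''(x) = -cos(x) and f''''(x) = cos(x)
var series = f(xbar)
           + (1/2)  * Math.pow(delta, 2) * (-Math.cos(xbar))
           + (1/24) * Math.pow(delta, 4) *   Math.cos(xbar);

console.log(average - series); // of the order of delta to the sixth power
```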
If the error introduced by the sum is smaller than that eliminated by averaging the noise, which will be the case if the points are close and the derivatives are small, then this is an effective way to approximate the value at their midpoint with a smaller expected error than they have themselves.
A Weighty Decision
In order to approximate the underlying function at any point, rather than just at the midpoints of the points at which we have values, we'll need a better way of calculating averages. In particular, we might expect that lending greater weight to the values at the points that are closest to the one at which we're trying to approximate the function and lesser weight to those that are furthest from it should be a reasonable way to preserve its value whilst still minimising the noise. A simple way to do this is to use a weighted average of the values at the given points \(x_i\)
\[
\bar{f}(x) = \frac{\sum_{i=0}^{n-1} w\left(x, x_i\right) \times \hat{f}(x_i)}{\sum_{i=0}^{n-1} w\left(x, x_i\right)}
\]
where the function \(w\), known as a kernel, is always positive, takes a maximum value when its arguments are equal and decreases as they get further apart. This weighted averaging is called smoothing and a frequently used kernel is the Gaussian, which is defined by
\[
w\left(x, x_i\right) = \phi_{\,0,\sigma}\left(x-x_i\right)
\]
where \(\phi_{\,0,\sigma}\) is the probability density function, or PDF, of the normal distribution with a mean of zero and a standard deviation of \(\sigma\). Program 1 shows how we can use a Gaussian smoother to approximate a cosine function with random errors, plotting the noisy values with crosses, the underlying function in green and the approximation in red.
Program 1: A Gaussian Smoother
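Since the interactive program can't be reproduced in this text, the following self-contained sketch, which is mine rather than the ak library's code, performs the same kind of Gaussian-weighted averaging upon noisy samples of a cosine; the sample spacing, the hand-picked errors and the choice of \(\sigma\) are all assumptions made for illustration.

```javascript
// A sketch of Gaussian kernel smoothing; ak.kernelSmooth wraps the same
// calculation with argument checking and state management.
function gaussianKernel(sigma) {
 return function(x, xi) {
  var z = (x - xi) / sigma;
  return Math.exp(-0.5 * z * z); // the PDF's normalising constant cancels in the ratio
 };
}

function kernelSmooth(nodes, kernel) {
 return function(x) {
  var y0 = 0, y1 = 0;
  nodes.forEach(function(node) {
   var w = kernel(x, node.x);
   y0 += w * node.y;
   y1 += w;
  });
  return y0 / y1;
 };
}

// noisy samples of cos(x) with fixed, hand-picked errors
var errors = [0.05, -0.03, 0.08, -0.06, 0.02, -0.04, 0.07, -0.01, 0.03, -0.05];
var nodes = [];
for (var i = 0; i < 10; ++i) {
 nodes.push({x: 0.2 * i, y: Math.cos(0.2 * i) + errors[i]});
}

var smooth = kernelSmooth(nodes, gaussianKernel(0.3));
console.log(smooth(0.9), Math.cos(0.9)); // the smoothed value tracks the cosine
```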



The effectiveness of the Gaussian smoother strongly depends upon the choice of \(\sigma\); too small and the noise will have a significant influence upon its results, too great and the values at distant points will do so, as demonstrated by program 2.
Program 2: The Effect Of The Standard Deviation



Unfortunately, the optimal choice depends upon both the unknown distribution of the errors and the unknown underlying relationship between the points and their true values. One way to think about it is in terms of the density of points
\[
\rho = \frac{n}{\max(x_i) - \min(x_i)}
\]
We can consequently define \(\sigma\) according to the average number of points \(k\) that we want to fall within a distance of \(\sigma\) on either side of the Gaussian kernel's midpoint with
\[
\begin{align*}
k &= 2 \times \sigma \times \rho\\
\sigma &= \frac{k}{2 \times \rho}
\end{align*}
\]
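As a sketch of this rule, and not code from the post, we might pick \(\sigma\) from an array of scalar points and a desired neighbour count with

```javascript
// A sketch: choose sigma so that on average k of the scalar points xs fall
// within one standard deviation either side of the kernel's midpoint.
function sigmaFromNeighbours(xs, k) {
 var lo = Math.min.apply(null, xs);
 var hi = Math.max.apply(null, xs);
 var rho = xs.length / (hi - lo); // the density of points
 return k / (2 * rho);
}

// ten unit-spaced points have a density of 10/9, so k=4 yields sigma=1.8
var xs = [];
for (var i = 0; i < 10; ++i) xs.push(i);
console.log(sigmaFromNeighbours(xs, 4));
```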
Note that the value of \(\sigma\) in program 1 corresponds to a choice of ten points. Another way to define a kernel in terms of the number of points that contribute to its value is to take the average of the values at the \(k\) of the smoother's points that are nearest to \(x\), as illustrated by program 3.
Program 3: A Nearest Neighbour Smoother



This is mathematically equivalent to using the kernel
\[
w\left(x, x_i\right) =
\begin{cases}
\frac{1}{k} & x_i \in N_k(x)\\
0 & \text{otherwise}
\end{cases}
\]
where \(N_k(x)\) is the set of the \(k\) nearest neighbouring points of \(x\). There's nothing about a kernel smoother that requires the points to be scalars, provided that the kernel can calculate the scalar weight for the argument of the smoother and each of its points. In fact the same is true of the values, provided that they support the required arithmetic operations, although for the sake of efficiency and simplicity our implementation will assume that the values at the points are numbers.
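Before looking at the implementation, it's perhaps worth sketching the nearest neighbour average directly; this is my own illustration rather than the library's code, sorting the points by their distance from the argument and averaging the values of the \(k\) closest.

```javascript
// A direct k nearest neighbour smoother: sort the points by distance from
// the argument and average the values of the k closest.
function nearestNeighbourSmooth(nodes, k) {
 return function(x) {
  var byDistance = nodes.slice().sort(function(a, b) {
   return Math.abs(x - a.x) - Math.abs(x - b.x);
  });
  var sum = 0;
  for (var i = 0; i < k; ++i) sum += byDistance[i].y;
  return sum / k;
 };
}

var nodes = [{x:0, y:1}, {x:1, y:3}, {x:2, y:5}, {x:3, y:7}, {x:4, y:9}];
var smooth = nearestNeighbourSmooth(nodes, 3);
console.log(smooth(2)); // → 5, the average of the values at x = 1, 2 and 3
console.log(smooth(0)); // → 3, the average of the values at x = 0, 1 and 2
```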
ak.kernelSmooth
The ak.kernelSmooth function defined in listing 1 uses JavaScript's built in arithmetic operators, rather than our overloads, to calculate the numerator and denominator of the smoothed result, which is why we require that the points' values are numbers. The points and their values are stored in the nodes member of the state object, and the kernel is stored in its kernel member.
Listing 1: ak.kernelSmooth
ak.kernelSmooth = function() {
 var state = {nodes:[], kernel:undefined};
 var arg0 = arguments[0];
 var f;

 constructors[ak.nativeType(arg0)](state, arg0, arguments);

 f = function(x) {
  var n = state.nodes.length;
  var y0 = 0;
  var y1 = 0;
  var w;

  while(n-->0) {
   w = state.kernel(x, state.nodes[n].x);
   y0 += w * state.nodes[n].y;
   y1 += w;
  }
  return y0/y1;
 };
 f.nodes = function() {return copyNodes(state.nodes);};
 f.kernel = function() {return state.kernel;};

 return Object.freeze(f);
};

var constructors = {};
As usual, these are initialised by dispatching to a constructors object, which in this case supports initialisation with an array of points, an array of values, or an array of node objects containing a point x and a value y, followed by the kernel f, as shown by listing 2.
Listing 2: The Constructors
constructors[ak.ARRAY_T] = function(state, arg0, args) {
 var arg1 = args[1];
 constructors[ak.ARRAY_T][ak.nativeType(arg1)](state, arg0, arg1, args);
};

constructors[ak.ARRAY_T][ak.ARRAY_T] = function(state, arg0, arg1, args) {
 var arg2 = args[2];
 constructors[ak.ARRAY_T][ak.ARRAY_T][ak.nativeType(arg2)](state, arg0, arg1, arg2);
};

constructors[ak.ARRAY_T][ak.ARRAY_T][ak.FUNCTION_T] = function(state, x, y, f) {
 var n = x.length;
 var i, xi, yi;

 if(y.length!==n) throw new Error('size mismatch in ak.kernelSmooth');

 state.nodes.length = n;
 for(i=0;i<n;++i) {
  xi = x[i];
  yi = Number(y[i]);
  if(!isFinite(yi)) throw new Error('invalid node value in ak.kernelSmooth');
  state.nodes[i] = {x:xi, y:yi};
 }
 state.kernel = f;
};

constructors[ak.ARRAY_T][ak.FUNCTION_T] = function(state, nodes, f) {
 var n = nodes.length;
 var i, x, y;

 state.nodes.length = n;
 for(i=0;i<n;++i) {
  x = nodes[i].x;
  if(ak.nativeType(x)===ak.FUNCTION_T) x = x();
  y = nodes[i].y;
  if(ak.nativeType(y)===ak.FUNCTION_T) y = y();
  y = Number(y);
  if(!isFinite(y)) throw new Error('invalid node value in ak.kernelSmooth');
  state.nodes[i] = {x:x, y:y};
 }
 state.kernel = f;
};
Finally, when returning the nodes from the nodes property accessor we need to make a copy of those held in the state, which is done with the copyNodes helper function defined in listing 3.
Listing 3: Copying The Nodes
function copyNodes(nodes) {
 var n = nodes.length;
 var copy = new Array(n);
 var i;
 for(i=0;i<n;++i) copy[i] = {x:nodes[i].x, y:nodes[i].y};
 return copy;
}
Program 4 demonstrates how to create a Gaussian smoother using ak.kernelSmooth, which you can find in KernelSmooth.js.
Program 4: A Gaussian ak.kernelSmooth



Using the nearest neighbour kernel would be rather inefficient if we were to calculate the neighbourhood \(N_k(x)\) every time the kernel is called. Fortunately, we can improve matters by only calculating it when the argument changes, as is done by the kernel function defined in listing 4.
Listing 4: A Nearest Neighbour Kernel
var n = nodes.length;
var x0 = ak.NaN;
var xd = new Array(n);

var update = function(x) {
 var i;
 x0 = x;
 for(i=0;i<n;++i) xd[i] = {x:nodes[i].x, d:Math.abs(x-nodes[i].x)};
 xd.sort(function(xd0, xd1) {return xd0.d-xd1.d;});
 return xd.slice(0, k);
};

var kernel, nkx;

kernel = function(x, xi) {
 if(x!==x0) nkx = update(x);
 return nkx.some(function(njx){return njx.x===xi;}) ? 1/k : 0;
};
Program 5 illustrates the results of ak.kernelSmooth using this kernel.
Program 5: A Neighbourhood ak.kernelSmooth



Whilst this is admittedly still a lot less efficient than explicitly taking the average values of the \(k\) nearest neighbours and, for scalar points at least, could be improved by sorting the nearest neighbour points and using a binary search to look them up, it's not a wholly unreasonable way to implement a nearest neighbour smoother.
One way to combine the ideas behind nearest neighbour and weighted smoothers is with a moving weighted average. With these we limit the points that contribute to the approximation at a given point to those that lie within a certain distance of it, which is the moving aspect, lending greatest weight to the closest of them and least to the furthest, which is the weighted aspect.
For example, to restrict the smoother's points to those that are within a distance of \(\delta\) of its argument and base their weights upon the normal PDF, program 6 uses the kernel
\[
w\left(x, x_i\right) =
\begin{cases}
\phi_{\,0,\sigma}\left(x-x_i\right)-\phi_{\,0,\sigma}(\delta) & \left|x-x_i\right| < \delta\\
0 & \text{otherwise}
\end{cases}
\]
where the vertical bars stand for the absolute value of the term between them and the offset of \(\phi_{\,0,\sigma}(\delta)\) ensures that it is continuous at \(x\) equal to \(x_i \pm \delta\) since \(\phi_{\,0,\sigma}\) is symmetric about zero and so
\[
\phi_{\,0,\sigma}\left(x_i \pm \delta - x_i\right)-\phi_{\,0,\sigma}(\delta)
= \phi_{\,0,\sigma}\left(\pm\delta\right)-\phi_{\,0,\sigma}(\delta)
= \phi_{\,0,\sigma}(\delta)-\phi_{\,0,\sigma}(\delta)
= 0
\]
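A sketch of this kernel, again mine rather than program 6's code, together with a check of its behaviour at the cut-off, might look like

```javascript
// A sketch of the truncated Gaussian kernel: the offset of phi(delta) makes
// it vanish continuously at the cut-off. sigma and delta are arbitrary.
function movingKernel(sigma, delta) {
 var norm = 1 / (sigma * Math.sqrt(2 * Math.PI));
 function phi(z) {return norm * Math.exp(-0.5 * (z / sigma) * (z / sigma));}

 return function(x, xi) {
  return Math.abs(x - xi) < delta ? phi(x - xi) - phi(delta) : 0;
 };
}

var w = movingKernel(0.5, 1.5);
console.log(w(0, 0) > 0);        // true: maximal at coincident arguments
console.log(w(0, 1.5), w(0, 2)); // 0 0: vanishes at and beyond the cut-off
```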
Program 6: A Moving Weighted Average Smoother



In the next post we shall see how we can fit an arbitrary scalar valued functional form that depends upon a set of controlling parameters to a set of observed values.