Moments Of Pathological Behaviour

Last time[1] we took a look at basis function interpolation with which we approximate functions from their values at given sets of arguments, known as nodes, using weighted sums of distinct functions, known as basis functions. We began by constructing approximations using polynomials before moving on to using bell shaped curves, such as the normal probability density function, centred at the nodes. The latter are particularly useful for approximating multi-dimensional functions, as we saw by using multivariate normal PDFs.
An easy way to create rotationally symmetric functions, known as radial basis functions, is to apply univariate functions that are symmetric about zero to the distance between the interpolation's argument and their associated nodes. PDFs are a rich source of such functions and, in fact, the second bell shaped curve that we considered is related to that of the Cauchy distribution, which has some rather interesting properties.

A Clear Case Of Projection

Figure 1: Projecting To The Intercept
If we have a point $$(\mu, \sigma)$$ in the upper half of the plane, meaning that $$\sigma$$ is positive, and we project a line from it at some angle $$\theta$$ to the $$y$$ axis between $$-\frac\pi{2}$$ and $$+\frac\pi{2}$$ until it intercepts the $$x$$ axis at another point $$(X, 0)$$, as illustrated by figure 1, then if $$\theta$$ is uniformly distributed $$X$$ is Cauchy distributed.
Now, by the rules of trigonometry, we have
$\tan\theta = \frac{X-\mu}{\sigma}$
and so
$X = \mu + \sigma \times \tan\theta$
or, equivalently, if $$U$$ is a standard uniform random variable
$X = \mu + \sigma \times \tan \left(U \times \pi - \tfrac{\pi}{2}\right)$
This is a monotonically increasingly relationship, in the sense that every value of $$U$$ results in a unique value of $$X$$ and that greater values of the former result in greater values of the latter. We can consequently invert it with
$U = \frac{1}{\pi} \times \arctan\left(\frac{X - \mu}{\sigma}\right) + \frac12$
and it must therefore be the case that
$\Pr(X \leqslant x) = \Pr\left(U \leqslant \frac{1}{\pi} \times \arctan\left(\frac{x - \mu}{\sigma}\right) + \frac12\right)$
which gives us the Cauchy cumulative distribution function
Derivation 1: The Characteristic Function
We can recover the PDF from the CF with the inverse Fourier transform
$p_{\mu,\sigma}(x) = \frac{1}{2\pi} \int_{-\infty}^\infty e^{-itx} \hat{p}_{\mu,\sigma}(t) \, \mathrm{d}t = \frac{1}{2\pi} I$
where $$I$$ is the integral
\begin{align*} I &= \int_{-\infty}^\infty e^{-itx} \times e^{i \mu t - \sigma |t|} \, \mathrm{d}t = \int_{-\infty}^\infty e^{-it(x-\mu) - \sigma |t|} \, \mathrm{d}t\\ &= \int_{-\infty}^0 e^{-it(x-\mu) + \sigma t} \, \mathrm{d}t + \int_0^\infty e^{-it(x-\mu) - \sigma t} \, \mathrm{d}t\\ &= \left[\frac{e^{-it(x-\mu) + \sigma t}}{-i(x-\mu)+\sigma}\right]_{-\infty}^0 + \left[\frac{e^{-it(x-\mu) - \sigma t}}{-i(x-\mu)-\sigma}\right]_0^\infty \end{align*}
Since the complex terms in the exponentials result in sinusoids whose magnitudes are bounded by one and $$\sigma$$ is positive so that their real terms yield zero at minus infinity, we have
\begin{align*} I &= \left(\frac{1}{-i(x-\mu)+\sigma} - 0\right) + \left(0 - \frac{1}{-i(x-\mu)-\sigma}\right)\\ &= \frac{(-i(x-\mu)-\sigma) - (-i(x-\mu)+\sigma)}{(-i(x-\mu)+\sigma) \times (-i(x-\mu)-\sigma)}\\ &= \frac{-2\sigma}{-(x-\mu)^2 -\sigma^2} = \frac{2}{\sigma\left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)} \end{align*}
and
$\frac{1}{2\pi} I = \frac{1}{\pi\sigma\left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)} = p_{\mu,\sigma}(x)$
as required.
$P_{\mu,\sigma}(x) = \frac{1}{\pi} \times \arctan\left(\frac{x - \mu}{\sigma}\right) + \frac12$
This is trivially inverted with
$P^{-1}_{\mu,\sigma}(u) = \mu + \sigma \times \tan \left(u \times \pi - \tfrac{\pi}{2}\right)$
which matches the definition of its random variable, and differentiated by noting that
$\frac{\mathrm{d}}{\mathrm{d}x} \arctan x = \frac{1}{1+x^2}$
and applying the chain rule to yield the PDF
\begin{align*} p_{\mu,\sigma}(x) &= \frac{\mathrm{d}}{\mathrm{d}x} \left(\frac{1}{\pi} \times \arctan\left(\frac{x - \mu}{\sigma}\right) + \frac12\right)\\ &= \frac{1}{\pi} \times \frac{1}{1+\left(\frac{x - \mu}{\sigma}\right)^2} \times \frac{1}{\sigma}\\ &= \frac{1}{\pi\sigma\left(1+\left(\frac{x - \mu}{\sigma}\right)^2\right)} \end{align*}
which, for $$\mu$$ equal to zero, is just a constant factor away from the function that we considered for the construction of radial basis functions.

Finally, we have the characteristic function which is given by
$\hat{p}_{\mu,\sigma}(t) = e^{i \mu t - \sigma |t|}$
where $$i$$ is the square root of minus one, as proven in derivation 1.

Note that much like we do with the normal distribution, we call the Cauchy distribution having $$\mu$$ equal to zero and $$\sigma$$ equal to one the standard Cauchy distribution.

The ak Cauchy Distribution Functions

The implementation of the Cauchy distribution's functions follows our usual scheme of initialising their parameters by dispatching to a constructors object, as done by ak.cauchyRnd in listing 1.

Listing 1: ak.cauchyRnd
function rnd(m, s, rnd) {
return m + s*Math.tan(rnd()*ak.PI - ak.PI/2);
}

ak.cauchyRnd = function() {
var state = {mu: 0, sigma: 1, rnd: Math.random};
var arg0  = arguments[0];
var f;

constructors[ak.nativeType(arg0)](state, arg0, arguments);

f = function(){return rnd(state.mu, state.sigma, state.rnd);};
f.mu = function(){return state.mu;};
f.sigma = function(){return state.sigma;};
f.rnd = function(){return state.rnd;};
return Object.freeze(f);
};

var constructors = {};


This sets defaults of zero and one for $$\mu$$ and $$\sigma$$, creates a function that uses the rnd helper function to calculate the random observations and adds property accessors to it. The constructors allow leaving $$\mu$$ and $$\sigma$$ at their default values, specifying a new value for $$\sigma$$ or new values for both $$\mu$$ and $$\sigma$$, and an optional random number generator as shown by listing 2.

Listing 2: The Constructors
constructors[ak.UNDEFINED_T] = function() {
};

constructors[ak.FUNCTION_T] = function(state, rnd) {
state.rnd = rnd;
};

constructors[ak.NUMBER_T] = function(state, x, args) {
var arg1 = args[1];
constructors[ak.NUMBER_T][ak.nativeType(arg1)](state, x, arg1, args);

if(!isFinite(state.mu)) throw new Error('invalid mu in ak.cauchy distribution');
if(state.sigma<=0 || !isFinite(state.sigma)) {
throw new Error('invalid sigma in ak.cauchy distribution');
}
};

constructors[ak.NUMBER_T][ak.UNDEFINED_T] = function(state, x) {
state.sigma = Number(x);
};

constructors[ak.NUMBER_T][ak.FUNCTION_T] = function(state, x, rnd) {
state.sigma = Number(x);
state.rnd = rnd;
};

constructors[ak.NUMBER_T][ak.NUMBER_T] = function(state, x0, x1, args) {
var arg2 = args[2];

state.mu = Number(x0);
state.sigma = Number(x1);

constructors[ak.NUMBER_T][ak.NUMBER_T][ak.nativeType(arg2)](state, arg2);
};

constructors[ak.NUMBER_T][ak.NUMBER_T][ak.UNDEFINED_T] = function() {
};

constructors[ak.NUMBER_T][ak.NUMBER_T][ak.FUNCTION_T] = function(state, rnd) {
state.rnd = rnd;
};


Program 1 plots a histogram of the results of ak.cauchyRnd.

Program 1: Using ak.cauchyRnd

The remaining distribution functions are just as easy to implement, starting with ak.cauchyPDF in listing 3.

Listing 3: ak.cauchyPDF
function pdf(m, s, x) {
x = (x-m)/s;
return 1/(ak.PI*s*(1+x*x));
}

ak.cauchyPDF = function() {
var state = {mu: 0, sigma: 1};
var arg0  = arguments[0];
var f;

constructors[ak.nativeType(arg0)](state, arg0, arguments);

f = function(x){return pdf(state.mu, state.sigma, x);};
f.mu = function(){return state.mu;};
f.sigma = function(){return state.sigma;};
return Object.freeze(f);
};


Program 2 demonstrates how the PDF changes shape for increasing values of $$\sigma$$.

Program 2: The Cauchy PDF

Next we have ak.cauchyCDF in listing 4

Listing 4: ak.cauchyCDF
function cdf(m, s, x) {
x = (x-m)/s;
return Math.atan(x)/ak.PI + 0.5;
}

ak.cauchyCDF = function() {
var state = {mu: 0, sigma: 1};
var arg0  = arguments[0];
var f;

constructors[ak.nativeType(arg0)](state, arg0, arguments);

f = function(x){return cdf(state.mu, state.sigma, x);};
f.mu = function(){return state.mu;};
f.sigma = function(){return state.sigma;};
return Object.freeze(f);
};


and ak.cauchyInvCDF in listing 5.

Listing 5: ak.cauchyInvCDF
function invCDF(m, s, c) {
return m + s*Math.tan(c*ak.PI - ak.PI/2);
}

ak.cauchyInvCDF = function() {
var state = {mu: 0, sigma: 1};
var arg0  = arguments[0];
var f;

constructors[ak.nativeType(arg0)](state, arg0, arguments);

f = function(c){return invCDF(state.mu, state.sigma, c);};
f.mu = function(){return state.mu;};
f.sigma = function(){return state.sigma;};
return Object.freeze(f);
};


Finally, ak.cauchyCF given in listing 6 implements the characteristic function

Listing 6: ak.cauchyCF
function cf(m, s, t) {
var r  = Math.exp(-s*Math.abs(t));
var mt = m*t;
return r===0 ? ak.complex(0) : ak.complex(r*Math.cos(mt), r*Math.sin(mt));
}

ak.cauchyCF = function() {
var state = {mu: 0, sigma: 1};
var arg0  = arguments[0];
var f;

constructors[ak.nativeType(arg0)](state, arg0, arguments);

f = function(t){return cf(state.mu, state.sigma, t);};
f.mu = function(){return state.mu;};
f.sigma = function(){return state.sigma;};
return Object.freeze(f);
};


using the fact that
$e^{i\theta} = \cos \theta + i \sin \theta$
to directly initialise the real and imaginary parts of our ak.complex type, which program 3 plots with the real part in black and the imaginary part in red.

Program 3: The Cauchy CF

You can find all of these functions in CauchyDistribution.js.

Some Properties Of The Cauchy Distribution

One property that the Cauchy distribution has in common with the normal distribution is that it is stable, meaning that sums of Cauchy distributed random variables are themselves Cauchy distributed. Specifically, we have
\begin{align*} X_1 &\sim Cauchy\left(\mu_1, \sigma_1\right)\\ X_2 &\sim Cauchy\left(\mu_2, \sigma_2\right)\\ X_1 + X_2 &\sim Cauchy\left(\mu_1+\mu_2, \sigma_1+\sigma_2\right) \end{align*}
where $$\sim$$ means drawn from, which follows from its characteristic function
$\hat{p}_{\mu_1,\sigma_1}(t) \times \hat{p}_{\mu_2,\sigma_2}(t) = e^{i \mu_1 t - \sigma_1 |t|} \times e^{i \mu_2 t - \sigma_2 |t|} = e^{i \left(\mu_1 + \mu_2\right) t - \left(\sigma_1 + \sigma_2\right) |t|}$
Another is that a constant multiple of a Cauchy distributed random variable is also Cauchy distributed
\begin{align*} X &\sim Cauchy\left(\mu, \sigma\right)\\ c \times X &\sim Cauchy\left(c \times \mu, |c| \times \sigma\right) \end{align*}
which follows from its definition for positive $$c$$ with
$c \times X = c \times \left(\mu + \sigma \times \tan(U \times \pi - \tfrac{\pi}{2})\right) = (c \times \mu) + (c \times \sigma) \times \tan(U \times \pi - \tfrac{\pi}{2})$
and for negative $$c$$ with
$c \times X = (c \times \mu) + (-c \times \sigma) \times \tan\left(-\left(U \times \pi - \tfrac{\pi}{2}\right)\right)$
Putting both of these together implies that the average of a set of identically distributed Cauchy random variables has exactly the same distribution as each of them
\begin{align*} X_i &\sim Cauchy\left(\mu, \sigma\right)\\ \frac{1}{n} \sum_{i=0}^{n-1} X_i &\sim Cauchy\left(\frac{1}{n} \sum_{i=0}^{n-1} \mu, \frac{1}{n} \sum_{i=0}^{n-1} \sigma\right) = Cauchy\left(\mu, \sigma\right) \end{align*}
This stands in contrast to the central limit theorem which states that the average of $$n$$ observations of a sufficiently well behaved random variable with a mean of $$\mu$$ and a standard deviation of $$\sigma$$ tends to the normal distribution $$N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$ as $$n$$ tends to infinity.
The reason for this discrepancy is that whilst the Cauchy distribution's functions don't look particularly troublesome they are, in fact, very badly behaved indeed.

Derivation 2: The Mean
Noting that
\begin{align*} f(x) &= \ln \left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)\\ f^\prime(x) &= \frac{1}{1+\left(\frac{x-\mu}{\sigma}\right)^2} \times 2 \left(\frac{x-\mu}{\sigma}\right) \times \frac{1}{\sigma}\\ &= \frac{2}{\sigma^2} \times \frac{x-\mu}{1+\left(\frac{x-\mu}{\sigma}\right)^2} \end{align*}
we can find the form of the integral with
\begin{align*} g(x) &= \frac{\sigma}{2\pi} f(x) + \mu P_{\mu,\sigma}(x)\\ g^\prime(x) &= x p_{\mu,\sigma}(x) - \mu p_{\mu,\sigma}(x) + \mu p_{\mu,\sigma}(x) = x p_{\mu,\sigma}(x) \end{align*}
so that
\begin{align*} \mathrm{E}[X] &= \int_{-\infty}^{\infty} g^\prime(x) \, \mathrm{d}x = \left[\frac{\sigma}{2\pi} f(x) + \mu P_{\mu,\sigma}(x)\right]_{-\infty}^{\infty}\\ &= \infty - \infty \end{align*}
One rather counter intuitive property of a Cauchy distribution is that it doesn't have a mean. You can be forgiven for thinking that it's obviously $$\mu$$ but, whilst that is its midpoint, or median, and its point of maximum density, or mode, it most certainly is not the mean.
The mean of a distribution is defined as the expected value of its random variable $$X$$, which for the Cauchy distribution is given by
\begin{align*} \mathrm{E}[X] &= \int_{-\infty}^{\infty} x \times p_{\mu,\sigma}(x) \, \mathrm{d}x\\ &= \int_{-\infty}^{\infty} \frac{x}{\pi \sigma \left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)} \, \mathrm{d}x \end{align*}
and results in the meaningless expression $$\infty - \infty$$, as proven by derivation 2. Furthermore, given that the variance of a random variable is defined as the expected value of the square of the difference between it and its mean $$\bar{X}$$, the Cauchy distribution's variance isn't well defined either, nor are any of its higher central moments, defined as
$\mathrm{E}[(X-\bar{X})^n]$
for positive integer $$n$$.
The central limit theorem requires that a distribution has a well defined, finite mean and standard deviation, being the square root of the variance, and since the Cauchy distribution has neither it does not apply to it.

Another way to think about the mean of a distribution is as the limit the of the mean of a sample of $$n$$ observations of its random variable as $$n$$ tends to infinity. Since the mean of $$n$$ observations of a Cauchy distributed random variable has the same distribution as it does for any given $$n$$, the limit simply doesn't have a specific value and is therefore undefined.

Derivation 3: The Second Moment About The Median
Rearranging the integrand yields
\begin{align*} \frac{\left(x-\mu\right)^2}{\pi \sigma \left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)} &= \frac{\sigma}{\pi} \times \frac{\left(x-\mu\right)^2}{\sigma^2+\left(x-\mu\right)^2}\\ &= \frac{\sigma}{\pi} \times \left(1-\frac{\sigma^2}{\sigma^2+\left(x-\mu\right)^2}\right)\\ &= \frac{\sigma}{\pi} - \frac{\sigma}{\pi} \times \frac{1}{1+\left(\frac{x-\mu}{\sigma}\right)^2}\\ &= \frac{\sigma}{\pi} - \sigma^2 p_{\mu,\sigma}(x) \end{align*}
and so
\begin{align*} \mathrm{E}\left[\left(X-\mu\right)^2\right] &= \left[\frac{\sigma x}{\pi} - \sigma^2 P_{\mu,\sigma}(x)\right]_{-\infty}^{\infty}\\ &= \infty - -\infty \end{align*}
Note that not all of the moments of the Cauchy distribution are undefined. For example, the second moment about the median is given by
\begin{align*} \mathrm{E}\left[\left(X-\mu\right)^2\right] &= \int_{-\infty}^{\infty} \left(x-\mu\right)^2 \times p_{\mu,\sigma}(x) \, \mathrm{d}x\\ &= \int_{-\infty}^{\infty} \frac{\left(x-\mu\right)^2}{\pi \sigma \left(1+\left(\frac{x-\mu}{\sigma}\right)^2\right)} \, \mathrm{d}x \end{align*}
which derivation 3 shows equates to $$\infty - -\infty$$, or in other words to infinity. Whilst this is certainly more satisfying than an undefined value from a theoretical perspective, in practice it's not really that useful.

Now because of its undefined and infinite moments, the Cauchy distribution is an example of a pathological distribution, in contrast to well behaved distributions that have the well defined means, standard deviations and higher moments that we might have naively expected all distributions to have.

Whilst the mathematics describing the pathological behaviour of the Cauchy distribution is undeniable, it doesn't necessarily give us a feeling for why it behaves that way and perhaps the best way to get that is to consider the possible outcomes of observations of it.
Specifically, noting that the probability than an observation of a random variable $$X$$ is greater than a given value $$x$$ can be derived from its CDF $$P_X$$ with
$\Pr(X>x) = 1-P_X(x)$
program 4 compares those probabilities for a standard Cauchy random variable to those of normal random variables with a range of standard deviations, plotting the former in black and the latter in red.

Program 4: Probabilities Of Exceedances

This shows that the Cauchy distribution has a relatively high probability of yielding extremely large observations compared to normal distributions; large enough, in fact, that its mean is repeatedly disturbed from any value that it might settle upon.

We can get a deeper understanding of why observations of Cauchy random variables are frequently very large through the relationship between the standard Cauchy distribution and the standard normal distribution $$N(0,1)$$
\begin{align*} Z_1 &\sim N(0,1)\\ Z_2 &\sim N(0,1)\\ \frac{Z_1}{Z_2} &\sim Cauchy(0,1) \end{align*}
as proven in derivation 4.

Derivation 4: The Ratio Of Standard Normal Random Variables
Firstly, we have
\begin{align*} \Pr\left(\frac{Z_1}{Z_2} \leqslant x\right) &= \Pr\left(Z_1 \leqslant x \times Z_2 \wedge Z_2 \geqslant 0 \right) + \Pr\left(Z_1 \geqslant x \times Z_2 \wedge Z_2 < 0\right)\\ &= \int_0^\infty \phi\left(z_2\right) \times \Pr \left(Z_1 \leqslant x \times z_2\right) \, \mathrm{d} z_2 + \int_{-\infty}^0 \phi\left(z_2\right) \times \Pr \left(Z_1 \geqslant x \times z_2\right) \, \mathrm{d} z_2\\ &= \int_0^\infty \int_{-\infty}^{x \times z_2} \phi\left(z_2\right) \times \phi\left(z_1\right) \, \mathrm{d} z_1 \, \mathrm{d} z_2 + \int_{-\infty}^0 \int_{x \times z_2}^\infty \phi\left(z_2\right) \times \phi\left(z_1\right) \, \mathrm{d} z_1 \, \mathrm{d} z_2 \end{align*}
where $$\wedge$$ means and and $$\phi$$ is the standard normal PDF.
Differentiating this yields the density
\begin{align*} \frac{\mathrm{d}}{\mathrm{d}x} \Pr\left(\frac{Z_1}{Z_2} \leqslant x\right) &= \int_0^\infty z_2 \times \phi\left(z_2\right) \times \phi\left(x \times z_2\right) \, \mathrm{d} z_2 - \int_{-\infty}^0 z_2 \times \phi\left(z_2\right) \times \phi\left(x \times z_2\right) \, \mathrm{d} z_2\\ &= \int_{-\infty}^\infty \left|z_2\right| \times \frac{1}{\sqrt{2\pi}} e^{-\frac12z_2^2} \times \frac{1}{\sqrt{2\pi}} e^{-\frac12\left(x \times z_2\right)^2} \, \mathrm{d} z_2\\ &= \frac{1}{2\pi} \int_{-\infty}^\infty \left|z_2\right| \times e^{-\frac12 \left(1+x^2\right) z_2^2} \, \, \mathrm{d} z_2 = \frac{1}{\pi} \int_0^\infty z_2 \times e^{-\frac12 \left(1+x^2\right) z_2^2} \, \, \mathrm{d} z_2 \end{align*}
Noting that
\begin{align*} f\left(z_2\right) &= -\frac{1}{1+x^2} \times e^{-\frac12 \left(1+x^2\right) z_2^2}\\ f^\prime\left(z_2\right) &= -\frac{1}{1+x^2} \times -\frac12 \left(1+x^2\right) \times e^{-\frac12 \left(1+x^2\right) z_2^2} \times 2z_2 = z_2 \times e^{-\frac12 \left(1+x^2\right) z_2^2} \end{align*}
and so
$\frac{\mathrm{d}}{\mathrm{d}x} \Pr\left(\frac{Z_1}{Z_2} \leqslant x\right) = \frac{1}{\pi} \left[-\frac{1}{1+x^2} \times e^{-\frac12 \left(1+x^2\right) z_2^2}\right]_{0}^{\infty} = \frac{1}{\pi \left(1+x^2\right)}$
which is the standard Cauchy PDF.

As a result, observations of the denominator $$Z_2$$ that are less than one amplify the observations of the numerator $$Z_1$$ and, since the former aren't all that unlikely, this explains why observations of Cauchy distributed random variables can often be so much greater than those of the normal distribution.

References

[1] All Your Basis Are Belong To Us, www.thusspakeak.com, 2020.