Time For A Chi Test

A few months ago[1] we explored the chi-squared distribution, which describes the properties of sums of squares of standard normally distributed random variables, being those that have means of zero and standard deviations of one.
Whilst I'm very much of the opinion that statistical distributions are worth describing in their own right, the chi-squared distribution also plays a pivotal role in testing whether the numbers of observations that fall into each of a set of categories are consistent with the numbers that we expect to fall into them, which we shall take a look at in this post.

The Basic Chi-Squared Test

Given $$k$$ categories with expected numbers $$e_i$$ and observed numbers $$o_i$$ such that
$\sum_{i=0}^{k-1} e_i = \sum_{i=0}^{k-1} o_i = n$
then the value
\begin{align*} \chi^2 &= \sum_{i=0}^{k-1} \frac{\left(o_i-e_i\right)^2}{e_i} = \sum_{i=0}^{k-1} \frac{o_i^2 - 2 \times o_i \times e_i + e_i^2}{e_i}\\ &= \sum_{i=0}^{k-1} \frac{o_i^2}{e_i} - 2 \times \sum_{i=0}^{k-1} \frac{o_i \times e_i}{e_i} + \sum_{i=0}^{k-1} \frac{e_i^2}{e_i}\\ &= \sum_{i=0}^{k-1} \frac{o_i^2}{e_i} - 2 \times \sum_{i=0}^{k-1} o_i + \sum_{i=0}^{k-1} e_i = \sum_{i=0}^{k-1} \frac{o_i^2}{e_i} - n \end{align*}
by an argument[2] that I haven't entirely followed nor intend to present here, approximately satisfies, provided that the expected numbers are not too small,
$\chi^2 \sim Chi^2(k-1)$
where $$\sim$$ means drawn from and the degrees of freedom equal $$k-1$$ since the total number of observations is known and so the number in one of the categories is simply that minus the sum of those in the others and is therefore known too.
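As a quick numerical check of the algebra above, the following minimal sketch computes the statistic both from its definition and from the simplified form. The function names chiSquaredDirect and chiSquaredShortcut and the die roll counts are made up for illustration and are not part of the ak library.

```javascript
// Hypothetical helpers, not part of the ak library.

// The defining sum of (o_i - e_i)^2 / e_i.
function chiSquaredDirect(observed, expected) {
  let s = 0;
  for (let i = 0; i < observed.length; ++i) {
    const d = observed[i] - expected[i];
    s += d * d / expected[i];
  }
  return s;
}

// The simplified form: the sum of o_i^2 / e_i, minus n.
// Only valid when the observed and expected totals both equal n.
function chiSquaredShortcut(observed, expected) {
  let s = 0;
  let n = 0;
  for (let i = 0; i < observed.length; ++i) {
    s += observed[i] * observed[i] / expected[i];
    n += observed[i];
  }
  return s - n;
}

// Made-up counts for 60 rolls of a six-sided die, with 10 expected per face.
const dieObserved = [8, 12, 9, 11, 6, 14];
const dieExpected = [10, 10, 10, 10, 10, 10];

console.log(chiSquaredDirect(dieObserved, dieExpected));
console.log(chiSquaredShortcut(dieObserved, dieExpected));
```

Both results come to 4.2, up to floating point rounding, since the squared deviations sum to 42 and the squared observations to 642.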
Note that we can trivially generalise this to expectations that depend upon $$p$$ parameters that are calculated from the observations, counting their total as one of them, with
$\chi^2 \sim Chi^2(k-p)$
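Now, if $$F_k$$ is the cumulative distribution function, or CDF, of the chi-squared distribution with $$k$$ degrees of freedom and $$x^2$$ is a random variable drawn from the one with $$k-p$$, we have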
\begin{align*} \Pr\left(x^2 \leqslant \chi^2\right) &= F_{k-p}\left(\chi^2\right)\\ \Pr\left(x^2 > \chi^2\right) &= 1 - F_{k-p}\left(\chi^2\right) \end{align*}
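the latter of which is the probability that a statistic greater than the one that we calculated would arise by chance if the expectations were correct, and which we may consequently interpret as the confidence we have that the observations are consistent with the expectations.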
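To make the second of these probabilities concrete, here is a minimal sketch that assumes we evaluate the CDF with the power series of the regularised lower incomplete gamma function, which should be adequate for modest values of the statistic but is slow and inaccurate for very large ones. The names logGammaHalf, chiSquaredCDFSketch and chiSquaredPValue are hypothetical stand-ins, and this is not a description of how the ak library calculates the CDF.

```javascript
// Hypothetical stand-ins, not the ak library's implementation.

// log(Gamma(k/2)) for a positive integer k, built up from
// Gamma(1) = 1 or Gamma(1/2) = sqrt(pi) using Gamma(a+1) = a * Gamma(a).
function logGammaHalf(k) {
  let a = k % 2 === 0 ? 1 : 0.5;
  let lg = k % 2 === 0 ? 0 : 0.5 * Math.log(Math.PI);
  while (a < k / 2) {
    lg += Math.log(a);
    a += 1;
  }
  return lg;
}

// The chi-squared CDF with k degrees of freedom is the regularised lower
// incomplete gamma function P(k/2, x/2), evaluated here by its power series.
function chiSquaredCDFSketch(k, x) {
  if (x <= 0) return 0;
  const a = k / 2;
  const h = x / 2;
  let term = 1 / a;
  let sum = term;
  for (let n = 1; n < 10000; ++n) {
    term *= h / (a + n);
    sum += term;
    if (term < sum * 1e-16) break;
  }
  return Math.min(1, sum * Math.exp(a * Math.log(h) - h - logGammaHalf(k)));
}

// The probability of a statistic greater than the one observed.
function chiSquaredPValue(stat, categories, params) {
  return 1 - chiSquaredCDFSketch(categories - params, stat);
}

// Six categories and one parameter give five degrees of freedom.
console.log(chiSquaredPValue(4.2, 6, 1));
```

For a statistic of 4.2 with five degrees of freedom the result is a little over one half, so observations that deviate from the expectations by this much would be entirely unremarkable.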

Chi-Squared Contingency Tests

Another use of the chi-squared test is to determine the likelihood that the values of two different properties are independently distributed throughout a population. We can represent such distributions with a contingency table of the form
$\begin{array}{c|ccccc} & A & B & C & D & \dots\\ \hline 1 & A_1 & B_1 & C_1 & D_1 & \dots\\ 2 & A_2 & B_2 & C_2 & D_2 & \dots\\ 3 & A_3 & B_3 & C_3 & D_3 & \dots\\ 4 & A_4 & B_4 & C_4 & D_4 & \dots\\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array}$
whose elements are the numbers of members of the population that exhibit the given values of the properties. We can represent this with a matrix
$\mathbf{M} = \begin{pmatrix} m_{00} & m_{01} & m_{02} & m_{03} & \dots\\ m_{10} & m_{11} & m_{12} & m_{13} & \dots\\ m_{20} & m_{21} & m_{22} & m_{23} & \dots\\ m_{30} & m_{31} & m_{32} & m_{33} & \dots\\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$
Defining
\begin{align*} r_i &= \sum_{j=0}^{n_c-1} m_{ij}\\ c_j &= \sum_{i=0}^{n_r-1} m_{ij}\\ n &= \sum_{i=0}^{n_r-1} r_i = \sum_{j=0}^{n_c-1} c_j \end{align*}
the expected value of the element in the $$i^\mathrm{th}$$ row and $$j^\mathrm{th}$$ column is
$e_{ij} = \frac{r_i}{n} \times \frac{c_j}{n} \times n = \frac{r_i \times c_j}{n}$
We may consequently calculate
$\chi^2_{ij} = \frac{\left(m_{ij} - e_{ij}\right)^2}{e_{ij}}$
for each element, the sum of which is chi-squared distributed as
\begin{align*} \chi^2 = \sum_{i=0}^{n_r-1} \sum_{j=0}^{n_c-1} \chi^2_{ij} \sim Chi^2\left(\left(n_r-1\right) \times \left(n_c-1\right)\right) &= Chi^2\left(n_r \times n_c - n_r - n_c + 1\right)\\ &= Chi^2\left(n_r \times n_c - \left(n_r + n_c - 1\right)\right) \end{align*}
meaning that we have
\begin{align*} k &= n_r \times n_c\\ p &= n_r + n_c - 1 \end{align*}
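The marginal sums, expected values and statistic can be sketched for a table held as an array of rows. The function name contingencyChiSquared and the counts are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes that no row or column sums to zero.

```javascript
// Hypothetical helper, not part of the ak library.
// Takes a table as an array of equal-length rows of counts and returns the
// chi-squared statistic together with its degrees of freedom.
function contingencyChiSquared(m) {
  const nr = m.length;
  const nc = m[0].length;
  const r = new Array(nr).fill(0);
  const c = new Array(nc).fill(0);
  let n = 0;

  // accumulate the marginal sums r_i and c_j and the total n
  for (let i = 0; i < nr; ++i) {
    for (let j = 0; j < nc; ++j) {
      r[i] += m[i][j];
      c[j] += m[i][j];
      n += m[i][j];
    }
  }

  // sum the per-element terms with e_ij = r_i * c_j / n
  let stat = 0;
  for (let i = 0; i < nr; ++i) {
    for (let j = 0; j < nc; ++j) {
      const e = r[i] * c[j] / n;
      stat += (m[i][j] - e) * (m[i][j] - e) / e;
    }
  }
  return { stat: stat, df: (nr - 1) * (nc - 1) };
}

// Made-up counts: rows and columns each sum to 50, so every e_ij is 25.
console.log(contingencyChiSquared([[30, 20], [20, 30]]));  // { stat: 4, df: 1 }
```

Here every element differs from its expected value of 25 by 5, so each contributes 25/25 = 1 to the statistic, and a two by two table has one degree of freedom.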
Related to this is a test for whether two samples are independent, which we can calculate by constructing the contingency matrix
$m_{ij} = \sum_{k=0}^{n_k-1} I\left(x_k=i \wedge y_k=j\right)$
where $$x_k$$ and $$y_k$$ are the $$k^\mathrm{th}$$ of $$n_k$$ pairs of samples, $$I$$ is an indicator function, which takes a value of one if its argument is true and a value of zero if not, and $$\wedge$$ means and.
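In code the sum of indicator functions amounts to adding one to a single element for each pair of samples. The following is a minimal sketch with the hypothetical name contingencyFromSamples, assuming that the samples are equal-length arrays of non-negative integer category labels.

```javascript
// Hypothetical helper, not part of the ak library.
// Counts how many pairs (x[k], y[k]) fall in each cell, assuming the samples
// are equal-length arrays of non-negative integer category labels.
function contingencyFromSamples(x, y) {
  const nr = Math.max(...x) + 1;
  const nc = Math.max(...y) + 1;
  const m = [];
  for (let i = 0; i < nr; ++i) m.push(new Array(nc).fill(0));

  // each pair adds one to exactly one cell, as the indicator function does
  for (let k = 0; k < x.length; ++k) m[x[k]][y[k]] += 1;
  return m;
}

const xs = [0, 0, 1, 1, 1, 0];
const ys = [0, 1, 1, 1, 0, 0];
console.log(contingencyFromSamples(xs, ys));  // [ [ 2, 1 ], [ 1, 2 ] ]
```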

The Implementation

Listing 1 gives the implementation of the chi-squared statistic which expects the observations and expectations to be passed as arrays of non-negative finite numbers.

Listing 1: The Chi-Squared Statistic
ak.chiSquaredStat = function(observed, expected) {
var no = 0;
var ne = 0;
var s = 0;
var n, i, oi, ei;

if(ak.nativeType(observed)!==ak.ARRAY_T) {
throw new Error('invalid observed samples in ak.chiSquaredStat');
}
if(ak.nativeType(expected)!==ak.ARRAY_T) {
throw new Error('invalid expected samples in ak.chiSquaredStat');
}
n = observed.length;

if(n<2) throw new Error('too few classes in ak.chiSquaredStat');
if(expected.length!==n) {
throw new Error('observed/expected size mismatch in ak.chiSquaredStat');
}

for(i=0;i<n;++i) {
oi = observed[i];
ei = expected[i];
if(ak.nativeType(oi)!==ak.NUMBER_T || oi<0 || !isFinite(oi)) {
throw new Error('invalid observed sample in ak.chiSquaredStat');
}
if(ak.nativeType(ei)!==ak.NUMBER_T || ei<0 || !isFinite(ei)) {
throw new Error('invalid expected sample in ak.chiSquaredStat');
}
no += oi;
ne += ei;
}
if(no===0 || !isFinite(no)) {
throw new Error('invalid observed samples in ak.chiSquaredStat');
}
if(ne===0 || !isFinite(ne)) {
throw new Error('invalid expected samples in ak.chiSquaredStat');
}

for(i=0;i<n;++i) {
oi = observed[i];
ei = expected[i];
if(oi!==0) s += ei!==0 ? oi*oi/ei : ak.INFINITY;
}
s *= ne/no;
return Math.max(s-no, 0);
};


After verifying that the arguments are correct we calculate the statistic itself, taking care to handle zeros consistently. Note that the sums of the observations and of the expectations are not required to be equal since we scale the sum so that the expectations in the denominators have the same total as the observations.
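To see why the scaling works, the sketch below compares expectations given as counts with the same expectations given as unscaled ratios. It uses the hypothetical name scaledChiSquared and mirrors only the final arithmetic of listing 1, omitting its validation and its handling of zeros.

```javascript
// Hypothetical sketch of the scaling step alone; it omits listing 1's checks.
function scaledChiSquared(observed, expected) {
  let no = 0;
  let ne = 0;
  let s = 0;
  for (let i = 0; i < observed.length; ++i) {
    no += observed[i];
    ne += expected[i];
    s += observed[i] * observed[i] / expected[i];
  }
  // multiplying by ne/no is the same as dividing by expectations
  // that have been rescaled to share the observations' total
  return s * ne / no - no;
}

const counts = [18, 22, 20];
console.log(scaledChiSquared(counts, [20, 20, 20]));  // expectations as counts
console.log(scaledChiSquared(counts, [1, 1, 1]));     // same ratios, unscaled
```

Both calls yield 0.4, up to floating point rounding, which agrees with summing the squared deviations of 4, 4 and 0 divided by 20.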
The chi-squared test is implemented in listing 2, along with the test of a non-parametric fit, for which the total number of observations is the only parameter.

Listing 2: The Chi-Squared Test
ak.chiSquaredTest = function(observed, expected, params) {
var n, s, x;

if(ak.nativeType(observed)!==ak.ARRAY_T) {
throw new Error('invalid observed samples in ak.chiSquaredTest');
}
n = observed.length;
if(ak.nativeType(params)!==ak.NUMBER_T || params!==ak.floor(params)
|| params<1 || params>=n) {
throw new Error('invalid params in ak.chiSquaredTest');
}

s = ak.chiSquaredStat(observed, expected);
x = ak.chiSquaredCDF(n-params);
return 1-x(s);
};

ak.chiSquaredTestNonParamFit = function(observed, expected) {
return ak.chiSquaredTest(observed, expected, 1);
};


Program 1 demonstrates the chi-squared test for two sets of observations, the first of which is consistent with the expectations and the second of which is not.

Program 1: Using ak.chiSquaredTestNonParamFit

Listing 3 implements the chi-squared test for contingency matrices, represented by our ak.matrix type, which is required to contain non-negative, finite elements.

Listing 3: The Contingent Chi-Squared Test
ak.chiSquaredTestContingent = function(cons) {
var nr, nc, r, c, i, observed, expected, sr, sc, crc;

if(ak.type(cons)!==ak.MATRIX_T) {
throw new Error('invalid contingency matrix in ak.chiSquaredTestContingent');
}
nr = cons.rows();
nc = cons.cols();
if(nr<2) throw new Error('too few rows in ak.chiSquaredTestContingent');
if(nc<2) throw new Error('too few columns in ak.chiSquaredTestContingent');

observed = new Array(nr*nc);
expected = new Array(nr*nc);

sr = new Array(nr);
sc = new Array(nc);
for(r=0;r<nr;++r) sr[r] = 0;
for(c=0;c<nc;++c) sc[c] = 0;

i = 0;
for(r=0;r<nr;++r) for(c=0;c<nc;++c) {
crc = cons.at(r, c);
if(!isFinite(crc) || crc<0) {
throw new Error('invalid table entry in ak.chiSquaredTestContingent');
}

sr[r] += crc;
sc[c] += crc;
observed[i++] = crc;
}
i = 0;
for(r=0;r<nr;++r) for(c=0;c<nc;++c) expected[i++] = sr[r]*sc[c];

return ak.chiSquaredTest(observed, expected, nr+nc-1);
};


This calculates the marginal sums at the same time as converting the contingency matrix into an array of observations. Note that when calculating the expectations we don't need to divide through by the sum since ak.chiSquaredTest will properly scale them for us.
Program 2 demonstrates both a successful and an unsuccessful application of the test.

Program 2: Using ak.chiSquaredTestContingent

Finally, listing 4 shows the implementation of the chi-squared test for independent samples, passed as arrays of non-negative finite integers.

Listing 4: The Independent Chi-Squared Test
ak.chiSquaredTestIndependent = function(sample1, sample2) {
var n1 = 0;
var n2 = 0;
var n, i, j, k, x1i, x2i, n12, p1, p2, observed, expected;

if(ak.nativeType(sample1)!==ak.ARRAY_T) {
throw new Error('invalid first samples in ak.chiSquaredTestIndependent');
}
if(ak.nativeType(sample2)!==ak.ARRAY_T) {
throw new Error('invalid second samples in ak.chiSquaredTestIndependent');
}
n = sample1.length;
if(sample2.length!==n) {
throw new Error('sample size mismatch in ak.chiSquaredTestIndependent');
}

for(i=0;i<n;++i) {
x1i = sample1[i];
x2i = sample2[i];

if(ak.nativeType(x1i)!==ak.NUMBER_T || x1i!==ak.floor(x1i)
|| !isFinite(x1i) || x1i<0) {
throw new Error('invalid first sample in ak.chiSquaredTestIndependent');
}
if(ak.nativeType(x2i)!==ak.NUMBER_T || x2i!==ak.floor(x2i)
|| !isFinite(x2i) || x2i<0) {
throw new Error('invalid second sample in ak.chiSquaredTestIndependent');
}

n1 = Math.max(n1, x1i);
n2 = Math.max(n2, x2i);
}
++n1;
++n2;

if(n1<2) {
throw new Error('too few first categories in ak.chiSquaredTestIndependent');
}
if(n2<2) {
throw new Error('too few second categories in ak.chiSquaredTestIndependent');
}
n12 = n1*n2;

p1 = new Array(n1);
p2 = new Array(n2);
for(i=0;i<n1;++i) p1[i] = 0;
for(i=0;i<n2;++i) p2[i] = 0;

observed = new Array(n12);
expected = new Array(n12);
for(i=0;i<n12;++i) observed[i] = 0;

for(i=0;i<n;++i) {
x1i = sample1[i];
x2i = sample2[i];

++p1[x1i];
++p2[x2i];
++observed[x1i*n2+x2i];
}
k = 0;
for(i=0;i<n1;++i) for(j=0;j<n2;++j) expected[k++] = p1[i]*p2[j];

return ak.chiSquaredTest(observed, expected, n1+n2-1);
};


Rather than explicitly creating a contingency matrix, this populates the arrays of its observed and expected values directly.
The independence test is demonstrated by program 3, firstly for highly dependent samples and secondly for less dependent samples.

Program 3: Using ak.chiSquaredTestIndependent

In closing, all of these functions can be found in ChiSquaredTest.js.

References

[1] Chi Chi Again, www.thusspakeak.com, 2022.

[2] Pearson, K., On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philosophical Magazine, Ser. 5, Vol. 50, No. 302, 1900.