
Method | Chi-square Test

Posted on: May 8, 2024


What is the chi-square test?

The chi-square test, also known as the goodness-of-fit test, is used to determine whether a set of observed data conforms to a particular theoretical distribution.

More specifically, it can test whether a categorical variable follows a particular distribution (distribution test) or whether two categorical variables are independent (independence test).

Since the statistic used in these tests follows a chi-square distribution when the null hypothesis is true, the procedure is called a chi-square test.

The variables involved in chi-square tests are typically categorical. However, continuous variables can also be handled: to test, say, "H0: the data come from a normal distribution," the data can be discretized first and then processed like a categorical variable.
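For a concrete sense of how this works, here is a minimal sketch in Python with NumPy and SciPy (my own illustration, not from the original post); the simulated sample, the choice of k = 10 equal-probability bins, and the degrees-of-freedom adjustment are all illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=500)  # illustrative sample to be tested

# Estimate the normal parameters from the data (two estimated parameters).
mu, sigma = x.mean(), x.std(ddof=1)

# Discretize into k bins of equal probability under the fitted normal.
k = 10
cuts = stats.norm.ppf(np.arange(1, k) / k, loc=mu, scale=sigma)  # k-1 interior cut points
observed = np.bincount(np.searchsorted(cuts, x), minlength=k)    # O_i for each bin
expected = np.full(k, x.size / k)                                # E_i = n/k by construction

# Two parameters were estimated from the data, so the reference distribution
# has k - 1 - 2 degrees of freedom; ddof=2 tells SciPy about the adjustment.
stat, p = stats.chisquare(observed, expected, ddof=2)
print(f"chi2 = {stat:.3f}, p-value = {p:.4f}")
```

Note the ddof adjustment: each parameter estimated from the data costs one degree of freedom on top of the usual k - 1.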

One variable - Distribution test

Use the sample $\{X_1, X_2, \cdots, X_n\}$ to test the hypothesis:

$$H_0: P(X=a_i) = p_i,\quad i=1,\cdots,k$$

Test statistic:

$$Z = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^k \frac{(n_i - np_i)^2}{np_i}$$

$O_i = n_i$ denotes the observed frequency of category $i$, while $E_i = np_i$ denotes its expected frequency under $H_0$. Hence, the larger the statistic $Z$, the stronger the evidence against $H_0$.

Furthermore, when $H_0$ is true and the sample size $n$ is large enough, $Z$ approximately follows a chi-square distribution with $k-1$ degrees of freedom. Based on that, we can calculate the p-value or the critical value and draw a conclusion accordingly.
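As a quick illustration (made-up counts, not from the original post), the following sketch tests whether a die is fair using scipy.stats.chisquare, and also evaluates the formula above directly:

```python
import numpy as np
from scipy import stats

observed = np.array([43, 52, 54, 40, 53, 58])   # n_i: counts of faces 1..6
expected = observed.sum() * np.full(6, 1 / 6)   # E_i = n * p_i under H0: fair die

# The statistic from the formula above, computed by hand:
z_manual = ((observed - expected) ** 2 / expected).sum()

# SciPy computes the same statistic and the p-value with df = k - 1 = 5.
z, p_value = stats.chisquare(observed, expected)
print(f"Z = {z:.3f} (manual: {z_manual:.3f}), p-value = {p_value:.4f}")
```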

Two variables - Independence test

There are two categorical variables. Their ranges are:

$$X = 1, 2, \cdots, a \\ Y = 1, 2, \cdots, b$$

Our sample is shown in the following contingency table, where $n_{ij}$ is the number of observations with $X=i$ and $Y=j$, and $n_{i*}$, $n_{*j}$ are the row and column totals:

| | $Y=1$ | $\cdots$ | $Y=b$ | Total |
|---|---|---|---|---|
| $X=1$ | $n_{11}$ | $\cdots$ | $n_{1b}$ | $n_{1*}$ |
| $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X=a$ | $n_{a1}$ | $\cdots$ | $n_{ab}$ | $n_{a*}$ |
| Total | $n_{*1}$ | $\cdots$ | $n_{*b}$ | $n$ |

In this scenario, the null hypothesis is that there is no relationship between the two variables:

$$H_0: X \perp Y$$

or, equivalently:

$$H_0: P(X=i, Y=j) = P(X=i)P(Y=j),\quad \forall i,j$$

First, estimate the marginal distributions:

$$P(X=i) \approx \hat{u}_i = n_{i*}/n,\quad P(Y=j) \approx \hat{v}_j = n_{*j}/n$$

Test statistic:

$$Z = \sum_{i=1}^a \sum_{j=1}^b \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{i=1}^a \sum_{j=1}^b \frac{(n_{ij} - n\hat{u}_i \hat{v}_j)^2}{n\hat{u}_i \hat{v}_j}$$

When $H_0$ is true and the sample size $n$ is large enough, $Z$ approximately follows a chi-square distribution with $(a-1)(b-1)$ degrees of freedom. Based on that, we can calculate the p-value or the critical value and draw a conclusion accordingly.
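In practice the whole procedure is one call to scipy.stats.chi2_contingency; the 2x3 table below is made-up illustrative data, not from the original post:

```python
import numpy as np
from scipy import stats

table = np.array([[30, 45, 25],
                  [20, 35, 45]])   # n_ij: rows are X = 1..a, columns are Y = 1..b

# correction=False gives the plain Pearson statistic (no Yates correction).
z, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"Z = {z:.3f}, dof = {dof}, p-value = {p_value:.4f}")  # dof = (a-1)(b-1) = 2
```

The returned `expected` array holds the cell-wise expected counts $n\hat{u}_i\hat{v}_j$.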

Chi-squared test for trend

In both of the cases above, we haven't specified whether the categorical variables involved are nominal or ordinal. When dealing with ordinal variables, using the chi-square test in the ways described above (also referred to as "Pearson's chi-squared tests") fails to leverage the information provided by their ordering:

In spite of its usefulness, there are conditions under which the use of Pearson’s chi-square, although appropriate, is not the optimum procedure. Such a situation occurs when the categories forming a table have a natural ordering.

There are several other tests that take the ordering information into account when dealing with ordinal variables, such as the Cochran–Armitage test for trend and the Mantel–Haenszel linear-by-linear association test.
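As a sketch of the latter (my own illustration, not from the original post): the linear-by-linear association statistic can be computed as $M^2 = (n-1)r^2$, where $r$ is the correlation between chosen row and column scores, and is compared against a chi-square distribution with 1 degree of freedom. The table and the equally spaced scores below are arbitrary assumptions:

```python
import numpy as np
from scipy import stats

table = np.array([[20, 15, 10],
                  [15, 20, 15],
                  [10, 15, 25]])      # ordinal X (rows) vs. ordinal Y (columns)
u = np.array([1.0, 2.0, 3.0])        # row scores (a common default: 1..a)
v = np.array([1.0, 2.0, 3.0])        # column scores (a common default: 1..b)

n = table.sum()
pu = table.sum(axis=1) / n           # marginal distribution of X
pv = table.sum(axis=0) / n           # marginal distribution of Y
mu_u, mu_v = pu @ u, pv @ v
cov = ((table / n) * np.outer(u - mu_u, v - mu_v)).sum()
r = cov / np.sqrt((pu @ (u - mu_u) ** 2) * (pv @ (v - mu_v) ** 2))

m2 = (n - 1) * r ** 2                # ~ chi-square with 1 df under H0
p_value = stats.chi2.sf(m2, df=1)
print(f"M^2 = {m2:.3f}, p-value = {p_value:.4f}")
```

With a binary column variable, this statistic essentially reduces to the Cochran–Armitage trend test.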
