
T-test: Motivation, Definition, and Derivation

Contents

  1. Contextual Problem
  2. Defining $s^2$, the unbiased estimator of $\sigma^2$
  3. Motivating the T test
  4. Deriving the T statistic

1. Contextual Problem

A key objective when analyzing large systems across many domains is to check whether the quantifiable metrics of the system do indeed adhere to one’s expectations. For example, you are running a flour packaging factory and want to see if the mass of each pack of flour indeed has some expected value. Or perhaps you are given a sample of flour pack masses that somebody has measured and want to see if that sample of flour packs indeed came from your factory. Whatever it is, you would typically sample a bunch of observations (measure a bunch of flour packs in this case), compute sample statistics (the sample mean $\bar{X}_n$ and the sample variance $\frac{1}{n}\sum_i (X_i - \bar{X}_n)^2$), and try to say something about how different these look from the expected (“assumed”) distribution statistics (population mean $\mu$ and variance $\sigma^2$). The tests typically done with these statistics are the Z test and the T test. The former is significantly easier to understand, and the latter really takes after the former, so we’ll start with the Z test and use our intuition about it to reverse-engineer the T test, much as was done when the T test was invented.

2. Defining $s^2$, the unbiased estimator of $\sigma^2$

Let’s define some terms. Given a sample of $n$ observations $X_1, \ldots, X_n$, this is the sample mean:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

which is markedly different from the true population mean $\mu$. (“Population” refers to the universe of all instances of the thing you’re trying to measure.) This is the sample variance:

$$\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$

Notice that I did not assign it any symbol. This is because we’ll be working with the sample (unbiased) estimator of the population variance, $s^2$, instead:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$$

This is a scaled version of the actual sample variance. The overarching reason we have to scale it up by a factor of $\frac{n}{n-1}$ is that the sample variance tends to underestimate the population variance. There are many ways to explain this intuitively, including the somewhat vague one: the sample has $n$ data points, but we subtract the sample mean away from every point, so effectively it only has $n-1$ degrees of freedom. This is not at all obvious to me, so here’s a short algebraic proof that $s^2$ is an unbiased estimator of $\sigma^2$ (meaning that $\mathbb{E}[s^2] = \sigma^2$):

Proof that $\mathbb{E}[s^2] = \sigma^2$:

$$
\begin{aligned}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 \\
    &= \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i^2 - 2X_i\bar{X}_n + \bar{X}_n^2\right) \\
    &= \frac{1}{n-1}\left(\left(\sum_{i=1}^{n} X_i^2\right) - 2n\bar{X}_n^2 + n\bar{X}_n^2\right) \\
    &= \frac{1}{n-1}\left(\left(\sum_{i=1}^{n} X_i^2\right) - n\bar{X}_n^2\right)
\end{aligned}
$$

Taking the expectation of this whole thing:

$$
\mathbb{E}[s^2] = \frac{1}{n-1}\left(n\,\mathbb{E}[X_i^2] - n\,\mathbb{E}[\bar{X}_n^2]\right)
= \frac{1}{n-1}\left[n(\sigma^2 + \mu^2) - n\,\mathbb{E}[\bar{X}_n^2]\right]
\quad \text{because } \sigma^2 = \mathbb{E}[X^2] - (\mathbb{E}[X])^2
$$

Now how do we parse the second term, $\mathbb{E}[\bar{X}_n^2]$? We reason in terms of the distribution of $\bar{X}_n$:

$$
\mathrm{Var}(\bar{X}_n) = \mathbb{E}[\bar{X}_n^2] - (\mathbb{E}[\bar{X}_n])^2,
\quad \text{and} \quad
\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n) \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)
\implies \mathbb{E}[\bar{X}_n^2] = \frac{\sigma^2}{n} + \mu^2
$$

So we continue simplifying $\mathbb{E}[s^2]$:

$$
\mathbb{E}[s^2] = \frac{1}{n-1}\left[n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right]
= \frac{1}{n-1}\left[(n-1)\sigma^2\right] = \sigma^2 \quad \text{(Q.E.D.)}
$$
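To make the unbiasedness concrete, here is a minimal simulation sketch (assuming NumPy is available; all names and numbers are made up for illustration) comparing the $\frac{1}{n}$ and $\frac{1}{n-1}$ versions of the estimator against the true $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 10, 200_000

# Draw `trials` independent samples of size n from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)  # sum of squared deviations per sample

biased = ss / n          # divide by n: tends to underestimate sigma^2
unbiased = ss / (n - 1)  # s^2: divide by n - 1

print("true sigma^2:          ", sigma**2)         # 4.0
print("mean of the 1/n form:  ", biased.mean())    # ~3.6, i.e. (n-1)/n * sigma^2
print("mean of s^2 (1/(n-1)): ", unbiased.mean())  # ~4.0
```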

3. Motivating the T test

So that was $s^2$, which may still seem irrelevant at this point, but since the T test uses $s$, it helps to understand why we use it ($s^2$ is unbiased). On to the T test, by way of the Z test!

Z test

Here’s how the Z test goes. Suppose you have a random variable $X$ that follows some Gaussian distribution, $X \sim \mathcal{N}(\mu, \sigma^2)$, and you make an observation $x$. You want to see how far off $x$ is from the expected mean. The idea is that if $x$ is several standard deviations ($\sigma$) away from the assumed mean $\mu$, then perhaps there is reason to believe that $X$ doesn’t actually have a mean of $\mu$. In formal language, we say this: $X \sim \mathcal{N}(\mu, \sigma^2)$ under the null hypothesis $H_0$. If an observation $x$ deviates from the null hypothesis mean $\mu$ by more than a certain amount (of our choosing), then we reject $H_0$ in favor of an alternative hypothesis $H_A$ (or $H_1$, depending on the literature).

The deviation from the mean is a measure of how “anomalous” (under $H_0$) this data point looks. To quantify this, we simply take the percentile of $x$. If $x$ has a very high percentile (say, above 0.975; again, this cut-off is of our choosing) or a very low percentile (say, below 0.025), then it would appear anomalous. There is no analytical percentile function for the Gaussian distribution, but there is a Z table for the Standard Gaussian ($\mathcal{N}(0, 1)$), so we’ll first transform $x$ such that we can say that it came from a Standard Gaussian:

$$
X \sim \mathcal{N}(\mu, \sigma^2) \implies X_{\text{normalized}} = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)
$$

So, instead of looking at $x$, we look at $\frac{x - \mu}{\sigma}$ and say that it was drawn from the Standard Gaussian. Because we can say this, we can easily use the Z table to figure out what percentile it sits at. If it lies in the tails, then we can reject $H_0$ at some “level of significance”.

[Figure: Standard Gaussian density with the two-tailed rejection regions for $\alpha = 0.05$ shaded in blue.]

This “level of significance” is also known as $\alpha$, and it determines how far away from the mean an observation has to be in order to constitute an anomaly. In a two-tailed test like this one (an observation is anomalous if it’s too far from the mean in EITHER direction), the cut-off percentiles are simply $\frac{\alpha}{2}$ and $1 - \frac{\alpha}{2}$, which correspond to the shaded blue regions above. What does it mean if we reject $H_0$ because we made an observation that lies in this region? If $H_0$ were actually true, an observation would land that far from the mean purely by chance only $\alpha$ of the time; so by rejecting whenever we see such an observation, we falsely reject a true null with probability $\alpha$, which is exactly the error rate we’ve decided to tolerate.
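As a concrete sketch of the two-tailed Z test just described (assuming SciPy is available; the flour-pack numbers are invented for illustration):

```python
from scipy.stats import norm

mu0, sigma = 1000.0, 5.0   # null-hypothesis mean and known population std dev (grams)
x = 1011.0                 # a single observed flour pack mass
alpha = 0.05

z = (x - mu0) / sigma              # normalize to a Standard Gaussian
p_value = 2 * norm.sf(abs(z))      # two-tailed: P(|Z| >= |z|) under H0
z_crit = norm.ppf(1 - alpha / 2)   # cut-off corresponding to the 1 - alpha/2 percentile

print(f"z = {z:.2f}, p-value = {p_value:.4f}, critical |z| = {z_crit:.2f}")
print("reject H0" if abs(z) > z_crit else "fail to reject H0")
```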

The Z test can be done for any statistic, not just observations of data. As another example, suppose that we were trying to do a linear regression from number of bathrooms ($X_1 \in \mathbb{R}^n$) and square footage ($X_2 \in \mathbb{R}^n$) to house prices ($y \in \mathbb{R}^n$) over a sample of size $n$. If we define our design matrix as one usually does in linear regression:

$$
X = \begin{bmatrix} | & | & | \\ \mathbf{1} & X_1 & X_2 \\ | & | & | \end{bmatrix}
$$

Then our predicted covariate coefficients are simply $\hat{\beta} = (X^\top X)^{-1} X^\top y$. A question one might ask is: how similar is our sample’s $\hat{\beta}$ to some presumed “null” $\beta$? That is, if I had a strong reason to believe that houses in San Francisco relate number of bathrooms and square footage to price via some $\beta$, do I have strong reason to believe that my sample came from the same distribution or not? Then, notice that:

$$
\hat{\beta} \sim \mathcal{N}\!\left(\beta, \; \sigma^2 (X^\top X)^{-1}\right)
$$

So the normalization of this would still be conceptually the same (subtract expectation, then divide by standard deviation):

$$
\hat{\beta}_{\text{normalized}} = \frac{(X^\top X)^{1/2}\left(\hat{\beta} - \beta\right)}{\sigma} \sim \mathcal{N}(0, I)
$$
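If you’d like to convince yourself of that sampling distribution for $\hat{\beta}$, here is a small simulation sketch (assuming NumPy; the design matrix and coefficients are invented, not real housing data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 200, 3.0
beta_true = np.array([50.0, 10.0, 0.2])  # made-up intercept, bathrooms, sqft coefficients

# Fixed design matrix: intercept column, number of bathrooms, square footage.
X = np.column_stack([np.ones(n),
                     rng.integers(1, 4, n).astype(float),
                     rng.uniform(500, 3000, n)])

# Repeatedly draw y = X beta + N(0, sigma^2 I) and refit beta_hat = (X^T X)^-1 X^T y.
betas = np.array([np.linalg.lstsq(X, X @ beta_true + rng.normal(0, sigma, n), rcond=None)[0]
                  for _ in range(5000)])

print("empirical cov of beta_hat:\n", np.cov(betas.T))
print("theoretical sigma^2 (X^T X)^-1:\n", sigma**2 * np.linalg.inv(X.T @ X))
```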

4. Deriving the T statistic

The T test, conceptually, has the exact same goal as the Z test; the only difference is that we use the T test when we don’t know the population variance $\sigma^2$, which is most of the time. Very rarely do situations in real life allow us to know, or even assume, what the global variance $\sigma^2$ is. If we don’t have the variance, we have to at least estimate it somehow; if we have NO idea how spread out the distribution is, we can make no meaningful assessment of anomaly.

To estimate the population variance, we need multiple observations, not just a single data point. This is why a T test is typically done on sample statistics (and not individual observations), such as the sample mean $\bar{X}_n$, or even $\hat{\beta}$ as demonstrated in the example above. Before we deal with the fact that we don’t know $\sigma^2$, let’s assume we do know it and spell out clearly what it means to perform a Z-style test on a sample statistic.

Suppose we have a sample of $n$ data points $X_1, \ldots, X_n$, and we want to compare the sample mean $\bar{X}_n$ to a null (“presumed”) mean $\mu$. That is, the null hypothesis assumes that:

$$X \sim \mathcal{N}(\mu, \sigma^2)$$

This means that the sample mean has the following distribution:

$$\bar{X}_n \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$$

In the world of Z tests, we wanted to normalize our distribution to a Standard Gaussian because we have the Z table for that, which makes it easy to get our percentile values (the “p-value”). The same is true for the t distribution: we have no analytical percentile function for it, but we do have numerical estimates of percentiles for a “standard” t distribution (actually, many of them, depending on the degrees of freedom of the t distribution, but more on that later). So, let’s normalize to get as close to a Standard Gaussian as possible:

$$
\bar{X}_n - \mu \sim \mathcal{N}\!\left(0, \frac{\sigma^2}{n}\right)
\implies \sqrt{n}\,(\bar{X}_n - \mu) \sim \mathcal{N}(0, \sigma^2)
\implies \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \sim \mathcal{N}(0, 1)
$$

If we knew $\sigma^2$, that would be it. That’s exactly a Standard Gaussian, and we can just use the Z test; we’re done. But we don’t know $\sigma^2$. We only know $s$, which should be a good estimate of $\sigma$ (we showed above that $\mathbb{E}[s^2] = \sigma^2$), so perhaps we can just put $s$ in the denominator instead of $\sigma$ and call it a day. This gives the T statistic that corresponds to this particular sample, but it is only halfway there, because unlike $\sigma$, $s$ is itself an observation of a random variable: sometimes our sample has large variance (making $s$ large), sometimes it has small variance (making $s$ small), but in expectation the sample variance should be close to $\sigma^2$. To get our percentile, we need to know what the t distribution is in order to see what percentile our T statistic stands at. Now we have:

T statistic

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{s}$$

t distribution

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{S} \sim \; ?$$

What is this? It’s a Gaussian random variable divided by another random variable. There are multiple ways to figure out what distribution this is; you could write the whole thing out analytically and mash out the algebra, but the conventional way is to reason in terms of a $\chi^2$ (“chi-squared”) distribution. Notice that the numerator above has the following distribution:

$$\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma} \sim \mathcal{N}(0, 1)$$

That is, the numerator is a Standard Gaussian scaled by $\sigma$. Our goal is then to express the denominator as something “standard”, also scaled by $\sigma$, so that the $\sigma$’s cancel out and the resulting distribution is no longer a function of $\sigma$. Thankfully, there is a way to express $S$ as just that.

Finding the distribution of S:

We defined $s$ by a formula above; capital $S$ denotes the same quantity viewed as a random variable. Let’s see what else it can be written as:

$$
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2
\implies (n-1)S^2 = \sum_{i=1}^{n}(X_i - \bar{X}_n)^2
$$

Since the $X_i$ are all i.i.d. and Gaussian, the centered vector $X - \mu\mathbf{1} \sim \mathcal{N}(0, \sigma^2 I)$ is an isotropic multivariate Gaussian, and the mean-centered vector $W$ is a linear transformation (in fact, a projection) of it:

$$
W = \begin{bmatrix} X_1 - \bar{X}_n \\ \vdots \\ X_n - \bar{X}_n \end{bmatrix}
= \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}
= \left(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\right)(X - \mu\mathbf{1}),
\qquad X - \mu\mathbf{1} \sim \mathcal{N}(0, \sigma^2 I)
$$

We note down two important facts:

Firstly:

$$\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = \lVert W \rVert_2^2$$

is the sum of the squares of all the entries of $W$. If we rotate our coordinate system about the origin, this sum of squares does not change. Moreover, the underlying centered vector $X - \mu\mathbf{1}$ is an isotropic Gaussian ball, symmetric about the origin, so we can rotate the basis any way we want without changing its distribution or the sum of squares.

Secondly:

$$I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$$

is a projection matrix that projects out the $\mathbf{1}$ subspace (the direction of the all-ones vector) from whatever it is applied to.

If you combine the above 2 facts: rotate to a basis whose first axis points along $\mathbf{1}$. In that basis the projection simply zeroes out the first coordinate and leaves the other $n-1$ coordinates untouched, as i.i.d. $\mathcal{N}(0, \sigma^2)$ variables. In effect, one dimension gets “nullified”, making

$$
\lVert W \rVert_2^2 = \sigma^2 \cdot \left(\text{the sum of } (n-1) \text{ squares of Standard Gaussian variables}\right) \sim \sigma^2\,\chi_{n-1}^2
$$

Wrapping this all together, we have:

$$S^2 \sim \frac{\sigma^2}{n-1}\,\chi_{n-1}^2$$
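Here is a quick numerical check of both claims (a simulation sketch assuming NumPy/SciPy): that $I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is an idempotent projection that kills the $\mathbf{1}$ direction, and that $(n-1)S^2/\sigma^2$ behaves like a $\chi^2_{n-1}$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
mu, sigma, n, trials = 0.0, 2.0, 8, 100_000

# The centering matrix P = I - (1/n) 1 1^T is an idempotent projection that kills 1.
ones = np.ones(n)
P = np.eye(n) - np.outer(ones, ones) / n
print("P idempotent:", np.allclose(P @ P, P), "| P @ 1 = 0:", np.allclose(P @ ones, 0))

# (n - 1) S^2 / sigma^2 should follow a chi^2 distribution with n - 1 degrees of freedom.
samples = rng.normal(mu, sigma, size=(trials, n))
scaled = (n - 1) * samples.var(axis=1, ddof=1) / sigma**2

print("empirical mean / var: ", scaled.mean(), scaled.var())
print("chi2(n-1) mean / var: ", chi2(n - 1).mean(), chi2(n - 1).var())  # n-1 and 2(n-1)
```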

t distribution:

$$
\frac{\sqrt{n}\,(\bar{X}_n - \mu)}{S}
\sim \frac{\sigma\,\mathcal{N}(0,1)}{\sqrt{\dfrac{\sigma^2}{n-1}\,\chi_{n-1}^2}}
= \frac{\mathcal{N}(0,1)}{\sqrt{\dfrac{\chi_{n-1}^2}{n-1}}}
= t_{n-1}
$$

The $t_{n-1}$ distribution, fully described as a “t distribution with $n-1$ degrees of freedom”, is precisely defined as a Standard Gaussian divided by $\sqrt{\chi_{n-1}^2 / (n-1)}$, with the numerator and denominator independent (which holds here, since $\bar{X}_n$ and $S^2$ are independent for Gaussian samples). You can see how expressing things in terms of scalar multiples of “standard” objects like $\mathcal{N}(0,1)$ and $\chi^2$ lets the $\sigma$’s in the numerator and denominator cancel, effectively giving us a distribution that is not a function of $\sigma^2$. This is what makes the t distribution and the T statistic the hypothesis testing method of choice when we’re unable to make any assumptions about $\sigma^2$.
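As a sanity check, here is a minimal sketch (assuming SciPy; the data are simulated, not real measurements) that computes the T statistic by hand and compares it against `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy.stats import t, ttest_1samp

rng = np.random.default_rng(3)
mu0 = 1000.0                          # null-hypothesis mean (grams)
x = rng.normal(1002.0, 5.0, size=12)  # a small sample of flour pack masses

n = len(x)
T = np.sqrt(n) * (x.mean() - mu0) / x.std(ddof=1)  # sqrt(n) (Xbar - mu) / s
p_manual = 2 * t.sf(abs(T), df=n - 1)              # two-tailed p-value from t_{n-1}

T_scipy, p_scipy = ttest_1samp(x, popmean=mu0)
print(f"manual: T = {T:.4f}, p = {p_manual:.4f}")
print(f"scipy : T = {T_scipy:.4f}, p = {p_scipy:.4f}")  # should match
```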

For visualization, here’s what the t distribution looks like at various degrees of freedom; image credit: Shoichi Midorikawa:

[Figure: t distribution density curves for several degrees of freedom.]

It looks very similar to the Gaussian distribution; as you can imagine, your T statistic will lie somewhere on one of these distributions (depending on the degrees of freedom). While in the past one might have used a t table to figure out percentiles, nowadays you can easily do it with a programming language like R or Python, as in the sketch below.
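For instance, here is a short sketch (assuming SciPy) of reading off t critical values and percentiles programmatically instead of from a printed t table:

```python
from scipy.stats import t

alpha = 0.05
for df in (2, 5, 10, 30, 100):
    # Two-tailed critical value: the 1 - alpha/2 percentile of t with df degrees of freedom.
    t_crit = t.ppf(1 - alpha / 2, df=df)
    print(f"df = {df:>3}: reject H0 if |T| > {t_crit:.3f}")

# Percentile (CDF value) of a particular T statistic, here T = 2.0 with 10 degrees of freedom.
print("P(T <= 2.0 | df = 10) =", t.cdf(2.0, df=10))
```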

This post is licensed under CC BY 4.0 by the author.