Hypothesis testing is a method used to determine whether there is enough evidence in a sample of data to infer that a certain condition holds true for the entire population. It starts with a null hypothesis (H₀), which assumes no effect or difference, and uses a p-value to measure how likely the observed data would occur if H₀ were true. If the p-value is below a chosen threshold (like 0.05), the null hypothesis is rejected, suggesting that the observed effect is statistically significant.
Contents:
- Null Hypothesis vs. Alternative Hypothesis
- What a P-Value is
- How they are used together in statistical hypothesis testing
- When and why they are useful
- Limitations and when not to use them
What Is the Null Hypothesis?
The null hypothesis (often written as H₀) is a starting assumption in statistics. It represents the idea that there is no effect, no difference, or nothing unusual happening in your data.
- It’s the idea you are trying to challenge.
- It’s the opposite of what you are trying to test for.
The property you are trying to test for is the alternative hypothesis (H₁ or Hₐ), which is the effect you are testing for statistical significance.
For example:
Thesis – Trading signal X predicts market direction more accurately than random entry.
- H₀ = Trading signal X has no benefit over a random entry trading signal.
- H₁ = Trading signal X has a benefit over a random entry trading signal.
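As a minimal sketch of how this could be tested, suppose the signal called the direction correctly on 140 of 250 trades (these counts are invented for illustration). Under H₀ the hit rate is 50%, so a one-sided binomial test applies:

```python
from scipy.stats import binomtest

# Hypothetical record: 140 correct directional calls out of 250 trades
n_trades = 250
n_correct = 140

# H0: hit rate = 0.5 (no better than random entry)
# H1: hit rate > 0.5 (the signal has predictive value)
result = binomtest(n_correct, n_trades, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")
```

A small p-value here would be evidence against random-entry performance; what counts as “small” is the significance level discussed below.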
What Is a P-Value?
The p-value is a number between 0 and 1 that tells you how likely it is to observe data at least as extreme as yours if the null hypothesis were true.
In simple terms:
The smaller the p-value, the stronger the evidence against the null hypothesis.
More precisely, the p-value is the probability, computed under the assumption that H₀ is true, of seeing results at least as extreme as the ones you observed. The lower the value, the harder the data are to explain by chance alone, and the stronger the case for the alternative hypothesis H₁; this is what makes the observed effect statistically significant.
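One way to make this definition concrete is a small simulation: generate many samples from a world where H₀ is true and count how often chance alone produces a result at least as extreme as the observed one. The sketch below assumes a known standard deviation of 1 and purely illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# An observed sample (simulated here with a small true effect)
observed = rng.normal(loc=0.3, scale=1.0, size=50)
observed_stat = abs(observed.mean())

# The null world: true mean is 0. Simulate it 100,000 times and ask how
# often chance alone yields a sample mean at least as extreme as ours.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, 50)).mean(axis=1)
p_value = (np.abs(null_means) >= observed_stat).mean()
print(f"simulated p-value: {p_value:.4f}")
```

The printed fraction is exactly the quantity the p-value estimates: how surprising the observed result would be if nothing were going on.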
How Null Hypothesis and P-Values Work Together in Hypothesis Testing
The basic steps of a hypothesis test:
1. State the Hypotheses
- Null Hypothesis (H₀): e.g., μ = 0
- Alternative Hypothesis (H₁): e.g., μ ≠ 0
2. Choose a Significance Level (α):
- This is your threshold for rejecting H₀.
- Common choices: α = 0.05 (5%), 0.01 (1%)
3. Collect Data and Run a Test:
- Use statistical tests: t-test, chi-square test, ANOVA, etc.
- These give you a p-value
4. Compare p-value to α:
- If p ≤ α → Reject H₀ (evidence against the null)
- If p > α → Fail to reject H₀ (not enough evidence)
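Put together, the four steps fit in a few lines. The sketch below uses a one-sample t-test on simulated data; the sample, effect size, and α are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)

# 1. State the hypotheses: H0: mu = 0, H1: mu != 0

# 2. Choose a significance level
alpha = 0.05

# 3. Collect data (simulated here) and run a statistical test
sample = rng.normal(loc=0.4, scale=1.0, size=30)
result = ttest_1samp(sample, popmean=0.0)

# 4. Compare the p-value to alpha
print(f"p-value = {result.pvalue:.4f}")
if result.pvalue <= alpha:
    print("Reject H0: evidence against the null")
else:
    print("Fail to reject H0: not enough evidence")
```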
Example:
Let’s say you’re testing whether a new teaching method improves scores.
- H₀: No improvement in test scores (mean = 70)
- H₁: Improvement (mean > 70)
- α = 0.05
- You get a p-value of 0.01
Since 0.01 < 0.05 → Reject H₀ → The new method likely works.
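A sketch of this one-sided test, assuming the new method’s scores are available as a sample (the scores below are invented):

```python
from scipy.stats import ttest_1samp

# Hypothetical scores from a class taught with the new method
scores = [74, 71, 78, 69, 75, 80, 72, 77, 73, 76]

# H0: mean = 70 vs. H1: mean > 70, so the test is one-sided
result = ttest_1samp(scores, popmean=70, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")  # reject H0 at alpha = 0.05 if p <= 0.05
```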
The significance level α is the false-positive risk you agree to accept in advance: the probability of rejecting H₀ when H₀ is actually true.
The α you choose determines how strict the test is. 1% and 5% are the most common choices; 10% is sometimes used, but it is often considered less reliable because it accepts a 10% chance of a false positive, that is, of declaring H₁ significant when in fact we should fail to reject H₀.
When to Use P-Value Testing
- You want to test a specific claim or effect in a data sample.
- You have enough data to make probabilistic conclusions.
- You want to make objective, data-driven decisions.
- You are comparing two or more groups, and want to know if differences are real or due to chance.
Limitations & When Not to Use It
1. Misinterpretation of the P-Value
- P-value ≠ probability the null hypothesis is true.
- It’s about the probability of your data given H₀ is true, not the other way around.
2. Statistical Significance ≠ Practical Significance
- A result may be “statistically significant” but have tiny or meaningless real-world impact.
- E.g., a drug improves recovery by 0.1% with p=0.01. Is it worth it?
3. P-Hacking / Multiple Testing
- If you run lots of tests, some will come up significant just by chance.
- This inflates false positives unless you correct for it (e.g., the Bonferroni correction; see the sketch after this list).
4. Overreliance on Arbitrary Thresholds
- People treat 0.05 like a sacred cutoff, but it’s just a convention.
- P=0.049 → “Significant!” vs. P=0.051 → “Not significant!” is a silly binary.
5. Sensitivity to Sample Size
- With too little data, you might miss real effects (false negatives).
- With too much data, even tiny effects become “significant”.
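As a minimal illustration of the multiple-testing problem from point 3, the simulation below runs 100 t-tests on pure noise, where H₀ is true by construction, and counts how many come out “significant” before and after a Bonferroni correction (the number of tests is arbitrary):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
n_tests, alpha = 100, 0.05

# Every dataset is pure noise, so H0 (mean = 0) is true for all of them.
data = rng.normal(loc=0.0, scale=1.0, size=(n_tests, 30))
pvalues = np.array([ttest_1samp(row, popmean=0.0).pvalue for row in data])

print("significant at alpha:        ", (pvalues <= alpha).sum())            # roughly 5 false positives
print("significant after Bonferroni:", (pvalues <= alpha / n_tests).sum())  # usually 0
```

Roughly α · 100 = 5 of the uncorrected tests will come up significant by chance alone; the Bonferroni correction divides α by the number of tests to keep the overall false-positive rate near α.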
How Effective Is It?
Strengths
- Well-established and widely used in scientific and business research.
- Offers a clear framework for testing assumptions.
- Works well when the assumptions (normality, independence, etc.) hold.
Weaknesses
- Easy to misuse or misinterpret.
- Can be gamed (e.g., by cherry-picking results).
- Doesn’t provide the size or direction of an effect (you need confidence intervals or effect sizes for that).