Hypothesis testing is a method used to determine whether there is enough evidence in a sample of data to infer that a certain condition holds true for the entire population. It starts with a null hypothesis (H₀), which assumes no effect or difference, and uses a p-value to measure how likely the observed data would occur if H₀ were true. If the p-value is below a chosen threshold (like 0.05), the null hypothesis is rejected, suggesting that the observed effect is statistically significant.
Contents:
- Null Hypothesis vs. Alternative Hypothesis
- What a P-Value is
- How they are used together in statistical hypothesis testing
- When and why they are useful
- Limitations and when not to use them
What Is the Null Hypothesis?
The null hypothesis (often written as H₀) is a starting assumption in statistics. It represents the idea that there is no effect, no difference, or nothing unusual happening in your data.
- It’s the idea you are trying to challenge.
- It’s the opposite of what you are trying to test for.
The property you are trying to test for is the alternative hypothesis (H₁ or Hₐ), which is the effect you are testing for statistical significance.
For example:
Thesis – Trading signal X predicts market direction more accurately than random entry.
- H₀ = Trading signal X has no benefit over a random entry trading signal.
- H₁ = Trading signal X has a benefit over a random entry trading signal.
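As a minimal sketch of how this could be tested, suppose the signal called the direction correctly on 140 of 250 trades (these counts are invented for illustration). Under H₀ the hit rate is 50%, so a one-sided binomial test applies:

```python
from scipy.stats import binomtest

# Hypothetical record: 140 correct directional calls out of 250 trades
n_trades = 250
n_correct = 140

# H0: hit rate = 0.5 (no better than random entry)
# H1: hit rate > 0.5 (the signal has predictive value)
result = binomtest(n_correct, n_trades, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.4f}")
```

A small p-value here would be evidence against random-entry performance; what counts as “small” is the significance level discussed below.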
What Is a P-Value?
The p-value is a number between 0 and 1 that tells you how likely it is to observe data at least as extreme as yours if the null hypothesis were true.
In simple terms:
The smaller the p-value, the stronger the evidence against the null hypothesis.
More precisely, the p-value is the probability, computed under the assumption that H₀ is true, of seeing results at least as extreme as the ones you observed. The lower the value, the harder the data are to explain by chance alone, and the stronger the case for the alternative hypothesis H₁; this is what makes the observed effect statistically significant.
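One way to make this definition concrete is a small simulation: generate many samples from a world where H₀ is true and count how often chance alone produces a result at least as extreme as the observed one. The sketch below assumes a known standard deviation of 1 and purely illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# An observed sample (simulated here with a small true effect)
observed = rng.normal(loc=0.3, scale=1.0, size=50)
observed_stat = abs(observed.mean())

# The null world: true mean is 0. Simulate it 100,000 times and ask how
# often chance alone yields a sample mean at least as extreme as ours.
null_means = rng.normal(loc=0.0, scale=1.0, size=(100_000, 50)).mean(axis=1)
p_value = (np.abs(null_means) >= observed_stat).mean()
print(f"simulated p-value: {p_value:.4f}")
```

The printed fraction is exactly the quantity the p-value estimates: how surprising the observed result would be if nothing were going on.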
How Null Hypothesis and P-Values Work Together in Hypothesis Testing
The basic steps of a hypothesis test:
1. State the Hypotheses
- Null Hypothesis (H₀): e.g., μ = 0
- Alternative Hypothesis (H₁): e.g., μ ≠ 0
2. Choose a Significance Level (α):
- This is your threshold for rejecting H₀.
- Common choices: α = 0.05 (5%), 0.01 (1%)
3. Collect Data and Run a Test:
- Use statistical tests: t-test, chi-square test, ANOVA, etc.
- These give you a p-value
4. Compare p-value to α:
- If p ≤ α → Reject H₀ (evidence against the null)
- If p > α → Fail to reject H₀ (not enough evidence)
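Put together, the four steps fit in a few lines. The sketch below uses a one-sample t-test on simulated data; the sample, effect size, and α are assumptions made for illustration:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)

# 1. State the hypotheses: H0: mu = 0, H1: mu != 0

# 2. Choose a significance level
alpha = 0.05

# 3. Collect data (simulated here) and run a statistical test
sample = rng.normal(loc=0.4, scale=1.0, size=30)
result = ttest_1samp(sample, popmean=0.0)

# 4. Compare the p-value to alpha
print(f"p-value = {result.pvalue:.4f}")
if result.pvalue <= alpha:
    print("Reject H0: evidence against the null")
else:
    print("Fail to reject H0: not enough evidence")
```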
Example:
Let’s say you’re testing whether a new teaching method improves scores.
- H₀: No improvement in test scores (mean = 70)
- H₁: Improvement (mean > 70)
- α = 0.05
- You get a p-value of 0.01
Since 0.01 < 0.05 → Reject H₀ → The new method likely works.
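A sketch of this one-sided test, assuming the new method’s scores are available as a sample (the scores below are invented):

```python
from scipy.stats import ttest_1samp

# Hypothetical scores from a class taught with the new method
scores = [74, 71, 78, 69, 75, 80, 72, 77, 73, 76]

# H0: mean = 70 vs. H1: mean > 70, so the test is one-sided
result = ttest_1samp(scores, popmean=70, alternative="greater")
print(f"p-value = {result.pvalue:.4f}")  # reject H0 at alpha = 0.05 if p <= 0.05
```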
The significance level α is the false-positive risk you agree to accept in advance: the probability of rejecting H₀ when H₀ is actually true.
The α you choose determines how strict the test is. 1% and 5% are the most common choices; 10% is sometimes used, but it is often considered less reliable because it accepts a 10% chance of a false positive, that is, of declaring H₁ significant when in fact we should fail to reject H₀.
When to Use P-Value Testing
- You want to test a specific claim or effect in a data sample.
- You have enough data to make probabilistic conclusions.
- You want to make objective, data-driven decisions.
- You are comparing two or more groups, and want to know if differences are real or due to chance.
Limitations & When Not to Use It
1. Misinterpretation of the P-Value
- P-value ≠ probability the null hypothesis is true.
- It’s about the probability of your data given H₀ is true, not the other way around.
2. Statistical Significance ≠ Practical Significance
- A result may be “statistically significant” but have tiny or meaningless real-world impact.
- E.g., a drug improves recovery by 0.1% with p=0.01. Is it worth it?
3. P-Hacking / Multiple Testing
- If you run lots of tests, some will come up significant just by chance.
- This inflates false positives unless you correct for it (e.g., the Bonferroni correction; see the sketch after this list).
4. Overreliance on Arbitrary Thresholds
- People treat 0.05 like a sacred cutoff, but it’s just a convention.
- P=0.049 → “Significant!” vs. P=0.051 → “Not significant!” is a silly binary.
5. Sensitivity to Sample Size
- With too little data, you might miss real effects (false negatives).
- With too much data, even tiny effects become “significant”.
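As a minimal illustration of the multiple-testing problem from point 3, the simulation below runs 100 t-tests on pure noise, where H₀ is true by construction, and counts how many come out “significant” before and after a Bonferroni correction (the number of tests is arbitrary):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
n_tests, alpha = 100, 0.05

# Every dataset is pure noise, so H0 (mean = 0) is true for all of them.
data = rng.normal(loc=0.0, scale=1.0, size=(n_tests, 30))
pvalues = np.array([ttest_1samp(row, popmean=0.0).pvalue for row in data])

print("significant at alpha:        ", (pvalues <= alpha).sum())            # roughly 5 false positives
print("significant after Bonferroni:", (pvalues <= alpha / n_tests).sum())  # usually 0
```

Roughly α · 100 = 5 of the uncorrected tests will come up significant by chance alone; the Bonferroni correction divides α by the number of tests to keep the overall false-positive rate near α.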
How Effective Is It?
Strengths
- Well-established and widely used in scientific and business research.
- Offers a clear framework for testing assumptions.
- Works well when the assumptions (normality, independence, etc.) hold.
Weaknesses
- Easy to misuse or misinterpret.
- Can be gamed (e.g., by cherry-picking results).
- Doesn’t provide the size or direction of an effect (you need confidence intervals or effect sizes for that).