
The Insignificance of Statistical Significance Testing

What is Statistical Hypothesis Testing?


Four basic steps constitute statistical hypothesis testing. First, one develops a null hypothesis about some phenomenon or parameter. This null hypothesis is generally the opposite of the research hypothesis, which is what the investigator truly believes and wants to demonstrate. Research hypotheses may be generated either inductively, from a study of observations already made, or deductively, deriving from theory. Next, data are collected that bear on the issue, typically by an experiment or by sampling. (Null hypotheses often are developed after the data are in hand and have been rummaged through, but that's another topic.) A statistical test of the null hypothesis then is conducted, which generates a P-value. Finally, the question of what that value means relative to the null hypothesis is considered. Several interpretations of P often are made.

Sometimes P is viewed as the probability that the results obtained were due to chance. Small values are taken to indicate that the results were not just a happenstance. A large value of P, say for a test that µ = 0, would suggest that the mean actually recorded was due to chance, and µ could be assumed to be zero (Schmidt and Hunter 1997).

Other times, 1-P is considered the reliability of the result, that is, the probability of getting the same result if the experiment were repeated. Significant differences are often termed "reliable" under this interpretation.

Alternatively, P can be treated as the probability that the null hypothesis is true. This interpretation is the most direct one, as it addresses head-on the question that interests the investigator.

These 3 interpretations are what Carver (1978) termed fantasies about statistical significance. None of them is true, although they are treated as if they were true in some statistical textbooks and applications papers. Small values of P are taken to represent strong evidence that the null hypothesis is false, but workers demonstrated long ago (see references in Berger and Sellke 1987) that such is not the case. In fact, Berger and Sellke (1987) gave an example for which a P-value of 0.05 was attained with a sample of n = 50, but the probability that the null hypothesis was true was 0.52. Further, the disparity between P and Pr[H0 | data], the probability of the null hypothesis given the observed data, increases as samples become larger.

In reality, P is the Pr[observed or more extreme data | H0], the probability of the observed data or data more extreme, given that the null hypothesis is true, the assumed model is correct, and the sampling was done randomly. Let us consider the first two assumptions.
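
The distinction between P and Pr[H0 | data] is easy to demonstrate by simulation. The following sketch is my own construction, not a calculation from any of the papers cited; it arbitrarily assumes a 50:50 prior on the null hypothesis and a Normal(0, 1) distribution for µ when the null is false, and it tallies how often the null hypothesis is actually true among experiments of n = 50 that happen to produce P-values near 0.05.

```python
# Simulation sketch: among experiments that yield P near 0.05, how often is H0 true?
# Assumptions (mine, for illustration only): H0 (mu = 0) holds in half of all
# experiments; otherwise mu ~ Normal(0, 1); each experiment records the mean of
# n = 50 observations with unit standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_experiments = 50, 200_000

h0_true = rng.random(n_experiments) < 0.5                 # experiments in which mu really is 0
mu = np.where(h0_true, 0.0, rng.normal(0.0, 1.0, n_experiments))
xbar = rng.normal(mu, 1.0 / np.sqrt(n))                   # sampling distribution of the mean
p = 2 * stats.norm.sf(np.abs(xbar) * np.sqrt(n))          # two-sided P-values

near_005 = (p > 0.04) & (p < 0.06)                        # results "significant at about 0.05"
print("Pr[H0 true | P near 0.05] =", round(h0_true[near_005].mean(), 2))
# The printed value is near one-half, not 0.05: P is not Pr[H0 | data].
```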

What are More Extreme Data?

Suppose you have a sample consisting of 10 males and 3 females. For a null hypothesis of a balanced sex ratio, what samples would be more extreme? The answer to that question depends on the sampling plan used to collect the data (i.e., what stopping rule was used). The most obvious answer is based on the assumption that a total of 13 individuals were sampled. In that case, outcomes more extreme than 10 males and 3 females would be 11 males and 2 females, 12 males and 1 female, and 13 males and no females.

However, the investigator might have decided to stop sampling as soon as he encountered 10 males. Were that the situation, the possible outcomes more extreme against the null hypothesis would be 10 males and 2 females, 10 males and 1 female, and 10 males and no females. Conversely, the investigator might have collected data until 3 females were encountered. The number of more extreme outcomes is then infinite: they include 11 males and 3 females, 12 males and 3 females, 13 males and 3 females, etc. Alternatively, the investigator might have collected data until the difference between the numbers of males and females was 7, or until the difference was significant at some level. Each set of more extreme outcomes has its own probability, which, along with the probability of the result actually obtained, constitutes P.
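
To illustrate with two of the sampling plans just described, the sketch below (my own calculation, not part of the original argument) computes one-sided P-values for the same observed data, 10 males and 3 females, under a fixed sample of 13 and under sampling that stops at the third female. The two plans assign different sets of "more extreme" outcomes and hence different P-values.

```python
# One-sided P-values for the same data (10 males, 3 females) under two sampling
# plans, testing H0: Pr(male) = 0.5.  Sketch only; the other stopping rules
# described in the text can be handled the same way.
from scipy import stats

# Plan A: exactly 13 animals were to be sampled.  "More extreme" = 10, 11, 12,
# or 13 males out of 13.
p_fixed_n = stats.binom.sf(9, 13, 0.5)             # Pr[males >= 10], Binomial(13, 0.5)

# Plan B: sampling was to stop at the 3rd female.  "More extreme" = 10 or more
# males observed before the 3rd female appears.
p_stop_at_3rd_female = stats.nbinom.sf(9, 3, 0.5)  # Pr[males >= 10], NegBinomial(3, 0.5)

print(round(p_fixed_n, 4), round(p_stop_at_3rd_female, 4))   # about 0.046 versus 0.019
```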

The point is that determining which outcomes of an experiment or survey are more extreme than the observed one, so a P-value can be calculated, requires knowledge of the intentions of the investigator (Berger and Berry 1988). Hence, P, the outcome of a statistical hypothesis test, depends on results that were not obtained, that is, something that did not happen, and what the intentions of the investigator were.

Are Null Hypotheses Really True?

P is calculated under the assumption that the null hypothesis is true. Most null hypotheses tested, however, state that some parameter equals zero, or that some set of parameters are all equal. These hypotheses, called point null hypotheses, are almost invariably known to be false before any data are collected (Berkson 1938, Savage 1957, Johnson 1995). If such hypotheses are not rejected, it is usually because the sample size is too small (Nunnally 1960).

To see if the null hypotheses being tested in The Journal of Wildlife Management can validly be considered to be true, I arbitrarily selected two issues: one from the 1996 volume, the other from 1998. I scanned the results section of each paper, looking for P-values. For each P-value I found, I looked back to see what hypothesis was being tested. I made a very biased selection of some conclusions reached by rejecting null hypotheses; these include: (1) the occurrence of sheep remains in coyote (Canis latrans) scats differed among seasons (P = 0.03, n = 467), (2) duckling body mass differed among years (P < 0.0001), and (3) the density of large trees was greater in unlogged forest stands than in logged stands (P = 0.02). (The last is my personal favorite.) Certainly we knew before any data were collected that the null hypotheses being tested were false. Sheep remains must have varied among seasons, if only between 61.1% in 1 season and 61.2% in another. The only question was whether or not the sample size was sufficient to detect the difference. Likewise, we know before data are collected that real differences exist in the other examples, which is what Abelson (1997) referred to as "gratuitous" significance testing: testing what is already known.
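
To see how empty such null hypotheses are, consider the hypothetical 61.1% versus 61.2% contrast just mentioned. A rough calculation of my own, using the usual normal approximation for comparing two proportions, shows what "detecting" so trivial a difference would demand.

```python
# Rough sample-size sketch (normal approximation, two-sided two-proportion test,
# alpha = 0.05, power = 0.80) for detecting a 61.1% vs. 61.2% difference in the
# occurrence of sheep remains between two seasons.  Hypothetical figures.
from scipy import stats

p1, p2 = 0.611, 0.612
alpha, power = 0.05, 0.80
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
n_per_season = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n_per_season))   # on the order of 3.7 million scats per season
```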

Three comments can be made in favor of the point null hypothesis, such as H0: µ = µ0. First, while such hypotheses are virtually always false for sampling studies, they may be reasonable for experimental studies in which subjects are randomly assigned to treatment groups (Mulaik et al. 1997). Second, if the sample size is modest (Rindskopf 1997), testing a point null hypothesis does provide a reasonable approximation to a more appropriate question: is µ nearly equal to µ0 (Berger and Delampady 1987, Berger and Sellke 1987)? Large sample sizes, however, will produce small P-values even when µ is nearly equal to µ0. Third, testing the point null hypothesis is mathematically much easier than testing composite null hypotheses, which involve noncentrality parameters (Steiger and Fouladi 1997).

The bottom line on P-values is that they relate to data that were never observed, under a model that is known to be false. How meaningful can they be? But at least they are objective; or are they?

P is Arbitrary

If the null hypothesis truly is false (as most of those tested really are), then P can be made as small as one wishes, by getting a large enough sample. P is a function of (1) the difference between reality and the null hypothesis and (2) the sample size. Suppose, for example, that you are testing to see if the mean of a population (µ) is, say, 100. The null hypothesis then is H0: µ = 100, versus the alternative hypothesis of H1: µ ≠ 100. One might use Student's t-test, which is

t = (x̄ - 100) / (S / √n)

where x̄ is the mean of the sample, S is the standard deviation of the sample, and n is the sample size. Clearly, t can be made arbitrarily large (and the P-value associated with it arbitrarily small) by making either (x̄ - 100) or √n large enough. As the sample size increases, x̄ and S will approximately stabilize at the true parameter values, so if µ differs from 100 at all, a large value of n translates into a large value of t. This strong dependence of P on the sample size led Good (1982) to suggest that P-values be standardized to a sample size of 100, by replacing P with P√(n/100) (or 0.5, if that is smaller).
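
A short numerical sketch (my own numbers, purely illustrative) makes the dependence explicit. Holding the sample mean at 101 and the sample standard deviation at 10, the same trivial departure from H0: µ = 100 produces whatever P-value the sample size dictates; Good's standardized P-values are shown for comparison.

```python
# The same small departure from H0: mu = 100 (sample mean 101, standard
# deviation 10) yields markedly different P-values as n grows.  Good's (1982)
# standardization to n = 100 is shown alongside.  Illustrative numbers only.
import numpy as np
from scipy import stats

xbar, s, mu0 = 101.0, 10.0, 100.0
for n in (25, 100, 400, 2500):
    t = (xbar - mu0) / (s / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t), df=n - 1)        # two-sided P-value
    p_std = min(0.5, p * np.sqrt(n / 100))      # Good's standardized P-value
    print(f"n = {n:4d}   t = {t:5.2f}   P = {p:.4g}   standardized P = {p_std:.4g}")
```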

Even more arbitrary in a sense than P is the use of a standard cutoff value, usually denoted α. P-values less than or equal to α are deemed significant; those greater than α are nonsignificant. Use of α was advocated by Jerzy Neyman and Egon Pearson, whereas R. A. Fisher recommended presentation of observed P-values instead (Huberty 1993). Use of a fixed α level, say α = 0.05, promotes the seemingly nonsensical distinction between a significant finding if P = 0.049, and a nonsignificant finding if P = 0.051. Such minor differences are illusory anyway, as they derive from tests whose assumptions often are only approximately met (Preece 1990). Fisher objected to the Neyman-Pearson procedure because of its mechanical, automated nature (Mulaik et al. 1997).

Proving the Null Hypothesis

Discourses on hypothesis testing emphasize that null hypotheses cannot be proved; they can only be disproved (rejected). Failing to reject a null hypothesis does not mean that it is true. Especially with small samples, one must be careful not to accept the null hypothesis. Consider a test of the null hypothesis that a mean µ equals µ0. The situations illustrated in Figure 1 both reflect a failure to reject that hypothesis. Figure 1A suggests the null hypothesis may well be false, but the sample was too small to indicate significance; there is a lack of power. Conversely, Figure 1B shows that the data truly were consistent with the null hypothesis. The two situations should lead to different conclusions about µ, but the P-values associated with the tests are identical.

Taking another look at the two issues of The Journal of Wildlife Management, I noted a number of articles that indicated a null hypothesis was proven. Among these were (1) no difference in slope aspect of random snags (P = 0.112, n = 57), (2) no difference in viable seeds (F2,6 = 3.18, P = 0.11), (3) lamb kill was not correlated to trapper hours (r12 = 0.50, P = 0.095), (4) no effect due to month (P = 0.07, n = 15), and (5) no significant differences in survival distributions (P-values > 0.014!, n variable). I selected the examples to illustrate null hypotheses claimed to be true, despite small sample sizes and P-values that were small but (usually) >0.05. All examples, I believe, reflect the lack of power (Fig. 1A) while claiming a lack of effect (Fig. 1B).

Fig. 1. Results of a test that failed to reject the null hypothesis that a mean equals 0. Shaded areas indicate regions for which the null hypothesis would be rejected. (A) suggests the null hypothesis may well be false, but the sample was too small to indicate significance; there is a lack of power. (B) suggests the data truly were consistent with the null hypothesis.
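
Confidence intervals make the contrast between the two situations in Figure 1 immediate. In the hypothetical sketch below (numbers invented for illustration), both tests of H0: µ = 0 give essentially the same P-value, yet the interval from the small sample leaves room for large effects while the interval from the large sample pins µ near zero.

```python
# Two hypothetical tests of H0: mu = 0 with essentially the same P-value but
# very different implications, which the 95% confidence intervals reveal at once.
import numpy as np
from scipy import stats

def summarize(label, n, xbar, s):
    se = s / np.sqrt(n)                              # standard error of the mean
    t = xbar / se
    p = 2 * stats.t.sf(abs(t), df=n - 1)             # two-sided P-value
    half = stats.t.ppf(0.975, df=n - 1) * se         # half-width of the 95% CI
    print(f"{label}: n = {n:4d}   P = {p:.2f}   95% CI = ({xbar - half:6.2f}, {xbar + half:6.2f})")

summarize("A (as in Fig. 1A)", n=10,   xbar=5.00, s=15.81)   # wide CI: large effects not ruled out
summarize("B (as in Fig. 1B)", n=1000, xbar=0.05, s=1.581)   # narrow CI: mu evidently near 0
```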

Power Analysis

Power analysis is an adjunct to hypothesis testing that has become increasingly popular (Peterman 1990, Thomas and Krebs 1997). The procedure can be used to estimate the sample size needed to have a specified probability (power = 1 - β) of declaring as significant (at the α level) a particular difference or effect (effect size). As such, the process can be useful for designing a survey or experiment (Gerard et al. 1998). Its use is sometimes recommended to ascertain the power of the test after a study has been conducted and nonsignificant results obtained (The Wildlife Society 1995). The notion is to guard against wrongly declaring the null hypothesis to be true. Such retrospective power analysis can be misleading, however. Steidl et al. (1997:274) noted that power estimated with the data used to test the null hypothesis and the observed effect size is meaningless, as a high P-value will invariably result in low estimated power. Retrospective power estimates may be meaningful if they are computed with effect sizes different from the observed effect size. Power analysis programs, however, assume the input values for effect and variance are known, rather than estimated, so they give misleadingly high estimates of power (Steidl et al. 1997, Gerard et al. 1998). In addition, although statistical hypothesis testing invokes what I believe to be 1 rather arbitrary parameter (α or P), power analysis requires three of them (α, β, and effect size). For further comments see Shaver (1993:309), who termed power analysis "a vacuous intellectual game," and who noted that the tendency to use criteria, such as Cohen's (1988) standards for small, medium, and large effect sizes, is as mindless as the practice of using the α = 0.05 criterion in statistical significance testing. Questions about the likely size of true effects can be better addressed with confidence intervals than with retrospective power analyses (e.g., Steidl et al. 1997, Steiger and Fouladi 1997).
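
The circularity of retrospective power is easy to see in a simple case. For a two-sided z-test with the observed effect plugged in as if it were the true effect (a sketch of my own, not a calculation from any of the papers cited), the "power" estimate is nothing more than a transformation of the observed P-value.

```python
# Retrospective "power" for a two-sided z-test, computed by plugging the
# observed effect in as the true effect, is a fixed function of the observed
# P-value: a high P always yields a low estimated power.  Sketch only.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)
for p_observed in (0.05, 0.10, 0.25, 0.50, 0.90):
    z_obs = stats.norm.ppf(1 - p_observed / 2)       # |z| implied by the observed P
    power = stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)
    print(f"observed P = {p_observed:.2f}   ->   retrospective power = {power:.2f}")
```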

Biological Versus Statistical Significance

Many authors make note of the distinction between statistical significance and subject-matter (in our case, biological) significance. Unimportant differences or effects that do not attain significance are okay, and important differences that do show up significant are excellent, for they facilitate publication (Table 1). Unimportant differences that turn out significant are annoying, and important differences that fail statistical detection are truly depressing. Recalling our earlier comments about the effect of sample size on P-values, the two outcomes that please the researcher suggest the sample size was about right (Table 2). The annoying unimportant differences that were significant indicate that too large a sample was obtained. Further, if an important difference was not significant, the investigator concludes that the sample was insufficient and calls for further research. This schizophrenic nature of the interpretation of significance greatly reduces its value.

Table 1. Reaction of investigator to results of a statistical significance test (after Nester 1996).
                                               Statistical significance
Practical importance of observed difference    Not significant    Significant
Not important                                  Happy              Annoyed
Important                                      Very sad           Elated

Table 2. Interpretation of sample size as related to results of a statistical significance test.
                                               Statistical significance
Practical importance of observed difference    Not significant    Significant
Not important                                  n okay             n too big
Important                                      n too small        n okay

Other Comments on Hypothesis Tests

Statistical hypothesis testing has received an enormous amount of criticism, and for a rather long time. As long ago as 1963, Clark (1963:466) noted that it was "no longer a sound or fruitful basis for statistical investigation." Bakan (1966:436) called it "essential mindlessness in the conduct of research." The famed quality guru W. Edwards Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. Guttman (1985) recognized that "In practice, of course, tests of significance are not taken seriously." Loftus (1991) found it difficult to imagine a less insightful way to translate data into conclusions. Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" Barnard (1998:47) argued that "... simple P-values are not now used by the best statisticians." These examples are but a fraction of the comments made by statisticians and users of statistics about the role of statistical hypothesis testing. While many of the arguments against significance tests stem from their misuse, rather than their intrinsic values (Mulaik et al. 1997), I believe that 1 of their intrinsic problems is that they do encourage misuse.