Significance tests harm progress in forecasting

https://doi.org/10.1016/j.ijforecast.2007.03.004

Abstract

I briefly summarize prior research showing that tests of statistical significance are improperly used even in leading scholarly journals. Attempts to educate researchers to avoid these pitfalls have had little success. Even when done properly, however, statistical significance tests are of no value. Other researchers have discussed reasons for these failures. I was unable to find empirical evidence to support the use of significance tests under any conditions. I then show that tests of statistical significance are harmful to the development of scientific knowledge because they distract the researcher from the use of proper methods. I illustrate the dangers of significance tests by examining a re-analysis of the M3-Competition. Although the authors of the re-analysis conducted a proper series of statistical tests, they suggested that the original M3-Competition was not justified in concluding that combined forecasts reduce errors, and that the choice of the best method depends on the error measure used. I show that the original conclusions were correct. Authors should avoid tests of statistical significance; instead, they should report effect sizes, confidence intervals, replications/extensions, and meta-analyses. Practitioners should ignore significance tests, and journals should discourage them.
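
The reporting the abstract recommends can be made concrete. Below is a minimal sketch (in Python, using only the standard library; the function name and data are hypothetical illustrations, not taken from the paper) of reporting an effect size with a confidence interval in place of a significance test.

    # Minimal sketch: report an effect size (Cohen's d) with an approximate
    # 95% confidence interval instead of a p-value. Data are hypothetical.
    import math
    import statistics

    def cohens_d_with_ci(a, b, z=1.96):
        """Cohen's d for two independent samples, with an approximate 95% CI.

        Uses the common large-sample normal approximation for the standard
        error of d; exact methods use the noncentral t distribution.
        """
        na, nb = len(a), len(b)
        pooled_sd = math.sqrt(
            ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b))
            / (na + nb - 2)
        )
        d = (statistics.mean(a) - statistics.mean(b)) / pooled_sd
        se = math.sqrt((na + nb) / (na * nb) + d * d / (2 * (na + nb)))
        return d, (d - z * se, d + z * se)

    # Hypothetical forecast errors under two methods.
    d, (lo, hi) = cohens_d_with_ci([2.1, 1.8, 2.4, 2.0, 1.9],
                                   [2.6, 2.9, 2.4, 2.8, 2.7])
    print(f"effect size d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

Reported this way, a reader sees both the size of the effect and the uncertainty around it, which a bare pass/fail significance verdict conceals.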

Section snippets

Problems with statistical significance

Researchers have been arguing since 1931 that tests of statistical significance are confusing and misleading. The criticisms have become more common in recent years. Critics find that

  • researchers publish faulty interpretations of statistical significance in leading economics journals (McCloskey & Ziliak, 1996), in psychology (Cohen, 1988), and in management sciences (Hubbard & Bayarri, 2003).

  • journal reviewers misinterpret statistical significance (e.g., see the experiment by Atkinson, Furlong, & …

Statistical significance is harmful for communicating scientific advances

Schmidt and Hunter (1997, p. 38) concluded, “… reliance on significance testing is logically indefensible and retards the research enterprise by making it difficult to develop cumulative knowledge.” For example, numerous studies had failed to find statistically significant differences in response rates due to the length of mail surveys, leading many researchers to conclude that length made no difference. However, meta-analyses that used only effect sizes showed that length did make a difference, …
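
The mail-survey example illustrates why pooling effect sizes succeeds where study-by-study significance testing fails. Below is a toy sketch (in Python, standard library only; all numbers are hypothetical and not drawn from the studies cited) in which none of five small studies is individually “significant” at the 5% level, yet the fixed-effect (inverse-variance) pooled estimate is clearly positive.

    # Toy meta-analysis: five hypothetical small studies, each one
    # non-significant on its own, pooled into a precise overall estimate.
    import math

    # Hypothetical (effect size, standard error) pairs.
    studies = [(0.30, 0.20), (0.25, 0.22), (0.35, 0.19),
               (0.20, 0.21), (0.28, 0.20)]

    # Each study alone: |effect| < 1.96 * SE, so each 95% CI includes zero.
    for i, (eff, se) in enumerate(studies, 1):
        print(f"study {i}: effect {eff:.2f}, "
              f"95% CI [{eff - 1.96 * se:.2f}, {eff + 1.96 * se:.2f}]")

    # Fixed-effect inverse-variance pooling combines the evidence.
    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * eff for (eff, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    print(f"pooled effect {pooled:.2f}, 95% CI "
          f"[{pooled - 1.96 * pooled_se:.2f}, {pooled + 1.96 * pooled_se:.2f}]")

Here the pooled confidence interval excludes zero even though every individual interval includes it, which is precisely the cumulative-knowledge point Schmidt and Hunter make.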

Significance testing of the M3-Competition

The dangers involved in testing statistical significance apply even when the testing is properly done and fully reported. I use a recent application of statistical significance testing in forecasting to illustrate these dangers.

Koning, Franses, Hibon, and Stekler (2005), referred to hereafter as KFHS, used tests of statistical significance to support their claim that such tests should be used in forecasting. They examined four conclusions from the M3-Competition (Makridakis & Hibon, 2000) that …
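
The combining conclusion at issue is easy to demonstrate in miniature. Below is a toy sketch (in Python, standard library only; the series and forecasts are hypothetical, and MAPE merely stands in for the several error measures reported in the M3-Competition) in which an equal-weights average of three methods' forecasts has a lower error than any single method.

    # Toy illustration of combining: an equal-weights average of several
    # methods' forecasts versus each method alone. Data are hypothetical.
    actual = [112.0, 118.0, 121.0, 127.0]

    # Hypothetical forecasts from three methods for the same four periods.
    forecasts = {
        "method_a": [108.0, 120.0, 117.0, 131.0],
        "method_b": [115.0, 114.0, 125.0, 124.0],
        "method_c": [110.0, 121.0, 118.0, 130.0],
    }

    def mape(pred, obs):
        """Mean absolute percentage error."""
        return 100 * sum(abs(p - o) / o for p, o in zip(pred, obs)) / len(obs)

    # Equal-weights combination: average the methods period by period.
    combined = [sum(col) / len(col) for col in zip(*forecasts.values())]

    for name, pred in forecasts.items():
        print(f"{name}: MAPE {mape(pred, actual):.2f}%")
    print(f"equal-weights combination: MAPE {mape(combined, actual):.2f}%")

Combining helps here because the methods' errors are only partly correlated, so averaging cancels some of each method's idiosyncratic error; that mechanism, not a significance test, is what supports the original conclusion.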

Discussion

My conclusions relate to the use of statistical significance in the development of knowledge about forecasting. This does not rule out the possibility that statistical significance might help in other areas, such as (1) aiding decision makers by flagging areas that need attention (e.g., quality control charts); (2) forming part of a forecasting procedure (e.g., helping to decide whether to apply a seasonal adjustment or when to damp trends); or (3) serving as a guide to a scientist who is …

Conclusions

Schmidt and Hunter (1997) refuted objections to the conclusion that significance testing should be avoided. They requested empirical evidence that would support the use of tests of statistical significance in any situation, but were unable to obtain any. I renewed this call for evidence on email lists, on websites, and in talks, and I have also been unable to find any empirical evidence. On the other hand, many people offered their opinions that statistical …

Acknowledgments

Many people provided useful suggestions. In thanking them, I do not mean to imply that they agree with my positions. They include Kay A. Armstrong, Eric Bradlow, Chris Chatfield, Fred Collopy, Robert Fildes, Kesten Green, Ray Hubbard, Randall Jones, Magne Jørgensen, Mark Little, Keith Ord, Frank L. Schmidt, William Starbuck, Malcolm Wright, and Tom Yokum. I also solicited suggestions from the four authors of the Koning et al. paper and sent copies to some of the researchers cited in this paper.

