Significance tests harm progress in forecasting
Problems with statistical significance
Researchers have been arguing since 1931 that tests of statistical significance are confusing and misleading, and the criticisms have become more frequent in recent years. Critics find that
- researchers publish faulty interpretations of statistical significance in leading journals in economics (McCloskey & Ziliak, 1996), psychology (Cohen, 1988), and the management sciences (Hubbard & Bayarri, 2003);
- journal reviewers misinterpret statistical significance (e.g., see the experiment by Atkinson, Furlong, &…
Statistical significance is harmful for communicating scientific advances
Schmidt and Hunter (1997, p. 38) concluded, "…reliance on significance testing is logically indefensible and retards the research enterprise by making it difficult to develop cumulative knowledge." For example, numerous studies had failed to find statistically significant differences in response rates due to the length of mail surveys, and many researchers therefore concluded that length made no difference. However, meta-analyses that used only effect sizes showed that length did make a difference.
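The mail-survey example is an instance of a general pattern: several underpowered studies can each fail to reach statistical significance even when they all point in the same direction, while pooling their effect sizes reveals the effect clearly. A minimal sketch, using made-up effect estimates and standard errors (illustrative numbers, not the actual mail-survey data) and a standard fixed-effect, inverse-variance meta-analysis:

```python
import math

# Hypothetical effect estimates (standardized mean differences) and
# standard errors from six small studies -- illustrative numbers only.
effects = [0.25, 0.10, 0.30, 0.18, 0.22, 0.15]
ses     = [0.20, 0.19, 0.21, 0.18, 0.20, 0.19]

# Taken one at a time, every study is "not significant" at the 5% level,
# since each |z| = |d| / se falls below 1.96.
for d, se in zip(effects, ses):
    print(f"study: d = {d:.2f}, z = {d / se:.2f}")

# Fixed-effect (inverse-variance) pooling of the effect sizes:
# weight each study by 1/se^2, then combine.
weights = [1.0 / se ** 2 for se in ses]
pooled_d = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))
z = pooled_d / pooled_se
lo, hi = pooled_d - 1.96 * pooled_se, pooled_d + 1.96 * pooled_se

print(f"pooled: d = {pooled_d:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], z = {z:.2f}")
# No single test is significant, yet the pooled estimate is clearly
# positive -- the pattern the meta-analyses of survey length revealed.
```

Counting "significant vs. non-significant" studies here would wrongly suggest no effect; the pooled effect size tells the opposite, and correct, story.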
Significance testing of the M3-Competition
The dangers involved in testing statistical significance apply even when the testing is properly done and fully reported. I use a recent application of statistical significance testing in forecasting to illustrate these dangers.
Koning, Franses, Hibon, and Stekler (2005), referred to hereafter as KFHS, used tests of statistical significance to support their claim that such tests should be used in forecasting. They examined four conclusions from the M3-Competition (Makridakis & Hibon, 2000) that…
Discussion
My conclusions relate to the use of statistical significance in the development of knowledge about forecasting. This does not rule out the possibility that statistical significance might help in other areas, such as (1) aiding decision makers by flagging areas that need attention (e.g., quality control charts); (2) serving as part of a forecasting procedure (e.g., helping to decide whether to apply a seasonality adjustment or when to damp trends); or (3) serving as a guide to a scientist who is…
Conclusions
Schmidt and Hunter (1997) refuted objections to the conclusion that significance testing should be avoided. They requested empirical evidence that would support the use of tests of statistical significance in any situation, but were unable to obtain any. I renewed this call for evidence on email lists, on websites, and in talks, and I have also been unable to find any empirical evidence. On the other hand, many people offered their opinions that statistical…
Acknowledgments
Many people provided useful suggestions. In thanking them, I do not mean to imply that they agree with my positions. They include Kay A. Armstrong, Eric Bradlow, Chris Chatfield, Fred Collopy, Robert Fildes, Kesten Green, Ray Hubbard, Randall Jones, Magne Jørgensen, Mark Little, Keith Ord, Frank L. Schmidt, William Starbuck, Malcolm Wright, and Tom Yokum. I also solicited suggestions from the four authors of the Koning et al. paper and sent copies to some of the researchers cited in this paper.
References (20)
- Findings from evidence-based forecasting: Methods for reducing forecast error. International Journal of Forecasting (2006).
- Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting (1992).
- Debiasing forecasts: How useful is the unbiasedness test? International Journal of Forecasting (2003).
- The M3-Competition: Statistical tests of the results. International Journal of Forecasting (2005).
- The M3-Competition: Results, conclusions and implications. International Journal of Forecasting (2000).
- Econometric forecasting. In Principles of forecasting: A handbook for researchers and practitioners (2001).
- Combining forecasts. In Principles of forecasting: A handbook for researchers and practitioners (2001).
- Statistical significance, reviewer evaluations, and the scientific process: Is there a statistically significant relationship? Journal of Counseling Psychology (1982).
- The earth is round (p < .05). American Psychologist (1988).
Cited by (99)
- Forecasting: theory and practice. International Journal of Forecasting (2022).
- Relative performance of judgmental methods for forecasting the success of megaprojects. International Journal of Forecasting (2022).
- Depression in pregnancy "strongly predicts" depression postpartum: Are we inadvertently misleading clinicians and researchers? Journal of Affective Disorders (2021).
- Interventions as experiments: Connecting the dots in forecasting and overcoming pandemics, global warming, corruption, civil rights violations, misogyny, income inequality, and guns. Journal of Business Research (2020). Citation excerpt: "Also, several statistical domain experts concur with the American Statistical Association committee's (Wasserstein, 2016) conclusion on the topic of using null hypothesis significance testing (NHST): avoid doing so. As aptly put, 'Statistical significance testing is detrimental to advances in science' (Armstrong, 2007a, 2007b). Because of the pervasive use of NHST in medical research, in particular, and business and behavioral research as well, being blunt is necessary: 'The progress of economic science has been seriously damaged…'"
- The art of researching: The scientific importance of null and negative results in psychology. Pratiques Psychologiques (2018).
- Improving time series forecasting: An approach combining bootstrap aggregation, clusters and exponential smoothing. International Journal of Forecasting (2018).