Theory & Science (2003)

ISSN: 1527-5558

Alternatives to Null Hypothesis Significance Testing

Daniel J. Denis, Ph.D.*
Quantitative/Statistical Psychology
University of Montana
daniel.denis@umontana.edu

Abstract

Despite years of criticism, null hypothesis significance testing (NHST) continues to be psychology's most widely employed model of statistical inference. Since Fisher (1925), psychologists have routinely used the model to dismiss sampling error when making substantive inferences, despite the deep methodological and philosophical flaws inherent in the hypothesis-testing procedure. Part of this problem may be due to a perceived lack of available alternatives to traditional null hypothesis significance testing. The present research sought out and evaluated existing alternatives to current NHST. The central question asked was whether there exist any feasible and noteworthy alternatives to today's NHST model. This article provides a summary and review of alternative theory-testing models, which are evaluated based on their ability to either complement or altogether replace today's NHST. It seeks to spread awareness that choices do exist for the psychologist and social scientist alike when “shopping” for a theory-testing paradigm. It is concluded that through the use of effect sizes, confidence intervals, graphical methods, “good-enough” hypotheses, and the comparison of alternative models to account for sample data, there exist a number of useful and very practical alternatives to traditional NHST. Several of these approaches are recommended for use in psychology and beyond. Power analysis is also reviewed, and although deemed a useful complement to NHST, the addition of power-analytic strategies does not “save” the problematic paradigm.

*Correspondence concerning this article should be addressed to Daniel J. Denis, Department of Psychology, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA. Electronic mail may be sent via Internet to dand@yorku.ca.

Introduction

That problems abound with null hypothesis significance testing is no secret. Indeed, years of criticism have made clear the fact that NHST is very problematic. Even the most loyal and steadfast advocate would surely admit that the model is questionable as a means of statistical inference. The problems with NHST are so severe that some have argued for it to be completely banned from scholarly journals (Shrout, 1997). Given this reputation, the issue then becomes that of determining a suitable alternative to NHST as currently used in the social and physical sciences. Although it is fine to point out the difficulties with today's hypothesis testing, it is also necessary to contemplate alternatives that could either complement or replace today's model. This article is written with the view that a model of hypothesis testing is favorable if it constitutes the “best” of available alternatives, even if exhibiting problems of its own. Indeed, even the best model may require amendments. Until a superior model is agreed upon, however, the “best alternative” shall be considered ideal and will be promoted for use in scientific research. As with any theory, a theory of statistical inference is favorable until another competing theory proves itself more scientifically or methodologically desirable.

The purpose of this article is to address the following question -- given that NHST is problematic, what other frequentist procedures 1 or models can be used in its place? Are there any feasible alternatives to null hypothesis significance testing? Many alternatives to NHST have been proposed and will be considered in this article. Each will be evaluated for its ability to either replace or complement today's NHST procedure.

I. Effect Size

Perhaps the most commonly recommended alternative to solely interpreting p values is to determine the magnitude of effect, more commonly known as “effect size.” The use of effect size measures has been recommended repeatedly throughout the social science literature (e.g., see Dunnette, 1966; Huberty, 1987; Rosenthal, 1992; Vaughan & Corballis, 1969). Magnitude of effect measures are indicators of the strength of association between the independent variable(s) and the dependent variable(s). As noted by Huberty (1993), magnitude of association indicators have existed in statistics textbooks since the early 1930s. It wasn't until the 1970s, however, that the term “effect size” was popularized. Since Hays (1963) discussed the use of omega squared (ω²), many articles have been published on the topic (Haase, Waechter, & Solomon, 1982). However, according to Huberty, few textbook authors presently discuss effect size measures. 2 This is despite evidence suggesting that effect size reporting, especially to a younger audience, tends to increase researchers' confidence in results (Nelson, Rosenthal, & Rosnow, 1986). Although the tide is currently shifting and more researchers are beginning to report effect sizes in results sections (Cohen, 1994), effect size has traditionally been ignored by most social science investigators. For instance, in a survey of articles published in British Journal of Psychology and British Journal of Social and Clinical Psychology between the years 1969 and 1972, none reported or discussed strength of relationship among the variables under study (Cochrane & Duffy, 1974). Furthermore, the survey found η² (once it was calculated) to be very weak in 56% of the studies, and approximately one quarter of the studies to have relationships accounting for less than 10% of shared variance. 3 Today, investigators appear to be aware that effect size measures yield important information, despite the fact that few are reported in the social science literature. Prior to considering effect size measures as alternatives to hypothesis testing, a brief review of their nature is in order.

What is Effect Size?

An effect size measure is an indicator of the association that exists between two or more variables. An exception to this is Cohen's d, which is a measure of the standardized distance between means. These definitions translate into how much variance in one variable is accounted for by knowledge of another variable. As noted by some (e.g., Cohen, 1968; Kerlinger & Pedhazur, 1973), the increase in interest in effect size measures among psychologists is associated with their increased awareness of the similarity between ANOVA and regression (Haase, Waechter, & Solomon, 1982). Hence, realizing that assessing group differences is but one way of employing basic correlational techniques may have sparked an interest in effect size indicators.

Determining what constitutes a “big” effect is troubling for work in the social sciences. As noted by Haase et al. (1982), answering the question of “how big is big?” is not as difficult in the natural sciences as it is for psychological science:

Effect sizes and related values of strength of association found in engineering, for example, are likely to be much larger than those found in the typical experiment in the behavioral sciences. Thus all investigators are left to develop their own intuitive standards as to what constitutes a small, medium, or large effect. (p. 59)

So, what constitutes a “big” effect in the average psychology experiment? Cohen (1977), without a doubt one of the strongest advocates of reporting effect size statistics, has issued guidelines as to what constitutes small, medium, and large effect sizes. According to Cohen, for d, 0.20, 0.50, and 0.80 constitute small, medium, and large effects respectively.
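As a concrete, purely illustrative example, the short sketch below computes Cohen's d for two hypothetical groups of scores using the common pooled-standard-deviation formula and labels the result against Cohen's (1977) benchmarks. The data, the helper names, and the “negligible” label for values below 0.20 are assumptions made for the sketch, not anything prescribed by the sources cited here.

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference using a pooled standard deviation."""
    g1, g2 = np.asarray(group1, dtype=float), np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

def label_effect(d):
    """Cohen's (1977) rough benchmarks for d: 0.20 small, 0.50 medium, 0.80 large.
    The 'negligible' label for values below 0.20 is shorthand for this sketch only."""
    d = abs(d)
    if d < 0.20:
        return "negligible"
    if d < 0.50:
        return "small"
    if d < 0.80:
        return "medium"
    return "large"

# Hypothetical scores for an experimental and a control group.
rng = np.random.default_rng(1)
experimental = rng.normal(105, 15, 30)
control = rng.normal(100, 15, 30)
d = cohens_d(experimental, control)
print(f"d = {d:.2f} ({label_effect(d)})")
```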

Effect Size Vs. P Value

Critics of significance testing argue that magnitude of effect measures provide more reliable information than do p values. Moreover, they note that associating strength of effect with a p value is wholly misleading. As Oakes (1986) states, “the fact that two experiments may yield sharply varying test statistics (i.e., p values) is no guarantee that the underlying strengths-of-effect are likewise variant” (p. 50). Hence, it is important not to confuse the significance of a test with a healthy magnitude of effect. Contrary to the p value, effect size measures offer the advantage of not being sensitive to sample size. Therefore, unlike p values, effect size indicators yield a stable numerical value that will not increase or decrease based solely on the total number of subjects used in the research. That is not to say that an increase in sample size may not facilitate finding new effects (and thereby increasing effect size), but only that effect size measures are not “functions” of sample size. A substantial increase in sample size does not necessarily guarantee an increased effect. However, an increase in sample size is almost always associated with a smaller p value.
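The contrast between the two statistics is easy to demonstrate numerically. The sketch below, with arbitrary numbers chosen for illustration rather than data from any study cited here, treats 0.30 as the observed standardized difference d and varies only the per-group sample size; because the two-sample t statistic with equal groups can be written as t = d·sqrt(n/2), the p value falls steadily as n grows even though d itself never changes.

```python
import numpy as np
from scipy import stats

# Hold the observed standardized effect fixed at d = 0.30 and vary only the
# per-group sample size n.  With equal groups, t = d * sqrt(n / 2), so the
# p value shrinks as n grows while the effect size stays put.
d = 0.30
for n in (10, 25, 50, 100, 400):
    t = d * np.sqrt(n / 2)
    df = 2 * n - 2
    p = 2 * stats.t.sf(abs(t), df)          # two-tailed p value
    print(f"n per group = {n:4d}   d = {d:.2f}   t = {t:5.2f}   p = {p:.4f}")
```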

In illustrating the above, consider an example offered by Chow (1996) in which two means are compared. In study A, the mean of the experimental group is 6 while the mean of the control group is 5. Degrees of freedom are equal to 22. Given a t-test, statistical significance is obtained, with an effect size of 0.1. In study B, the mean of the experimental group is 25 while the mean of the control group is 24. Degrees of freedom in this case are equal to 8. Again, given a t-test, but contrary to study A's results, study B did not obtain a significant p value, but had the same magnitude of effect as that found in study A (i.e., effect size = 0.1). The point of this example is that similar studies, although showing an identical degree of shared variance among variables, are nonetheless different in terms of statistical significance. Considering how statistical significance and effect size measures lead to different "conclusions," 4 the question then becomes one of evaluating the importance and relevance of each indicator. Which is the more important to interpret, the p value or effect size?

Perhaps the above question is a bit too narrow. After all, why should we not be able to interpret both? For the purpose of this discussion, however, I will try to answer my original question, and I will assume for the moment that we had to choose one or the other. This in mind, I ask again – which is more important for our purposes, the p value or effect size? Although both indicators might (and should) be used in evaluating the merit of a study, which of the two is the more “informative” and relevant in reporting research? Cohen (1990) claims to have learnt that the primary product of research is that of effect size measures, and not p values. In solely interpreting significance levels, we are left with an inadequate account of the degree of association that exists between variables. Similarly however, as noted by Chow (1996), by interpreting only effect size measures, we are left with an inadequate account of the probability of the data. Whereas a healthy effect size may appear illuminating, it may lose some of its shine when a glance at the p value informs us that the probability of the result being due to chance is still quite high. Hence, a large effect may exist in the sample, but at the same time, results may stand a substantial probability of being due to chance. Can healthy effects be found in the proximity of the null hypothesis? Presumably, yes. As noted earlier, there is little or no relation between the size of the p value and that of effect magnitude. Thus, the compatibility of the two statistics may indeed be questioned, and even the most liberal researcher (or editor) may be left with conflicting statistics. He or she may therefore have to choose the one preferred, based on his or her own theoretical orientation, when drawing conclusions from data.

Criticisms of Effect Size

Although criticisms directed toward effect size measures exist, they are not severe enough to seriously question the utility of such measures. For instance, Favreau (1997) criticizes effect size measures on the grounds that they do not represent the size of difference that distinguishes each member of one group from those of another. She points out that given the effect size, we are still left unknowing whether all individuals differ moderately or substantially from those in a comparison group. Levin (1967) also notes the problem in interpreting the variation shared among variables. As an example, he notes an experiment in which the shared variance was equal to a substantial 37%, but in which, on closer inspection, one would find that over 85% of the total shared variance resulted from one of the six experimental groups. Thus, if you have one superior group, this “success” may carry over to the other groups and give the illusion of a large effect.

Although the above critique is worthwhile, these difficulties must be considered secondary for the purposes of our discussion. Although it is true that we cannot see where the differences lie simply by interpreting effect size, effect size does not purport to answer this question anyway. This is not to disregard the importance of knowing where these differences lie. Graphical techniques (e.g., linear functions as described by Loftus, 1996) would be ideal in visually interpreting where differences lie and discerning patterns of means. The use of error bars could also emphasize the degree to which differences between individual means are substantial. Loftus' approach to data analysis will be discussed later in this article.

Favreau (1997) also argues that a limitation of effect size is its dependence on how the dependent variable is operationalized. She gives the example that in a comparison of helping behavior, the difference may depend on the type of help required. More recently, Prentice and Miller (1992) have reviewed studies in which small effects were regarded as very impressive due largely to the operationalization of the variables studied. The way a variable is operationalized can have much influence on how effect size is interpreted, and hence one can say that an effect size estimate is at least partly a function of how variables are defined. However, it is difficult to pin this problem solely on effect size. It is obvious that how a variable is operationalized has much to do with what differences are found, but we cannot begin to say that this is a limitation of effect size. This is more of a methodological issue and not a statistical one. This critique brings to light a limitation in how effect sizes are applied and interpreted, rather than a problem with the statistic itself.

Another critique of effect size is given by Dooling and Danks (1975). These researchers argue that psychology, because of the nature of its experimental designs, is simply not ready to begin adequately interpreting effect size statistics. The argument centers on how levels of the independent variable are chosen. They argue that this choice is arbitrary and that different levels may lead to different effect size estimates. For instance, a researcher wanting to increase the likelihood of obtaining a significant result may choose levels that are at opposite ends of values of the independent variable. Another researcher, however, performing the same study, may choose levels that are not so diametrically opposed to one another. Dooling and Danks argue that values of estimated ω² depend on the way in which these levels are determined. Thus, while one experiment may reveal a particular effect size magnitude, another one, identical except for chosen levels of the independent variable, may reveal different effect size values. Furthermore, as noted by Dooling and Danks, most psychologists will interpret these results as resulting from a random effects model, rather than from a fixed effects model, and in doing so, will generalize effects across all levels of the independent variable rather than restrict them to levels actually tested. This possibility greatly reduces the interpretability of effect size measures in psychological research:

For such purposes of statistical inference, ω² is either useless, misleading, or both because the strength of an experimental effect will depend directly on ill-defined decisions made by the E [i.e., experimenter]. The E who chooses extreme values of his independent variable is likely to obtain strong effects (assuming, of course, that there is some relationship between independent and dependent variables). Another E might choose closely related values of the same independent variable that would in turn yield a smaller ω². Hence, a correct inference based on the ω² is impossible because the underlying population of values on the independent variable has not been randomly sampled. Since most readers of research articles are likely to draw unwarranted inferences from a reported ω², we suggest that its widespread use be discouraged. (Dooling & Danks, 1975, p. 16)

Dooling and Danks entertain the possibility of keeping ω² and changing the designs typically employed in psychological research, but claim that this is likely to be an impossible assignment since “psychology is a science that still does not know what its major independent variables are” (p. 16). Thus, changing the designs of psychology may not be possible in order to better interpret ω². As an improvement, they recommend the use of random effects models instead of fixed effects for levels of the independent variable. They continue to argue, however, that for the most part, this cannot be accomplished:

We are simply not ready to force ill-defined variables into standard molds. Nor would there be any wide agreement on standard experimental procedures in the various areas of psychology. Such efforts at standardization would eliminate many potentially interesting variables that would now be considered “extraneous” and thereby limit the scope of psychological research. Clearly, there is no realistic hope that experiments can be redesigned in order to make ω² a useful measure. (p. 16)

The criticisms offered by Dooling and Danks, although legitimate, are not enough to seriously challenge magnitude of effect measures. Choosing levels of the independent variable is more of a methodological concern and although it may have an influence on the statistics that are interpreted, this problem cannot be considered a “drawback” of effect size indicators. It is rather up to the well-trained researcher to appropriately choose the levels of the independent variable and to report these decisions in methods sections of journal articles. Likewise, it is up to the knowledgeable interpreter of such literature to realize the limitations of the study when reading that a fixed-effects model was used. In such a case, the results would only be applicable to those levels of the independent variable that were actually tested in the experiment, and could not be generalized to all possible levels.

There are many measures that can be used in estimating effect size. As already discussed, ω² and η² are common statistics for estimating magnitude of effect. However, as noted by Howell (1989), η², although relatively simple to calculate, often yields an overestimate of effect. Cohen's d, the standardized mean difference, has also been criticized as a measure of effect size (see Richardson, 1996, for a review). Other measures for estimating magnitude of effect, reviewed by Snyder and Lawson (1993), include epsilon squared and the Wherry, Herzberg, and Lord formulas. Ray and Shadish (1996) caution, however, against interchanging effect size estimates. They found that, especially in meta-analytic studies, different estimators yield different estimates and are often not equal to d. Methods for estimating effect size for various models (e.g., ANOVA) are offered by Vaughan and Corballis (1969) and, more recently, Kirk (1996). Rosenthal and Rubin (1982) also offer a method for calculating effect magnitude to determine improvements in success rates. Kirk (1996) has found the reporting of effect size measures to be quite low. In a survey of articles published in 1995, Kirk found only 12% of articles in Journal of Experimental Psychology, Learning & Memory to use a magnitude of effect measure when reporting statistical significance.
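For readers who want to see the arithmetic, the following sketch computes η² and ω² for a hypothetical one-way, fixed-effects design using the standard textbook formulas, η² = SS_between/SS_total and ω² = [SS_between − (k − 1)MS_within]/[SS_total + MS_within]; the three groups of scores are invented for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three treatment groups.
groups = [np.array([4, 5, 6, 7, 5]),
          np.array([6, 7, 8, 8, 7]),
          np.array([9, 8, 10, 9, 11])]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ss_between + ss_within
ms_within = ss_within / (n_total - k)

eta_sq = ss_between / ss_total                      # tends to overestimate (Howell, 1989)
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)

F, p = stats.f_oneway(*groups)                      # the usual NHST result, for comparison
print(f"F = {F:.2f}, p = {p:.4f}, eta^2 = {eta_sq:.3f}, omega^2 = {omega_sq:.3f}")
```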

Effect Size: Complement or Replacement to Significance Testing?

In evaluating effect size measures as an alternative to significance testing, I conclude that they are best used in conjunction with p values. They are especially useful for conducting meta-analysis of prior research when the goal is that of finding overall effects across many research studies (Schmidt, 1996). They are helpful in that instead of merely reporting whether a given result is likely due to chance, they indicate the strength of any relationship that may be present. Again, however, as noted by Chow (1996), effect size measures are limited in that they do not indicate the probability that the association between variables (however large) is due to chance factors. As Chow notes:

Effect size advocates are prepared to apply the non-statistical criterion even when it is not possible to exclude chance influences as an explanation of the results. Why is statistics used at all? Should the researcher embark on a course of action when the research outcome may actually be the result of chance influences? (p. 117)

In addition to Chow's remark, it should be noted that effect sizes are only “descriptive” statistics, and not “inferential” ones. That is, they only reveal the magnitude of effect in the sample, and offer no information about how likely it is that this estimate holds in the population. One needs NHST to reveal the latter information. Clearly, the use of either effect size measures or significance values alone is not a sufficient means of appraising data. It is difficult to choose one over the other in that crucial and useful information is lost when such a selection is made. As noted by Richardson (1996), “the important point is that measures of effect size are simply another part of the composite picture that a researcher builds when reporting data that indicate that one or more variables are helpful in understanding a particular behavior” (p. 21). Taking an even more extreme position, Snyder and Lawson (1993) conclude: “These aids [effect size measures] are merely tools to assist the researcher in gaining a more informed analysis of data. The ultimate responsibility for developing a comprehensive analysis of the meaning of results rests with the researcher” (p. 347). If forced to decide, one should side with effect size estimates over NHST. However, under the aegis of null hypothesis testing, I recommend using both significance testing and effect size when interpreting data.

II. Confidence Intervals

Confidence intervals are another means of making sense of data, and consequently, are a candidate for either replacing or complementing NHST. Many have recommended the use of confidence intervals when interpreting experimental findings (Bakan, 1966; Chandler, 1957; Cohen, 1990; Grant, 1962; Hammond, 1996; Kirk, 1996; Rozeboom, 1960; Schmidt, 1996). A confidence interval provides a medium for assessing the probability that a parameter lies within a particular range of values. As Kirk (1996) notes, providing a confidence interval requires no more information (i.e., calculations) than conducting a null hypothesis significance test. Furthermore, a confidence interval tells us everything a null hypothesis significance test does, and more. He strongly recommends the use of intervals in the advancement of our science:

A rejection [of the null hypothesis] means that the researcher is pretty sure of the direction of the difference. Is this any way to develop psychological theory? I think not. How far would physics have progressed if their researchers had focused on discovering ordinal relationships? What we want to know is the size of the difference between A and B and the error associated with our estimate; knowing that A is greater than B is not enough. (p. 754)

Kirk offers an excellent example of how a mere significance test can be greatly misleading and of how a confidence interval can instead inform us of what we are really searching for in scientific investigations. He considers a researcher who wants to know whether a medication will improve intelligence test performance for Alzheimer patients. Twelve patients are assigned to the experimental condition while 12 are assigned to the control group. Using a t-test, the researcher tests whether there is a significant difference in IQ scores between the two groups. The test reports a non-significant value, t(10) = 1.61, p = .14. Although most researchers would conclude that a reliable difference does not exist and that medication does not have a substantial effect, Kirk argues that reliance on the p value alone is very misleading, and that the medication does have a great effect. By calculating a 95% confidence interval, the researcher would have found the mean difference to lie between -6.3 and 32.3 IQ points. He would have also found that an unbiased estimate of Cohen's d is 0.86, suggesting that the difference represents a large effect. According to Kirk, it is clear the data provide good evidence for the scientific hypothesis, and hence the study should not be discarded merely because the .05 level was not reached. He argues that “the nonsignificant t-test does not mean that there is no difference between the IQs; all it means is that the researcher cannot rule out chance or sampling variability as an explanation of the observed difference” (p. 755). Kirk advises researchers to inspect the “practical” significance of results rather than continually limit data appraisal to mere p values. With Kirk I concur. Again, as has been noted many times before by many methodologists, data analysis that ends with the p value is inadequate. Additional tools are required, if not good old-fashioned experimenter judgment.

Cohen (1990) suggests building 80% confidence intervals around sample effect sizes. Such a measure would inform the researcher as to the probability that a given range includes the given effect size. For instance, an 80% confidence interval for the effect size of 0.3 would imply that we are 80% sure that the true effect size lies between say, 0.1 and 0.5. Confidence intervals represent a way of quantifying our degree of certainty (or uncertainty) with regards to the true population parameter.
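Cohen's suggestion can be operationalized in more than one way. One simple possibility, sketched below with invented data, is a percentile bootstrap around the sample d; the 80% level follows Cohen's suggestion, but the bootstrap itself is only one of several ways to construct such an interval and is not presented here as the method Cohen had in mind.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical raw scores for two groups.
experimental = rng.normal(52, 10, 40)
control = rng.normal(48, 10, 40)

def cohens_d(a, b):
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

# Percentile bootstrap: resample each group with replacement and recompute d.
boot = [cohens_d(rng.choice(experimental, len(experimental), replace=True),
                 rng.choice(control, len(control), replace=True))
        for _ in range(5000)]
lo, hi = np.percentile(boot, [10, 90])   # central 80% interval
print(f"d = {cohens_d(experimental, control):.2f}, 80% CI [{lo:.2f}, {hi:.2f}]")
```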

Loftus (1991) has also been a strong advocate of using confidence intervals in reporting results. For him, the provision of intervals constitutes a useful and necessary means of accounting for error and provides the researcher with a much more meaningful answer to the question she is really probing:

The emphasis on hypothesis testing produces a concomitant de-emphasis of an alternative technique for coping with statistical error that is simple, direct, and intuitive and that has wide acceptance in the natural sciences: the use of confidence intervals. Whereas hypothesis testing emphasizes a very narrow question (“Do the population means fail to conform to a specific pattern?”), the use of confidence intervals emphasizes a much broader question (“What are the population means?”). Knowing what the means are, of course, implies knowing whether they fail to conform to a specific pattern, although the reverse is not true. In this sense, use of confidence intervals subsumes the process of hypothesis testing. (p. 103)

More recently, Hunter (1997) has remarked on the high use of confidence intervals in the quantitative sciences (i.e., physics, chemistry, electronics), and that adoption of these measures in psychology would greatly reduce the error rates present with mere significance testing:

The older quantitative sciences, such as physics, chemistry, and electronics do not now use and have never used the significance test as it is currently used in psychology. This fact strongly suggests that there are alternatives to the significance test that are used in the other sciences. The dominant technique used in the quantitative sciences and the dominant technique used by mathematical statisticians is the technique of confidence intervals. I have studied the use of confidence intervals in other fields, and I am convinced that these fields do not suffer the high error rate that we suffer in psychology. (p. 6)

In advocating the use of confidence intervals, Hunter highlights the fact that a “95% confidence interval always has only a 5% error rate; it is not context dependent like the significance test . . . . furthermore, the various confidence intervals are all compatible with each other; there is no need for a social convention such as ‘α = 5%’” (p. 6). Many sources exist with good explanations of how to construct and use confidence intervals (for example, see Hays, 1963).
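For concreteness, the sketch below builds a 95% confidence interval for the difference between two independent group means from summary statistics alone, using the familiar pooled-variance t interval; the means, standard deviations, and sample sizes are invented for illustration and are not Kirk's data.

```python
import numpy as np
from scipy import stats

# Invented summary statistics for two independent groups.
m1, s1, n1 = 104.0, 14.0, 12     # experimental group
m2, s2, n2 = 92.0, 15.0, 12      # control group

df = n1 + n2 - 2
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
se_diff = np.sqrt(pooled_var * (1 / n1 + 1 / n2))

diff = m1 - m2
t_crit = stats.t.ppf(0.975, df)              # two-sided 95% interval
ci = (diff - t_crit * se_diff, diff + t_crit * se_diff)
print(f"Mean difference = {diff:.1f}, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}]")
```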

Evaluation of Confidence Intervals

In evaluating confidence intervals, I find them to be an indispensable tool in the appraisal of research results. They allow the researcher to estimate, with a stated degree of confidence, the range within which the true value (or difference) is likely to lie. Unlike effect sizes, I find confidence intervals better able to replace significance testing, again if we had to choose one over the other. What makes confidence intervals so appealing is that they can both provide a hypothesis test and produce an estimate of the parameter of interest (Becker, 1991). This is true despite the fact that the hypothesis test and confidence interval are still regarded as separate steps in the process of data appraisal (Serlin, 1993). As will be discussed further, confidence intervals work extremely well with the “good-enough” model proposed by Serlin & Lapsley (1985). The use of confidence intervals is summarized quite well by Serlin (1993): “we use them, in the same way an astronomer would use a telescope, to determine if the theoretical prediction, fortified by a good-enough principle belt, obtains” (p. 352). What this essentially amounts to is using a “range null hypothesis” rather than a “point-null hypothesis,” the range represented by the “belt” that surrounds the obtained estimate. The “good-enough principle” will be described in more detail later in this article. For now it is enough to recognize from the preceding discussion that confidence intervals constitute an essential tool in analyzing social science data. In conjunction with hypothesis-testing (which, as noted, is really one process), they offer much and are highly recommended for use in the behavioral sciences. On methodological and philosophical grounds, confidence intervals can easily replace significance tests. Convincing the research community of this is another story.

III. Interval Estimation as Significance Test

Expanding on the preceding discussion, an alternative to merely rejecting null hypotheses is to provide an interval estimation of the desired parameter, and treat it as a significance test. Loftus (1996) recommends this technique as a replacement to hypothesis-testing. But how exactly is this done? According to Loftus, the null hypothesis could be rejected if the confidence intervals suggest there to be an effect in the data. He uses the example of a psychologist comparing his newly discovered treatment to a traditional treatment, hypothesizing that his treatment is just as effective as the traditional one. The psychologist tests this hypothesis by comparing scores from his treatment against scores obtained on the traditional treatment. A simple t-test shows the difference to be non-significant, thus suggesting both treatments are equally effective. However, Loftus argues that this report can be grossly misleading if confidence intervals are not provided. As shown in Figure 1, high power is indicated by smaller error bars, while low power is indicated by larger error bars.

Figure 1: Graphical representation of statistical power. Smaller error bars suggest higher power. (Reproduced from Loftus, 1996)

Given two treatments on some phenomenon of interest (e.g., phobia), Loftus' claim is that the top representation gives us reason to reject the null hypothesis because of the large degree of power, while the lower representation reflects little power and hence the null should not be rejected based on these data alone. The lower panel suggests that the means are more likely to vary than are the means in the top panel. One advantage of Loftus' approach is the ease with which results can be interpreted. Results are readily apparent from visual inspection of the two graphs, and tentative conclusions regarding the data can be made immediately.
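One way to reduce this idea to code is to check whether the confidence interval for the difference excludes zero (equivalently, whether the error bars around the two means are far enough apart). A minimal sketch with invented scores follows; the function name, the 95% level, and the data are choices made for the illustration, not part of Loftus' presentation.

```python
import numpy as np
from scipy import stats

def interval_test(a, b, conf=0.95):
    """Reject the point null of no difference iff the (conf) CI excludes zero."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a.mean() - b.mean()
    df = len(a) + len(b) - 2
    sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
    se = np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
    half = stats.t.ppf(0.5 + conf / 2, df) * se
    lo, hi = diff - half, diff + half
    return (lo, hi), not (lo <= 0.0 <= hi)

rng = np.random.default_rng(3)
new_treatment = rng.normal(30, 8, 20)       # hypothetical phobia scores
traditional = rng.normal(27, 8, 20)
(lo, hi), reject = interval_test(new_treatment, traditional)
print(f"95% CI for the difference: [{lo:.1f}, {hi:.1f}]  reject H0: {reject}")
```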

IV. Plot-Plus-Error-Bar Procedure (PPE Procedure)

Another alternative to traditional NHST, advocated by Loftus (1993, 1996) is called the “plot-plus-error-bar” approach to data analysis. In general, Loftus (1996) finds present-day reporting of F ratios and p values cumbersome and confusing. He argues:

Most people find such tabular-cum-text data presentation difficult to assimilate. . . . Decades of cognitive research, plus millennia of common sense, teach us that the human mind is not designed to integrate information that is presented in this form. There is too much of it, and it cannot be processed in parallel. (p. 166)

Loftus refers to the data reproduced in Table 1. Rather than presenting data in this “confusing” way, Loftus recommends the use of graphical techniques. He argues that such techniques convey the same information, but in a clearer and more concise fashion. The data in Table 1 represent results obtained from an experiment investigating memory for visually presented material.

Table 1: “Tabular” memory performance data. (Reproduced from Loftus, 1996)

Loftus argues that such information is difficult to interpret and, at worst, causes undue confusion. He suggests instead presenting such information as depicted in Figure 2. Plotting confidence intervals around each group mean allows the interpreter to assess the degree of statistical power present in the investigation. Loftus argues that by visual inspection alone, the reader can obtain as much information from Figure 2 as he could from Table 1 – the only difference being that the graphical representation of results is much easier to interpret. Taking an extreme position, he holds that “creative use of such procedures allows one to jettison NHST entirely” (Loftus, 1996).

Figure 2: Display of memory data using error bars and showing linear trends across mean differences. (Reproduced from Loftus, 1996)
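In code, a bare-bones version of this kind of display might look like the sketch below: group means plotted with 95% confidence-interval error bars, so that both the pattern of means and the precision (and hence power) of the study are visible at a glance. The exposure durations and scores are invented, and the plot is only a rough stand-in for the general approach, not a reproduction of Loftus' Figure 2.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(11)
exposure_ms = [50, 100, 200, 400]                 # hypothetical exposure durations
samples = [rng.normal(0.4 + 0.1 * i, 0.12, 20) for i in range(len(exposure_ms))]

means = [s.mean() for s in samples]
# Half-width of a 95% CI for each condition mean.
half_widths = [stats.t.ppf(0.975, len(s) - 1) * s.std(ddof=1) / np.sqrt(len(s))
               for s in samples]

plt.errorbar(exposure_ms, means, yerr=half_widths, fmt="o-", capsize=4)
plt.xlabel("Exposure duration (ms, hypothetical)")
plt.ylabel("Proportion recalled")
plt.title("Plot-plus-error-bar display (sketch)")
plt.show()
```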

Although Loftus' approach offers much in the way of reporting results with maximal clarity, his model has been criticized heavily as a way of replacing NHST. Specifically, Morrison & Weaver (1995) argue that although the PPE procedure is very useful in attaining an advanced understanding of a data set, it cannot and should not totally supplant standard hypothesis testing procedures. Firstly, they argue that in Loftus' example, it is unclear how he obtained the error bars, since his design is a completely within-subjects design. They charge Loftus with “pooling” error estimates (i.e., error bars) across all means drawn in the linear function (see Figure 2). Loftus' (1995) reply to this criticism is that since the error terms are sufficiently similar, they can be pooled, and little information is lost in doing so. He argues that should the error terms be dissimilar, this would present an opportunity to question why the error terms differ across levels of various factors. Therefore, contrary to Morrison and Weaver (1995), Loftus holds that using the same error term is acceptable, unless error terms are vastly different, in which case this difference could be studied in itself as a meaningful source of variation.

Morrison and Weaver (1995) also charge Loftus with “selecting results to demonstrate his points” (p. 54). They argue that the linear functions plotted in Figure 2 may be interpreted differently by different researchers. They claim that it is entirely possible to believe that while the mean increases, so do the variances (i.e., the error bars become larger). By merely viewing the functions, one could question whether the variances of the means would eventually become so large as to overlap, thus making the linear functions “parallel” for larger means. In response to this criticism, Loftus (1995) admits the potential “ambiguity” of this problem, but claims that the critics' implication that an ANOVA would help clarify this ambiguity is false. Loftus notes:

It is the lot of many data sets reported in the social sciences to be similarly ambiguous. Morrison and Weaver seem to imply, however, that a simple ANOVA would clear up this ambiguity, since the ANOVA would cleanly result in a “reject” or a “don't reject” decision. (p. 58)

This last criticism by Morrison and Weaver and reply by Loftus highlights perhaps the greatest misuse of NHST – the tendency to “decide” based on some statistical model. In the above, I would side with Loftus' reasoning that although results may be ambiguous, this does not call for some artificial means of making sense of the data. We may have to accept the ambiguity of our data if we are to represent them in an honest way. To impose some statistical decision where we “fool” ourselves into thinking we have clarity where there is none, is at best, an exercise in triviality. Too often, researchers deceive themselves into a “yes-no” answer when using NHST. Indeed, it is foolish to think of NHST as being able to provide this degree of information regarding a hypothesis. Rather, the function of NHST is to provide the experimenter with the probability of the data, given a true null hypothesis. It is then up to him/her to make a decision based on the data, and not the other way around.

Overall, Loftus' graphical techniques represent a welcome shift in data appraisal. His PPE procedure is to be applauded for improving the clarity of analysis. However, it is doubtful that Loftus is completely “jettisoning” NHST by using his PPE procedure. After all, do error bars distant from one another not suggest rejecting the null hypothesis? It would appear that although presented in a much clearer fashion, the concept of NHST may still be present in Loftus' model – only instead of F-ratios, Loftus uses linear functions.

V. Power Analysis

Jacob Cohen, the spokesman for power analysis, first recommended the procedure for psychology in 1962, and later provided an extensive handbook on the subject (Cohen, 1969). Although he wasn't the first to advocate power (Neyman & Pearson, 1928, originated the concept), he nonetheless can be considered a pioneer in its development. Cohen provided a means for calculating power that was relatively easy, mathematically non-threatening, and very comprehensive. In doing this, he addressed the concerns of both Neyman and Pearson, and also touched on Fisher's “sensitivity” with regard to detecting an effect.

Cohen's pioneering work originated from his own perception that power analysis was largely neglected in psychological research. In a survey of articles published in Journal of Abnormal and Social Psychology in the year 1960, Cohen (1962) found the mean power to detect medium effect sizes to be 0.48. He concluded that the chance of obtaining a significant result was about that of tossing a head with a fair coin. Hoping to correct for this neglect in statistical power, Cohen would publish his most popular work, Statistical power analysis for the behavioral sciences (1969).

What is Power?

The statistical power of a significance test is the long-term probability, given the population effect size, alpha, and sample size, of rejecting the null hypothesis, given that the null hypothesis is false. The influence of sample size on significance has long been understood. For instance, Hay (1953) noted that “sometimes the characteristics of a sample are such that no amount of treatment will bring about a satisfactory result” (p. 445). The present discussion is a very brief explanation of the logic of power analysis. For an extensive review, see Cohen's handbook. For a brief overview and very “user-friendly” publication, see his more recent article, “A Power Primer” (1992).
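As a concrete illustration of the definition, the sketch below computes the power of a two-sided, equal-n, two-sample t test directly from an assumed population effect size d, alpha, and per-group sample size, using the noncentral t distribution; the input values are arbitrary. (With d = 0.50 and alpha = .05, roughly 64 subjects per group yield power of about .80, which matches the figure commonly cited from Cohen's tables.)

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-n, two-sample t test for population effect d."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |t| exceeds the critical value under the noncentral t.
    return (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

for n in (10, 30, 64, 100):
    print(f"n per group = {n:3d}  power for d = 0.50: {power_two_sample_t(0.5, n):.2f}")
```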

Chandler (1957) explains through the following example, that without the consideration of power, one chances having a very inflated β value:

Although texts in psychological statistics do not seem to place a great deal of emphasis upon the power of a test, power is the basic concept responsible for one's employing statistical tests as a basis for taking action on an H. If this were not so, to test an H at the 5% level of significance, one could simply draw from a box of 100 beads – 95 white and 5 red – a bead at random and adopt the convention that he would reject the H whenever a red bead appeared. With such a test, one can readily see that not only α but also 1-β always equals .05, or β = .95. It is this large value of β that precludes one's employing the bead-box test. (p. 430)

Another technique, also related to sample size, has been recommended by some authors (e.g., Snyder and Lawson, 1993). The technique is called the critical n, and although similar to power analysis, it is much simpler to interpret. It is somewhat of a “watered-down” version of Cohen's power analysis. In using the technique, the researcher makes an estimate of effect size based on the critical value of the test statistic. Results with low critical n's represent larger effect sizes than results with higher critical n's. Thus, a test achieving significance with a low number of subjects would automatically suggest a prominent effect, compared to a test barely achieving significance with many subjects. However, as argued earlier, the relation between a significant p value and effect size is not clear, and hence drawing conclusions regarding magnitude of effect from significance values is risky. Power analysis provides a much more accurate estimate of these parameters and it should be used in its place.

Despite Cohen's efforts, power analysis has historically been given minimal attention by psychological researchers. Only a few textbooks have discussed sample size in its relation to statistical power (e.g., Glass & Stanley, 1970; Howell, 1989; Kirk, 1982; Myers, 1979). Furthermore, as noted by Olejnik (1984), many of these sources restrict the discussion to one-way ANOVA designs. Reviews by Sedlmeier and Gigerenzer (1989) report little increase in power since Cohen first recommended it for the social and life sciences. Out of 54 articles reviewed, only two mentioned power, and none actually calculated it. Furthermore, Sedlmeier and Gigerenzer found the median power in 1960, 0.46 for a medium size effect, to have dropped to 0.37 in 1984. Only three power surveys have been conducted specifically for psychology since Cohen (1962). In addition to the study by Sedlmeier and Gigerenzer (1989), a survey by Chase and Chase (1976) found mean power to be 0.66 for medium effects, considerably higher than that found by Cohen and that found by Sedlmeier and Gigerenzer (1989). A more recent study by Rossi (1990) suggests power to be higher than the results in Cohen's first survey. In a survey of three journals, Journal of Abnormal Psychology, Journal of Consulting and Clinical Psychology, and Journal of Personality and Social Psychology, all selected from the year 1982, average power was found to be 0.17 for small effects, 0.57 for medium effects, and 0.83 for large effects. These results suggest an increase in power since Cohen's 1962 survey. However, as noted by Rossi (1990), these results are in no way reason for great optimism:

The general character of the statistical power of psychological research remains the same today as it was then: power to detect small effects continues to be poor, power to detect medium effects continues to be marginal, and power to detect large effects continues to be adequate. (p. 650)

Thus, power estimates continue to be on average, low for psychological research studies, as well as research in other fields. Research scientists continue to pay little attention to sample size when conducting experiments. As remarked by Tversky and Kahneman (1971), this makes for sloppy research practices: “Cohen's analysis shows that the statistical power of many psychological studies is ridiculously low. This is a self-defeating practice; it makes for frustrated scientists and inefficient research” (p. 107).

Why the Lack of Power Analysis?

Attempts to discern why social scientists, particularly psychologists, continue to ignore the calculation of power have as yet produced no definite answer. There seems to be hardly any “rational” reason why the technique is ignored to the extent that it is. As Cohen (1977) has remarked, likely in frustration, “since statistical significance is so earnestly sought and devoutly wished for by the behavioral scientists, one would think that the a priori probability of its accomplishment would be routinely determined and well understood” (p. 1).

Despite Cohen's push for power, it remains poorly understood by today's investigators, and is seldom reported in psychological studies. If power analysis is such a wonderful tool, then why is it not used? I suggest three potential reasons as to why power analysis is practically ignored by today's community of psychological researchers: 1) ignorance, 2) laziness, and 3) lack of responsibility for proper research.

In considering the first reason, that of ignorance, I suggest that few researchers are aware of the basic necessity of conducting power-analytic procedures, given the use of significance testing. In other words, few psychologists may be truly aware of the “meaningfulness” 5 of their studies if a power analysis is not performed to ascertain a good probability of rejecting a false null hypothesis. Because it is well known that many psychologists misuse and misinterpret the statistical tools they use daily (Brewer, 1985), it would surely be of no surprise to find that the methodological requirement of power is generally poorly understood as well. Rossi (1990) notes, “It is probably not an exaggeration to assert that most researchers know little more about statistical power than its definition, even though a routine consideration of power has several beneficial consequences” (p. 646).

Or, perhaps the mathematical calculations in power analysis (as elementary as they are) prove intimidating to psychologists, who, on average, have extremely little mathematical training compared to members of the natural sciences. As recently argued by Estes (1997), the community of psychology students and researchers is not adequately prepared or trained in understanding the statistical tools they use. The implication here is that better-trained researchers (mathematically) might be more willing to investigate power analysis. However, gaining even a moderate understanding of power analysis does not require extensive mathematical ability. Also, many computer software programs exist that will perform the math and determine sample size for the researcher. Goldstein (1989) provides an excellent review of available computer programs used for power analysis, and his paper is highly recommended for those seeking a computer package.

The second possible reason for the lack of concern for power is that of laziness. Of the few researchers who are aware of the theoretical necessity of power, few may be prepared to bother with the technique. After all, any new calculations are time consuming and carry an “opportunity cost” relative to other activities that may occupy the researcher, such as summarizing results of other “power-deficient” studies. Furthermore, since editors (or the APA) do not require power analysis, many researchers probably do not consider it necessary to calculate. Why make life more complicated when editors do not require it to be?

The third reason why power is generally ignored follows from the second. Researchers may be unwilling to bother with the calculation of power, even if they know it to be an important and necessary inclusion in any research report. However, as mentioned, since journal editors do not require its inclusion, many do not bother with its calculation. Knowing that power should be calculated, but refusing to do so because one will not be penalized for it, is what I refer to as a lack of responsibility for proper research.

From the above discussion, one important consideration remains. If knowing the power of an experiment and calculating it appropriately can only increase the likelihood of publication, then why would researchers not rejoice in a method of increasing the chance of reaching the revered .05 level of significance if in fact the tested null is false? As Rossi (1990) correctly put it:

Knowledge of power of a statistical test indicates the likelihood of obtaining a statistically significant result. Presumably, most researchers would not want to conduct an investigation of low statistical power. The time, effort, and resources required to conduct research are usually sufficiently great that a reasonable chance of obtaining a successful outcome is at least implicitly desirable. Thus, if a priori power estimates are low, the researcher might elect either to increase power or to abandon the proposed research altogether if the costs of increasing power are too high, or if the costs of conducting research of low power cannot be justified. (p. 646)

The truth of the matter, however, as pointed out by Schmidt (1996), is that if every researcher were conscientious about power, many studies would not be conducted at all. The reason for this is the high number of subjects needed to have an even moderately powerful test. Schmidt notes correctly that even with power at 0.80, this still allows a 20% Type II error rate when the null hypothesis is false. In order to obtain power this high, the number of subjects needed is often on the order of 1,000 or more (Schmidt & Hunter, 1978). That in itself could account for why most researchers ignore power; they know just enough about it to know they never achieve adequate power, so why diminish the credibility of their research studies with calculations pointing out their weaknesses? Although power may be considered as something of a life-saver if NHST is retained as the dominant model of statistical inference, the difficulty in achieving reasonable power may be problematic, and thus constitutes another problem directed at the NHST model. Conclusion: power does not save NHST; it only makes it more bearable.

VI. Multiple Models

Another alternative to traditional NHST is that of specifying how well the data fit various models. This approach, advocated by Edwards (1965) and later discussed by Wilson, Miller, and Lower (1967), is one in which the data are not automatically assumed to derive from the null distribution, but are left open for any model that best accounts for the data. As Wilson et al. argue, traditional classical statistics usually limit the specification of models upon considering the data:

Multiple models seem thoroughly desirable, and it seems worthwhile to note that in a traditional two-tailed test, classical statistics always implies three families of models: one predicting no difference, one predicting a positive difference, and one predicting a negative difference. (p. 196)

Comparing models to discern which best accounts for the data would presumably avoid the need to “reject” a null hypothesis, or at least minimize its importance. The null hypothesis, which merely represents the model under which the data are distributed according to chance, would indeed be rejected if a more plausible model were found to better account for the data. As Edwards (1965) illustrates through an example, comparing models is a necessary prerequisite to determining the plausibility of the null hypothesis:

A man from Mars, asked whether or not your suit fits you, would have trouble answering. He could notice the discrepancies between its measurements and yours, and might answer no; he could notice that you did not trip over it, and might answer yes. But give him two suits and ask him which fits you better, and his task starts to make sense. (p. 402)

Dunnette (1966), in addition to defaming NHST, has recommended that psychology advance by testing not one grand theory, but rather by testing a multitude of hypotheses:

The approach entails devising multiple hypotheses to explain observed phenomena, devising crucial experiments each of which may exclude or disprove one or more of the hypotheses, and continuing with the retained hypotheses to develop further subtests or sequential hypotheses to refine the possibilities that remain . . . . However, in psychology, the approach is little used, for, as we have said, the commitments are more often to a theory than to the process of finding out. (p. 350)

More recently, Dixon and O'Reilly (1999) have recommended that psychologists use the maximum likelihood ratio in comparing two or more models to the obtained data. They reason that since most investigations discuss plausible alternative hypotheses that could account for their findings, these models might as well be included in the main analysis. They argue that as an alternative to null testing, a comparison of alternative hypotheses does not simply reject a “meaningless” null hypothesis, but rather compares how the data fit any one of the posited alternatives by means of the maximum likelihood ratio. According to the authors, a likelihood ratio of greater than 10 constitutes substantial evidence for siding with one model over another. They stress, however, that decisions regarding the importance of the research should not be limited solely to the value of the likelihood ratio. In effect, these researchers are arguing for a philosophy of inference similar to that of Edwards and Dunnette, but at the same time are recommending the maximum likelihood ratio as a reputable means of comparing alternative hypotheses.
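A rough sketch of the model-comparison idea follows; it is not Dixon and O'Reilly's exact procedure, only an illustration of the general logic. Two candidate accounts of the same invented data are fit by maximum likelihood, a "no difference" model with one common mean and a "difference" model with separate group means (both assuming normal error with a common standard deviation), and the ratio of the maximized likelihoods is compared against the ">10 is substantial" rule of thumb mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 3.0, 25)   # hypothetical scores
group_b = rng.normal(12.0, 3.0, 25)
pooled = np.concatenate([group_a, group_b])

def max_log_lik(samples_and_means):
    """Gaussian log-likelihood with each sample evaluated at its own mean,
    using a common maximum-likelihood standard deviation."""
    resid = np.concatenate([s - m for s, m in samples_and_means])
    sigma = resid.std()                       # ML estimate (ddof=0)
    return stats.norm.logpdf(resid, 0.0, sigma).sum()

# Model 1: one common mean.  Model 2: separate means for the two groups.
ll_one_mean = max_log_lik([(pooled, pooled.mean())])
ll_two_means = max_log_lik([(group_a, group_a.mean()), (group_b, group_b.mean())])

likelihood_ratio = np.exp(ll_two_means - ll_one_mean)
print(f"Likelihood ratio (two means vs. one): {likelihood_ratio:.1f}")
print("Substantial evidence for the two-mean model" if likelihood_ratio > 10
      else "Evidence is not substantial by the >10 rule of thumb")
```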

Such recommendations by the preceding authors make much sense. Although the nature of the problem discussed is not merely a statistical one, it may be precipitated by psychologists' obsession with stating one alternative instead of trying to specify several substantive hypotheses that could just as well account for the data. Customarily, researchers are interested in confirming the hypothesis of interest, and not in comparing that hypothesis against rival hypotheses within the same investigation. As Dunnette has noted, we are not so much interested in “finding out” the answer as we are in “finding support” for a given theory.

VII. The Good-Enough Principle

Serlin and Lapsley (1985), in response to psychology's methodological problems highlighted by Meehl (1967, 1978), have proposed the “good-enough principle.” Their model adheres to Popperian falsificationism, and does not always reject the null hypothesis given a large enough sample size. Because their proposal well deserves proper space in an article on alternatives to NHST, its details now follow.

According to Popperian falsificationism, scientists, whether physical or social, should specify in advance what they will accept as a falsifying instance for a theory under test. Implied in this is that the scientist will also predict what data, or range of data, will be accepted as support for a given theoretical position. The latter is what Serlin & Lapsley refer to as the “good-enough principle.” The principle requires that scientists predict in advance what results will count as “good enough” for support of a theory. The main tenet of the model rests on the assumption that when specifying a hypothesis under test, the hypothesized “true value” can never be exactly true in theory, but must rather have some margin (or width) to vary. Thus, the predicted value becomes a value plus a given width around that value, much in the sense that a confidence interval will specify a “margin of error.” Once the results of the experiment are available, a statistical test is used to determine if the observed value is in the range of the “good-enough belt,” that is, the area around the predicted value that was formerly specified as “good enough” to act as confirming evidence for the theory. Following this procedure, the problem of sample-size sensitivity is eliminated: as precision increases (i.e., as sample size increases and the estimate narrows), the imprecision involved in estimating the population value is reduced, and one would fail to find theoretical support for a theory only if the estimate falls outside the previously specified “good enough” range. Hence, as Serlin & Lapsley argue, “even with an infinite sample size, the point-null hypothesis, fortified with a good-enough belt, is not always false” (p. 79).

As noted by these authors, this principle holds just as well for what occurs in physics as for what transpires in psychology, in that “nature is just as unkind to physicists as it is to psychologists” (pp. 79-80). This is to say that even with the most precise of experiments (i.e., of the kind conducted in physics), the obtained value will never be exactly identical to the predicted value. There must always be a “belt” specifying the margin within which the data will still be considered to support the theory. Serlin and Lapsley argue that it is only through the use of such a “belt” that a precise experiment (i.e., one that predicts a point-value) can yield important information, especially considering that the most probable outcome is that the observed value will not equal the predicted or theoretical value.

The theoretical logic behind the principle is best appreciated when considered for directional hypotheses in disciplines such as psychology. As argued by Serlin and Lapsley, “under the aegis of the good-enough principle, one may not merely predict a direction. One also must specify in advance the magnitude of the change in that direction that is good enough” (p. 80). Good enough for what? Good enough to still count as confirming evidence for the theory. They continue: “If the statistical test indicates a possible increase less than that which is specified as good enough, the directional null hypothesis is retained. With infinite precision, one does not always reject the directional null hypothesis, and this is especially advantageous when the result is in the correct direction but only infinitesimally so” (p. 80). The authors also argue that specifying the magnitude of change accords well with Lakatosian principles of scientific methodology, in that a theory that specifies the degree of change in a given direction is better than one that specifies merely the direction (as in traditional directional null hypothesis testing). Serlin and Lapsley (1985) provide a good example of how to use a statistical test in conjunction with the good-enough principle.
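In this directional form, the null hypothesis is not “no increase” but “an increase no larger than the pre-specified good-enough amount.” The sketch below (Python, assuming a pooled two-sample t statistic, an invented minimum increase of 5 units, and simulated data; none of these choices come from Serlin and Lapsley) shows one way such a test might be carried out.

    # Minimal sketch of a directional good-enough test for a mean increase.
    # H0: (mu_t - mu_c) <= delta  versus  H1: (mu_t - mu_c) > delta,
    # so the null is rejected only if the increase credibly exceeds the
    # pre-specified good-enough magnitude, not merely zero.
    import numpy as np
    from scipy import stats

    def directional_good_enough(treatment, control, delta, alpha=0.05):
        n_t, n_c = len(treatment), len(control)
        diff = np.mean(treatment) - np.mean(control)
        # Pooled-variance standard error of the mean difference.
        sp2 = ((n_t - 1) * np.var(treatment, ddof=1)
               + (n_c - 1) * np.var(control, ddof=1)) / (n_t + n_c - 2)
        se = np.sqrt(sp2 * (1 / n_t + 1 / n_c))
        t_stat = (diff - delta) / se
        p_one_sided = stats.t.sf(t_stat, n_t + n_c - 2)
        return diff, t_stat, p_one_sided, p_one_sided < alpha

    # Example: the theory requires an increase of at least 5 units to count.
    rng = np.random.default_rng(2)
    treat = rng.normal(loc=52, scale=10, size=60)
    ctrl = rng.normal(loc=50, scale=10, size=60)
    d, t, p, reject = directional_good_enough(treat, ctrl, delta=5)
    print(f"increase = {d:.2f}, t = {t:.2f}, p = {p:.3f}, good enough = {reject}")

With these illustrative numbers, the small observed increase, though in the predicted direction, leads to retention of the directional null, which is exactly the behavior the principle is designed to produce.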

Is the Good-Enough Principle Good Enough?

The good-enough principle is excellent and has much to offer in terms of how we interpret social science data. More akin to confidence intervals than to traditional hypothesis testing, it allows for an estimate of a mean (or mean difference) while specifying in advance which results will count as support should the observed value not match the precise theoretical prediction. This approach is advantageous over traditional NHST in that it specifies the range of values that will count toward support for a theory rather than merely rejecting chance as an explanation for the data. If there is a disadvantage to the good-enough philosophy, it is that how to specify the “good-enough” region remains unclear, a problem similar to that of specifying the alpha level in hypothesis testing. According to Serlin (1993), the width of the belt is determined by “the state of the art of the theory and by the strengths of the auxiliary theories on which the prediction is based” (p. 351). In other words, if the theory is well established and successful, the belt would tend to be narrower because of our prior “confidence” in the theory; we would be prepared to “zero in” on the phenomenon. Conversely, if we are testing a loose and unestablished theory, we would tend to specify a wider belt. What is key to the good-enough test is that the magnitude of effect must be specified before the procedure is run. Expected effect magnitudes differ from theory to theory and are determined by what the researcher would accept in light of the theory's past successes and failures. The idea is reminiscent of specifying a prior probability in a Bayesian analysis. As emphasized by Serlin (1993), the goal of psychological research is to estimate parameters of the population, not merely to reject chance:

Simply concluding that the population results seem to be or seem not to be good enough does not do us justice. Rather, if theory is supported, we would like to know how big the effect really is; if theory is not supported, we want to know how close we came. In either case, the magnitude of the effect can be used to gauge theoretical progress. It is for these reasons that most of the critics of significance testing suggest the use of confidence intervals as a way of improving the scientific utility of statistical methodology. (p. 352)

Evaluation

The good-enough principle shines over NHST because it deals effectively with large samples: increasing the precision of the estimate does not inflate the rate at which the chance hypothesis is rejected. The success of this model owes much to the fact that it is similar in spirit to confidence intervals, in that point estimation is the goal, not merely the rejection of an unlikely theoretical distribution. The good-enough model represents a considerable advance over null testing and can be considered a complete substitute for the NHST model.
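The large-sample contrast can be illustrated with a small simulation. In the sketch below (Python), the sample sizes, the trivial true effect of 0.2 units, and the good-enough margin of 2 units are assumed purely for illustration; they are not drawn from any study reviewed here.

    # Illustrative simulation of the large-sample behavior discussed above:
    # a trivially small true effect becomes "significant" under the nil null,
    # but does not clear a pre-specified good-enough margin.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 100_000                          # very large sample per group
    treat = rng.normal(50.2, 10, n)      # true increase of only 0.2 units
    ctrl = rng.normal(50.0, 10, n)

    # Traditional NHST: null hypothesis of exactly zero difference.
    t_nil, p_nil = stats.ttest_ind(treat, ctrl)

    # Good-enough version: require the increase to credibly exceed 2 units.
    delta = 2.0
    diff = treat.mean() - ctrl.mean()
    se = np.sqrt(treat.var(ddof=1) / n + ctrl.var(ddof=1) / n)
    p_good_enough = stats.t.sf((diff - delta) / se, 2 * n - 2)

    print(f"nil-null p = {p_nil:.4f}")             # typically below .05 despite a trivial effect
    print(f"good-enough p = {p_good_enough:.4f}")  # typically far above .05

The nil-null test rewards sheer sample size, whereas the good-enough test withholds support because the effect, however reliably detected, is smaller than the magnitude the theory declared meaningful in advance.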

Alternatives Within the NHST Paradigm

Some authors recommend keeping NHST but changing the way in which we report findings and use the procedure. I will now briefly discuss recommendations based on “modifications” to NHST rather than complete abandonment of the statistical model. Rather than specifying alternatives to NHST, the following recommendations attempt to rescue NHST, at least partially, by sharpening the way the model is used. The question, then, is whether these recommendations salvage NHST to any degree, as power analysis was earlier concluded to do, or whether they are futile attempts to save a doomed model. Two recommendations for improvement are briefly discussed below.

Among the popular recommendations is that the exact significance level be reported in results sections rather than merely whether p falls above or below a given cutoff. As Morrison and Henkel (1970) argue, this practice fosters the view that one should report the data themselves rather than imply conclusions based solely on them. Implicit in their argument is that too many decisions about experiments (and perhaps even theories) are reached simply on the basis of whether p is greater or less than the alpha level. Indeed, reporting the specific probability at which the null is rejected has been recommended elsewhere (e.g., APA, 1994). Although it is wise to report exact p values, doing so is hardly enough to remedy the problems of NHST. It is therefore concluded that, although a favourable amendment, including exact p values in results sections is like topping off a poorly baked cake with high-quality frosting.
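As a small illustration of the reporting practice (the groups, data, and degrees of freedom below are invented for the example), one would report the obtained probability itself rather than only the verdict at the .05 level.

    # Minimal sketch: report the exact p value, not merely "p < .05".
    # Data and group labels are invented for illustration.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    group_a = rng.normal(50, 10, 40)
    group_b = rng.normal(56, 10, 40)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)

    # Report the obtained probability itself, alongside the test statistic.
    print(f"t(78) = {t_stat:.2f}, p = {p_value:.4f}")   # e.g., p = .0082, not just "p < .05"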

A second recommendation to “save” null hypothesis significance testing is to restructure the vocabulary used in reporting results, specifically the use of the term “significance.” As argued by some (e.g., Carver, 1993; Scarr, 1997; Thompson, 1996), the term “reliability” should be used in place of “significance” to eliminate the confusion that arises when “significant” results are taken to imply importance. Although some researchers may confuse the significance of a result with its importance, the well-educated and conscientious social scientist would not make such an error. There are far more urgent problems with our statistical procedures than a mere semantic difficulty. I therefore declare the “significance” terminology debate to be a non-issue.

Conclusion and Comment

The primary goal of this article was to provide a fairly extensive review of alternatives that could be used in place of traditional null hypothesis significance testing. While I hope this goal was achieved, I hope even more that an additional goal was realized: raising awareness of just how many good, feasible alternatives to NHST exist. Given these attractive alternatives, one could easily write a book exploring why the psychological and social science community as a whole has been resistant to altering its practices with respect to statistical inference.

In concluding this article, I will not do what I had originally intended at the outset, namely, recommend a single alternative to NHST. To make such a recommendation would imply that all social science data should be analyzed and evaluated in the same way, whether coming from social, cognitive, or general experimental domains. What I can recommend, based on my review, is that NHST as we know it be either (1) abandoned or (2) complemented by one or more of the procedures outlined above. I would venture so far as to say that NHST is not even required for most studies; assuming the regular inclusion of confidence intervals, one or more of the alternatives listed above would more than satisfy our expectations of what a data-analytic tool should provide.

References

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437.

Becker, G. (1991). Alternative methods of reporting research results. American Psychologist, 46, 654-655.

Brewer, J. K. (1985). Behavioral statistics textbooks: Source of myths and misconceptions? Journal of Educational Statistics, 10, 252-268.

Carver, R. (1993). The case against statistical significance testing, revisited. Journal of Experimental Education, 61, 287-292.

Chandler, R. E. (1957). The statistical concepts of confidence and significance. Psychological Bulletin, 54, 429-430.

Chase, L. J., & Chase, R. B. (1976). A statistical power analysis of applied psychological research. Journal of Applied Psychology, 61, 234-237.

Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage Publications.

Cochrane, R., & Duffy, J. (1974). Psychology and scientific method. Bulletin of the British Psychological Society, 27, 117-121.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426-443.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (rev. ed.). New York: Academic Press.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Dixon, P., & O'Reilly, T. (1999). Scientific versus statistical inference. Canadian Journal of Experimental Psychology, 53, 133-149.

Dooling, D. J., & Danks, J. H. (1975). Going beyond tests of significance: Is psychology ready? Bulletin of the Psychonomic Society, 5, 15-17.

Dunnette, M. D. (1966). Fads, fashions, and folderol in psychology. American Psychologist, 21, 343-352.

Edwards, W. (1965). Tactical note on the relation between scientific and statistical hypotheses. Psychological Bulletin, 63, 400-402.

Estes, W. K. (1997). Significance testing in psychological research: Some persisting issues. Psychological Science, 8, 18-20.

Favreau, O. E. (1997). Sex and gender comparisons: Does null hypothesis testing create a false dichotomy? Feminism & Psychology, 7, 63-81.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. New York: Chapman & Hall.

Glass, G. V., & Stanley, J. C. (1970). Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.

Goldstein, R. (1989). Power and sample size via MS/PC-DOS computers. The American Statistician, 43, 253-260.

Grant, D. A. (1962). Testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychological Review, 69, 54-61.

Haase, R. F., Waechter, D. M., & Solomon, G. S. (1982). How significant is a significant difference? Average effect size of research in counseling psychology. Journal of Counseling Psychology, 29, 58-65.

Hammond, G. (1996). The objections to null hypothesis testing as a means of analyzing psychological data. Australian Journal of Psychology, 48, 104-106.

Hay, E. N. (1953). A note on small samples. The Journal of Applied Psychology, 37, 445.

Hays, W. L. (1963). Statistics for psychologists. New York: Holt, Rinehart & Winston.

Howell, D. C. (1989). Fundamental statistics for the behavioral sciences (2nd ed.). Boston: PWS-Kent Publishing Company.

Huberty, C. (1987). On statistical testing. Educational Researcher, 16, 4-9.

Huberty, C. (1993). Historical origins of statistical testing practices: The treatment of Fisher versus Neyman-Pearson views in textbooks. Journal of Experimental Education, 61, 317-333.

Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.

Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart & Winston.

Kirk, R. E. (1982). Experimental design: Procedures for the behavioral sciences (2nd ed.). Belmont, CA: Brooks/Cole.

Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Levin, J. R. (1967). Misinterpreting the significance of “explained variation.” American Psychologist, 22, 675-676.

Loftus, G. R. (1991). On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology, 36, 102-104.

Loftus, G. R. (1993). A picture is worth a thousand p values: On the irrelevance of hypothesis testing in the microcomputer age. Behavior Research Methods, Instruments, & Computers, 25, 250-256.

Loftus, G. R. (1995). Data analysis as insight: Reply to Morrison and Weaver. Behavior Research Methods, Instruments, & Computers, 27, 57-59.

Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-171.

Meehl, P. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Meehl, P. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy: A reader. Chicago: Aldine.

Morrison, D. E., & Weaver, B. (1995). Exactly how many p values is a picture worth? A commentary on Loftus's plot-plus-error-bar approach. Behavior Research Methods, Instruments, & Computers, 27, 52-56.

Myers, J. S. (1979). Fundamentals of experimental design (3rd ed.). Boston: Allyn & Bacon.

Nelson, N., Rosenthal, R., & Rosnow, R. (1986). Interpretation of significance levels and effect sizes by psychological researchers. American Psychologist, 41, 1299-1301.

Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A, 175-240.

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Chichester: John Wiley & Sons.

Olejnik, S. F. (1984). Planning educational research: Determining the necessary sample size. Journal of Experimental Education, 53, 40-48.

Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112, 160-164.

Ray, J. W., & Shadish, W. R. (1996). How interchangeable are different estimators of effect size? Journal of Consulting and Clinical Psychology, 64, 1316-1325.

Richardson, J. T. E. (1996). Measures of effect size. Behavior Research Methods, Instruments, & Computers, 28, 12-22.

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.

Rosenthal, R. (1992). Effect size estimation, significance testing, and the file-drawer problem. Journal of Parapsychology, 56, 57-58.

Rossi, J. S. (1990). Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology, 58, 646-656.

Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.

Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.

Schmidt, F. L., & Hunter, J. E. (1978). Moderator research and the law of small numbers. Personnel Psychology, 31, 215-232.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.

Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research. American Psychologist, 40, 73-83.

Serlin, R. C. (1993). Confidence intervals and the scientific method: A case for Holm on the range. Journal of Experimental Education, 61, 350-360.

Shrout, P. E. (1997). Should significance tests be banned? Introduction to a special section exploring the pros and cons. Psychological Science, 8, 1-2.

Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26-30.

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.

Vaughan, G. M., & Corballis, M. C. (1969). Beyond tests of significance: Estimating strength of effects in selected ANOVA designs. Psychological Bulletin, 72, 204-213.

Wilson, W., Miller, H. L., & Lower, J. S. (1967). Much ado about the null hypothesis. Psychological Bulletin, 67, 188-196.

Endnotes

1. The word “frequentist” is emphasized here to define those approaches that assume probability to be a long-term frequency. Although Bayesian approaches have been deemed extremely fruitful as a way of testing hypotheses (e.g., see Gill, 2002), to adopt a Bayesian paradigm requires the assumption that probability is a subjective phenomenon, and is not limited to long-term frequency. Arguing for a Bayesian interpretation of probability is beyond the scope of this article. Only procedures assuming a frequentist approach to probability will be considered.

2. An exception to this is Howell (1989) who provides a good explanation of effect size.

3. Three other journals were also surveyed and showed more promising reports of magnitude of effect measures. Journal of Applied Psychology had an effect size reporting rate of 77%, Journal of Educational Psychology had an effect size reporting rate of 55%, and Journal of Personality and Social Psychology had an effect size reporting rate of 47%.

4. I use the word “conclusions” hesitantly. My goal is not to imply that conclusions should be based on either the p value or the effect size alone. Rather, if NHST is to be used, both indicators should be employed in data analysis and interpretation.

5. By “meaningfulness,” I mean that a good study may have been performed, but because power was not calculated, the experiment might be dismissed as not providing evidence for the theory under test. Thus, although the study may indeed be “meaningful” (showing a large effect size, for instance), a researcher might mistakenly consider it meaningless because it lacked the power to reject the null hypothesis; instead of being published, the manuscript will find itself on an office shelf, forever ignored.