1997 Research Methods Forum
No. 2 (Summer 1997)
Validity, Variance, and the Interpretation of Power Values
Jose M. Cortina
George Mason University
There has been some recent interest in the power differences between extreme groups and observational designs (e.g., McClelland & Judd, 1993; Cortina & DeShon, under review). For example, it has been suggested that extreme groups designs have more power to detect moderator effects than do observational designs (McClelland & Judd, 1993; Champoux & Peters, 1987). An extreme groups design is typically an experimental design in which each independent variable has values only at the extremes of its distribution, whereas an observational design, when applied to continuous variables, results in roughly continuous distributions with relatively few values in the multivariate extremes. This difference would result in considerable differences in independent variable variances between the two designs, with the extreme groups design producing an independent variable with a variance that is several times that yielded by the observational design. This difference in variances, in turn, results in differences in power values (McClelland & Judd, 1993). My purpose here is to show that this difference in variance values has implications for validity which, in turn, has implications for the interpretation of power values.
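The link between predictor variance and power values can be made concrete with a standard result that the text does not state explicitly. As a simplifying sketch, assume a simple linear regression of Y on a single predictor X with homoscedastic error variance \sigma_\varepsilon^2. The moderator case discussed by McClelland & Judd (1993) concerns a product term rather than a single predictor, so this is an illustration similar in spirit rather than their derivation, and the symbols below are introduced here only for that purpose.

```latex
\[
\operatorname{Var}(\hat{b})
  \;=\; \frac{\sigma_\varepsilon^{2}}{\sum_{i=1}^{N}\,(x_i-\bar{x})^{2}}
  \;=\; \frac{\sigma_\varepsilon^{2}}{(N-1)\,s_X^{2}}
\]
```

Holding N and the error variance fixed, a design that multiplies the predictor variance s_X^2 by some factor divides the sampling variance of the slope estimate by the same factor. This is the sense in which a design with an inflated predictor variance yields larger power values.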
What Is Power?
First, there is the statistical answer with which we are all familiar. Power is the probability of a sample value exceeding some probability criterion given that the null hypothesis, whatever that might be, is false. This is what we are all taught, but this formulation is likely to lead to misinterpretation unless an unstated assumption is met. It is an assumption of validity.
The Assumption of Validity
Consider the diagram in Figure 1 below. Figure 1 is similar to that presented in Binning & Barrett (1989). First, we have two constructs, A and B, the relationship between which is of primary interest. This relationship is represented as Path 1. The constructs A and B are the constructs of interest, where "constructs of interest" is defined simply as the constructs that are described/mentioned in the Introduction section of a given paper and all of the characteristics associated with the typical understanding of those constructs. We typically make inferences about Path 1 based on Path 4, which links the measures or operationalizations of Constructs A and B.

For such a case, we would define the p-value as the probability of achieving a value as extreme as that which represents Path 4 given that the value associated with Path 1 is zero, and power as the probability that the value associated with Path 4 will exceed some probability criterion given that Path 1 is some value other than zero. But what happens to this formulation if Path 2 or 3 is zero? In other words, what happens if one of the measures lacks validity? The answer is that the link between Path 4 and Path 1 is severed. Path 4 has nothing to do with Path 1, except in a coincidental way. Consider the following example.
Suppose that I am interested in assessing the relationship between intelligence and academic achievement. Suppose that I operationalize intelligence as fingernail width and academic achievement as toenail width. I then compute the correlation between my two measures and find a statistically significant correlation. Does this finding justify conclusions about Path 1 (i.e., the intelligence/academic achievement relationship)? Of course not. Likewise, the p-value associated with the observed correlation has nothing to do with an assumption about the Path 1 value linking intelligence and academic achievement. It is not the case that p is the probability of achieving a result as extreme as Path 4 (the fingernail width/toenail width relationship) given that Path 1 (the intelligence/academic achievement relationship) is zero, because the null hypothesis has nothing to do with Path 1 in this example. Instead, the null on which the p-value is based has to do with the constructs that are actually measured by my operationalizations, and p is the probability of achieving a result as extreme as Path 4 given that the relationship between the constructs that are actually being measured is zero. Similarly, it is not the case in this example that power is the probability that the value associated with Path 4 will exceed some probability criterion given that Path 1 is nonzero. Power is instead the probability of achieving a result that exceeds a given probability criterion given that the relationship between the constructs that are actually being measured is nonzero.
Thus, because of the lack of validity associated with the measures in this design, we can make no statements about the constructs of interest (i.e., the constructs described in the Introduction section of a given paper), and we cannot interpret values such as p and power as if they were based on assumptions about the relationship between the constructs of interest. The links between the measures and the constructs of interest have been broken, and because useful interpretation of values such as p and power depends on these links, these values are not interpretable except in terms of constructs that are not of interest.
The Comparison of Extreme Groups Designs and Observational Designs: An Example
McClelland & Judd (1993) and McClelland (1997) described some of the advantages of the extreme groups design. It is, however, important to consider the fact that the same characteristics of this type of design that lead to the advantages described in these papers also produce serious limitations relative to the observational design. To illustrate, consider research on the effects of mood differences.
"Mood" can be roughly defined as a transient affective state and is usually induced through the use of relatively benign stimuli such as cookies (Brief, Butcher, & Roberson, 1995), scents, the recalling of past life events, (Bower, 1981), etc. Let us assume that, at the population level, "mood" is a normally distributed variable with some population variance s2. Much of the published mood research involves experimental designs in which mood is manipulated such that good moods are induced in some participants and bad moods are induced in others. Thus, assuming an adequate manipulation, each participant is either high on the mood variable or low on the mood variable. This observed, dichotomous variable will have a variance equal to s2. Moreover, because the sample values are extremely grouped to some extent, they will produce a variance value that far exceeds the corresponding population value.
Before continuing with this example, let us briefly explore the relationship between variance and validity. The importance of validity is well known, but most conceptualizations of validity do not explicitly take variability of constructs into account. Cook & Campbell (1979) stated that the inferences drawn from designs which categorize continuous variables must be made with caution. Although these authors were referring primarily to techniques such as the median split, their statement also applies to the present case. The issue is one of validity, as validity is nothing more (or less) than the extent to which we can draw inferences about the constructs of interest (APA, AERA, NCME, 1985). By 'extent', I mean not only confidence in inferences, but also breadth of inferences (cf. Lawshe, 1985; Landy, 1986). Cook & Campbell (1979), in their discussion of "construct validity", use as an example research involving the effects of supervisory distance (p. 61). Specifically, these authors point out that it is dangerous to generalize from a limited number of levels of the distance variable (e.g., ten feet or less) to the general construct "supervisory distance". Thus, the Introduction and Discussion sections of a paper in which this limited number of levels of the independent variable was used should contain only qualified references to the distance construct such as 'supervisory distance from ten feet versus two feet'. This is unfortunate in that we are typically interested in generalizing to more than a couple of isolated levels. Nevertheless, to the extent that such qualifiers are not present, our inferences go beyond those which are supportable from the data, and the validity of our manipulation/measure vis-à-vis such broad inferences can be called into question. Note that this is not merely an issue of generalizing from one set of people, occasions, or settings to the next. Instead, this is an issue of the extent to which the values included in our measure reflect the language used in our Introduction and Discussion sections.
Now, let us return to the "mood" example. Suppose we wish to assess the relationship between mood and life satisfaction. One approach to this experiment might be to simply measure the moods of participants as they arrive at the experimental setting and administer the satisfaction measure. Assuming an adequate sampling strategy and acceptable measures, this approach would yield a "mood" variable whose distribution would be a reasonable approximation of the corresponding population of values for the "mood" construct. This distribution might be expected to be continuous and roughly normal. Further, the approach might suggest that there is some noticeable, if not overwhelming, relationship between mood and satisfaction.
Alternatively, we might take an extreme groups approach. For example, we might assign people to either a "low" mood group in which we have the bosses of the participants meet the participants immediately prior to the experiment and tell them that they have been fired from their jobs effective immediately, or a "high" mood group in which their bosses tell them that they have been promoted from entry-level employee to company VP, thus tripling their salaries and providing stock options and access to a company car. We might analyze these data and find that mood does have an overwhelming impact on life satisfaction. In fact, one could scarcely imagine, in light of the data, that life satisfaction ratings are determined by anything else. This conclusion is, at best, wildly exaggerated, and the reason is that the mood variable had too much variance. It had so much variance, as a result of the strategy used for sampling levels of the independent variable, that it no longer made sense to refer to it as mood unless we included in our references to and inferences about the construct a variety of qualifications and ceteris paribus clauses. Thus, in the extreme groups design, as with the example from Cook & Campbell (1979), the use of a particular set of values for the independent variable precludes the use of unqualified statements about the constructs of interest.
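The variance inflation at issue in the mood example can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the text: mood is standardized (mean 0, SD 1), the extreme groups manipulation places participants exactly 2 SDs below or above the population mean, and the function names and the value k = 2 are hypothetical choices made for illustration.

```python
import random
import statistics


def observational_sample(n, mu=0.0, sigma=1.0, seed=0):
    """Observational design: mood scores drawn from a normal population."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


def extreme_groups_sample(n, mu=0.0, sigma=1.0, k=2.0):
    """Extreme groups design: half the participants sit k SDs below the
    population mean and half sit k SDs above it (k is an assumed value)."""
    half = n // 2
    return [mu - k * sigma] * half + [mu + k * sigma] * half


n = 1000
obs = observational_sample(n)
ext = extreme_groups_sample(n)

# Variance of the independent variable under each design.
var_obs = statistics.pvariance(obs)
var_ext = statistics.pvariance(ext)

print(f"observational variance:  {var_obs:.2f}")
print(f"extreme groups variance: {var_ext:.2f}")
print(f"ratio (extreme / observational): {var_ext / var_obs:.1f}")
```

With two equal groups placed at k SDs on either side of the mean, the variance of the manipulated variable is k² times the population variance, so the observed values no longer resemble the population distribution of the construct.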
Consider the issue from one final perspective. Power is determined by α, σ², N, the difference between the true and null distributions, and the type of test. The best way to increase power is to increase sample size. One can also change the power of a test by changing the probability cutoff or changing the type of test (e.g., one- vs. two-tailed). Anything having to do with the population of values is, by definition, beyond the reach of the experimenter who desires a more powerful design. The Central Limit Theorem says that the variance of the sampling distribution of the mean is equal to the variance of the individual scores in the population divided by the sample size. All else being equal, the greater the variance of the sampling distribution, the less power a design has. Greater amounts of sampling error make the identification of effects more difficult, i.e., they reduce power. What is it that determines the variance of the sampling distribution other than N? The answer is the population variance. If the sample variance approximates the population variance, then the sample variance can be used to estimate power. If not, then the sample variance has nothing to do with power (again, vis-à-vis the constructs of interest). In other words, it is not possible to increase power by increasing the sample variance. Artificially creating a discrepancy between the sample and population variances simply causes validity problems, and it makes no more sense to talk about increasing power by increasing sample variance than it does to talk about increasing power by using a measure of a predictor construct that is more strongly related to the criterion of interest. This doesn't increase power; it decreases the extent to which one can draw inferences about the relationships among the constructs described in the hypotheses.
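The determinants of power listed above can be sketched numerically. The example below assumes the simplest possible case, a one-sample z-test on a mean with known population standard deviation; that choice of test, the function name z_test_power, and the numeric values are illustrative assumptions, not something taken from the studies discussed here.

```python
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05, two_tailed=True):
    """Power of a one-sample z-test.

    delta: difference between the true mean and the null mean
    sigma: population standard deviation (assumed known)
    n:     sample size
    alpha: probability cutoff
    The one-tailed branch assumes the effect lies in the positive direction.
    """
    std_normal = NormalDist()
    se = sigma / n ** 0.5          # SD of the sampling distribution of the mean
    shift = delta / se             # true effect in standard-error units
    if two_tailed:
        z_crit = std_normal.inv_cdf(1 - alpha / 2)
        upper = 1 - std_normal.cdf(z_crit - shift)   # rejection in the upper tail
        lower = std_normal.cdf(-z_crit - shift)      # rejection in the lower tail
        return upper + lower
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - shift)


# Power rises with N and falls as the population SD grows.
for n in (16, 32, 64):
    for sigma in (1.0, 2.0):
        power = z_test_power(delta=0.5, sigma=sigma, n=n)
        print(f"N={n:3d}  sigma={sigma:.1f}  power={power:.3f}")
```

In this sketch, sigma is a property of the population, so the only arguments under the experimenter's control are n, alpha, and the choice of a one- or two-tailed test.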
Let me conclude by saying that it is axiomatic that decreases in validity will never lead to increases in statistical power. The extreme groups design produces inflated variances (relative to the corresponding population values when those populations are normally distributed), and these inflated variances limit the inferences that can be drawn with respect to the constructs of interest; that is, they result in a reduction of construct validity. In terms of Figure 1, Path 2 or 3 is weakened, and once again, it makes little sense to talk about the probability of rejecting the null based on Path 4 given that Path 1 is nonzero because the link between Path 4 and Path 1 has been compromised. The "given" has less to do with Path 1 in the extreme groups design than in the observational design. Comparing the power of the two designs is therefore akin to comparing apples and oranges, particularly when one is primarily interested in oranges.
References
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, D.C.: American Psychological Association.
Binning, J.F., & Barrett, G.V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478-494.
Bower, G.H. (1981). Mood and memory. American Psychologist, 36, 129-148.
Brief, A.P., Butcher, A.H., & Roberson, L. (1995). Cookies, disposition, and job attitudes: The effects of positive mood-inducing events and negative affectivity on job satisfaction in a field experiment. Organizational Behavior and Human Decision Processes, 62, 55-62.
Champoux, J.E., & Peters, W.S. (1987). Form, effect size, and power in moderated regression analysis. Journal of Occupational Psychology, 60, 243-255.
Cook, T.D. & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand-McNally.
Cortina, J.M., & DeShon, R.P. (1997). Extreme groups vs. observational designs: Issues of appropriateness in applied research designs. (Manuscript under review)
Landy, F.J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183-1192.
Lawshe, C.H. (1985). Inferences from personnel tests and their validities. Journal of Applied Psychology, 70, 237-238.
McClelland, G.H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19.
McClelland, G.H., & Judd, C.M. (1993). Statistical difficulties of detecting interaction and moderator effects. Psychological Bulletin, 114, 376-390.