Missing Data: Instrument-Level Heffalumps and Item-Level Woozles
PHILIP L. ROTH
Department of Management
Clemson University
rothp@clemson.edu
FRED S. SWITZER III
Department of Psychology
Clemson University
switzef@clemson.edu
Winnie the Pooh was sitting at his laptop in the Hundred Acre Wood, puzzling over a problem. It seems Pooh had been offered a great deal of honey to analyze a data set about other bears. “Oh pooh,” said Pooh. “I have a great deal of missing data and I simply don’t know what to do. Maybe Christopher Robin will understand this.” Christopher Robin, having read the missing data literature (trying to overcome insomnia), noted that missing data were rather like heffalumps and woozles. In both cases, the creatures had never been seen, but had been hunted and studied a great deal. The parallel to heffalumps and woozles was also quite striking because missing an entire instrument is a very different “creature” than missing a few responses to items in a scale designed to measure the same underlying construct.
The purpose of this paper is to provide a brief overview of each of the two missing data situations, and to show the importance of considering which elusive creature a researcher might be hunting. We find that much of the previous literature does not distinguish between missing data at the item level and missing data at the instrument level. Failure to make this distinction can muddle one’s treatment of missing data in important situations.
Data Missing from Entire Instruments: The Heffalump
Missing data research has typically (and almost exclusively) focused on situations in which an entire “measurement instrument” was missing. This is not surprising given that a great deal of missing data research has been conducted in demographics, statistics, economics, and related fields. Researchers often measure a variable, such as income, opinion about a political issue, or amount of money spent on a certain activity, with a single item. Alternatively, some researchers use tests (e.g., cognitive ability) to measure a construct, and that entire test might be missing for some respondents. Researchers are left with options such as deleting data or imputing estimates of the missing data by using scores from measures of other variables. Researchers evaluating missing data techniques (MDTs) in such situations would develop a study in which 10% to 30% of the data were missing on an instrument or two and then compare MDTs based on their variability around the true scores (e.g., root mean squared error). MDTs with smaller variability were generally preferred to those with greater variability. Unfortunately, there was much less investigation of bias (or average amount of error) resulting from use of various MDTs.
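The two yardsticks mentioned above, bias (average error) and root mean squared error (dispersion around the true value), can be sketched as follows. This is a minimal illustration rather than the procedure of any particular study; the function name `bias_and_rmse` and the five estimates are invented for the example.

```python
import math

def bias_and_rmse(estimates, true_value):
    """Return (bias, RMSE) of repeated estimates around a known true value."""
    errors = [est - true_value for est in estimates]
    bias = sum(errors) / len(errors)                             # average signed error
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))   # dispersion around truth
    return bias, rmse

# Hypothetical correlations recovered by one MDT across five simulated samples
# when the true correlation is .36 (numbers invented for illustration only).
estimates = [0.30, 0.34, 0.38, 0.29, 0.34]
bias, rmse = bias_and_rmse(estimates, true_value=0.36)
print(f"bias = {bias:.3f}, RMSE = {rmse:.3f}")
```

An MDT can show very little bias and still have a large RMSE if its estimates scatter widely on both sides of the true value, which is why the two are worth reporting separately.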
Summarizing the MDTs that have typically been used in this research is difficult in such a short article (readers are referred to Roth, 1994, for a conceptual review). First, researchers might use listwise deletion. For example, Pooh might have a data set with four variables for two hundred bears. Three of the variables, the Ursine Conscientiousness Test, the Ursine Honey Commitment Scale, and the Ursine Paw Dexterity Test, are designed to predict the fourth, the number of Honey Pots successfully consumed. With listwise deletion, data missing on any measure would result in dropping the entire case from analysis. While this technique is the most frequently used MDT in applied psychology (there appear to be no analyses of MDT use in other fields), a bear using this MDT will find his analyses subject to a decreasing level of power as more data are missing.
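Listwise deletion is simple enough to sketch in a few lines. The six bears and their scores below are invented, and `None` is used here (as an assumption of this sketch) to mark a missing instrument score.

```python
# Each row is one bear: (conscientiousness, commitment, dexterity, honey_pots).
# None marks a missing instrument score. All values are invented.
bears = [
    (4.1, 3.8, 5.0, 7),
    (3.2, None, 4.4, 5),
    (None, 4.5, 3.9, 6),
    (4.8, 4.9, 4.7, 9),
    (2.9, 3.1, None, 4),
    (3.7, 3.6, 4.1, 6),
]

def listwise_delete(rows):
    """Keep only the cases with no missing value on any variable."""
    return [row for row in rows if all(value is not None for value in row)]

complete = listwise_delete(bears)
print(f"{len(complete)} of {len(bears)} bears retained")
```

Each of the three incomplete bears is missing only one score, yet all of their remaining scores are discarded too, which is where the loss of power comes from.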
Other researchers might use pairwise deletion, in which they use as much of the data as possible. For example, ten bears might have failed to respond to the Ursine Honey Commitment scale. Their data could not be used to calculate correlations between Honey Commitment and other variables. However, their data could be used to calculate correlations among conscientiousness, dexterity, and honey pot consumption. Thus, pairwise deletion saves more data. Again, pairwise deletion is often used in applied psychology, but it has limitations because it reverts to listwise deletion for regression or other multivariate analyses. That is, it deletes a case if the case does not have data for all the needed variables.
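A small sketch of pairwise deletion for correlations follows. Each correlation uses every bear that has both scores, so the sample size differs from one pair of variables to the next. The data and the helper name `pairwise_corr` are invented for illustration.

```python
import math

# Invented scores for six bears; None marks a missing instrument.
data = {
    "conscientiousness": [4.1, 3.2, 3.9, 4.8, 2.9, 3.7],
    "commitment":        [3.8, None, 4.5, 4.9, None, 3.6],
    "dexterity":         [5.0, 4.4, 3.9, 4.7, 3.5, 4.1],
}

def pairwise_corr(xs, ys):
    """Pearson r using only the cases observed on both variables; returns (r, n)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

names = list(data)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r, n = pairwise_corr(data[a], data[b])
        print(f"{a} x {b}: r = {r:.2f} (n = {n})")
```

The correlations involving commitment rest on four bears, while conscientiousness and dexterity keep all six.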
Imputation (estimation) techniques are also available to researchers. Relatively simple ones include mean substitution, regression imputation, and hot-deck imputation. More complex procedures are available, such as the Expectation Maximization Algorithm (Dempster, Laird, & Rubin, 1977). We briefly cover the simple techniques. Mean substitution simply substitutes the mean for all bears in place of missing data. For example, the ten missing Ursine Conscientiousness scores would be replaced by the mean score for all bears. This approach is not often used (or bears don’t admit to using it), and it produces strongly downward-biased estimates of variances and covariances (Switzer, Roth, & Switzer, 1998).
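The downward bias in variance is easy to see in a toy example. Every imputed score sits exactly at the mean, so it adds nothing to the sum of squared deviations while still adding to the number of cases. The eight scores below are invented.

```python
import statistics

# Invented conscientiousness scores for eight bears; None marks missing data.
scores = [4.1, 3.2, None, 4.8, 2.9, None, 3.7, 4.4]

observed = [s for s in scores if s is not None]
grand_mean = statistics.mean(observed)

# Mean substitution: every missing score becomes the mean of the observed scores.
filled = [grand_mean if s is None else s for s in scores]

var_observed = statistics.variance(observed)   # sample variance, observed cases only
var_filled = statistics.variance(filled)       # sample variance after substitution
print(f"variance before: {var_observed:.3f}, after: {var_filled:.3f}")
```

With six observed and two imputed scores, the variance after substitution is 5/7 of the observed-case variance, because the divisor grows from 5 to 7 while the sum of squares stays the same.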
Regression imputation takes advantage of the relationships between variables to estimate missing data. Pooh could use regression imputation for the ten missing conscientiousness scores by computing a regression equation on the bears with complete data, where conscientiousness is the dependent variable and commitment and dexterity are the independent variables. This equation could then be used to impute likely values for the missing scores. Pooh could then analyze the data as if he had all the cases. This approach is seldom used in many business-related fields, but it does offer the advantage of utilizing existing relationships to make fairly good estimates of missing data. However, researchers should be cautioned not to use the independent variables of the substantive analysis to impute its dependent variable, or vice versa. This practice will overstate the independent-dependent relationships.
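Pooh’s regression imputation might look something like the sketch below. It fits ordinary least squares through the normal equations in plain Python so that it runs without extra packages; the helper names (`solve`, `fit_ols`, `predict`) and the data are invented, and a statistics package would normally do this work. Honey pots, the outcome of the later analysis, is deliberately left out of the imputation equation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(xs, ys):
    """Least-squares weights [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(x) for x in xs]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, x):
    """Predicted score for one case from its predictor values."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))

# Invented rows: (conscientiousness, commitment, dexterity); None = missing.
bears = [
    (4.1, 3.8, 5.0), (3.2, 3.0, 4.4), (None, 4.5, 3.9), (4.8, 4.9, 4.7),
    (2.9, 3.1, 3.5), (3.7, 3.6, 4.1), (None, 3.3, 4.6),
]

# Fit on complete cases only, then fill in the missing conscientiousness scores.
complete = [b for b in bears if b[0] is not None]
weights = fit_ols([b[1:] for b in complete], [b[0] for b in complete])
imputed = [b if b[0] is not None else (predict(weights, b[1:]),) + b[1:] for b in bears]
print(imputed)
```

Because every imputed score falls exactly on the regression surface, it carries no residual variation, which is one reason hot-deck imputation is sometimes preferred.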
Researchers might also use hot-deck imputation. Hot-deck approaches seek to find a case similar to the case with missing data. Then, hot-deck “steals” the value from the similar case and puts it in place of the missing datum (OK, it doesn’t really steal it, but “stealing” sounds better than “borrowing” with a name like “hot-deck”). For example, Pooh might estimate a missing conscientiousness score by finding the case with full data that is closest (e.g., in Euclidean distance) to the case with missing data, based on the commitment and dexterity variables. The two advantages of hot-deck are that: a) it uses relationships within the data to make estimates and b) the imputed score already has some error variance built into it, since it is an actual score (rather than a regression-based estimate). Other imputation techniques such as the Expectation Maximization Algorithm also take advantage of relationships in the data and iteratively estimate missing values. However, we do not cover them due to space limitations.
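One way to sketch the nearest-neighbor version of hot-deck described above is shown below. The function name `hot_deck_impute` and the data are invented, and the sketch assumes each incomplete case is missing only the target variable. It uses raw Euclidean distance; whether to standardize the variables first is a choice left to the researcher.

```python
import math

def hot_deck_impute(rows, target=0):
    """Fill a missing value in column `target` with the value from the nearest
    complete case, measured by Euclidean distance on the other columns."""
    donors = [r for r in rows if all(v is not None for v in r)]
    filled = []
    for row in rows:
        if row[target] is None:
            keys = [i for i in range(len(row)) if i != target]
            point = [row[i] for i in keys]
            donor = min(donors, key=lambda d: math.dist([d[i] for i in keys], point))
            row = row[:target] + (donor[target],) + row[target + 1:]
        filled.append(row)
    return filled

# Invented rows: (conscientiousness, commitment, dexterity); None = missing.
bears = [
    (4.1, 3.8, 5.0), (3.2, 3.0, 4.4), (None, 4.5, 3.9), (4.8, 4.9, 4.7),
    (2.9, 3.1, 3.5), (3.7, 3.6, 4.1), (None, 3.3, 4.6),
]

filled = hot_deck_impute(bears)
print(filled[2], filled[6])
```

The third bear borrows its conscientiousness score from the fourth bear, and the last bear borrows from the second, because those donors are the closest on commitment and dexterity.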
Recommendations for Techniques to Handle the Heffalump
Several thoughts might help researchers cope with missing data when entire instruments are missing. First, just about any technique will probably do if the amount of missing data is small (though we really do dislike mean substitution as do many reviewers). Missing data in the range of 5% or under (for a given variable) represents a situation in which researchers probably should not worry too much about their choice of MDTs. Even some situations with under 10% missing data are not too problematic unless the data set is quite small or relevant covariances are quite small.
Second, researchers might want to consider the accuracy of the estimation procedures. There is some debate whether the estimation procedures (e.g., regression imputation) are superior to deletion procedures (listwise and pairwise). Our research has found that the deletion procedures often produce little to no bias and small levels of dispersion around true scores (Roth & Switzer, 1995; Switzer et al., 1998), but that regression imputation works fairly well too. Others have found less dispersion around true scores with regression imputation (Kim & Curry, 1977). We do note, though, that none of the previous procedures provides extremely accurate estimation of missing values (probably because the variables available for imputation are not too highly related to each other in many business settings). Our experience also suggests that neither the deletion strategies nor the imputation strategies usually overestimate measures of covariance such as correlations or regression weights. Instead, they more often underestimate covariances.
Third, there is little research on how various patterns of missing data affect MDTs. Partially, this is a function of the fact that most organizational disciplines don’t study how data are missing (e.g., are data missing in one tail of the distribution, randomly from all parts of the distribution, etc.). A couple of moderately severe non-random patterns (such as more data loss in both high scores and low scores or more data loss as scores increase) did not greatly influence the effectiveness of MDTs (Switzer et al., 1998). Thus, there is some evidence that moderate patterns are not too problematic.
Typically, we suggest researchers examine the pattern of missing scores. Some researchers may feel comfortable simply choosing an MDT under the following circumstances.
If the previous conditions are not present, the choice of MDT is much more difficult. Individuals with a great deal of missing data or strong relationships between measured variables and the data’s “missingness” may wish to model why the data were missing (Graham & Donaldson, 1993). Also, there has been little investigation of patterns in which most of the data loss is from the middle of the distribution of a variable. This could be problematic and its consequences are not known.
In sum, there are several decent alternatives for dealing with missing data. We tend to like listwise deletion given its low level of bias (though we caution that it can markedly reduce power). We also like regression imputation for situations in which entire instrument scores are missing. We really loathe mean substitution on conceptual grounds (it does not use relationships within the data for estimation) and empirical grounds (strong downward biases observed in measures of covariance). Thus, the hunt for the elusive heffalump is truly a difficult methodological safari. One simply makes the best of the situation that is offered.
Data Missing in Multiple Item Scales: The Woozle
Recently, we have turned to examining situations in which data are missing in multiple item scales. For example, Pooh has a 7-item scale for Ursine Commitment to Honey. What happens if a bear failed to fill out two of the items (but did fill out the rest of the items)? This is a completely different situation than heffalump hunting (entire instrument missing) for two reasons. First, many organizational researchers find that they have data missing from just a few of many items measuring the same underlying thing or construct. Second, the items have moderate to relatively high intercorrelations. Thus, there is a single factor model underlying all of the responses to individual items, and there are often moderate inter-item correlations to support imputation approaches. We find it extremely interesting that only a few researchers have delineated heffalumps and woozles (e.g., Downey & King, 1998; Roth, Switzer, & Switzer, 1999) and that the delineation of these two different situations is only recently starting to emerge in many individuals’ thinking. We believe researchers should strongly consider which safari they are embarking on as they consider how to deal with missing data.
While the basic missing data techniques are somewhat similar, their implementation and the results are somewhat different for the woozle safari. Listwise deletion is more problematic because missing one item in a scale results in missing the scale score, which results in eliminating the entire case. Given that there are more items to miss, missing data “cascades” from item to scale score to measure of covariance into a potentially greater problem. Thus, the problem of power could become acute even with relatively little missing data. Likewise, pairwise deletion is problematic. One missing item means the total scale score is lost, and all statistical procedures that need the scale score (e.g., multiple regression) delete that case.
At the same time that deletion approaches become less attractive, imputation approaches become more attractive because of the multiple items that are designed to measure the same underlying construct. There are several relatively simple ways one could impute items on a scale.
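The cascade can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that each item goes missing independently with the same probability; real data seldom behave so neatly, and the function name is invented.

```python
def expected_complete_fraction(p_item_missing, n_items):
    """Share of cases with every item answered, if items go missing independently
    with the same probability (a simplifying assumption of this sketch)."""
    return (1.0 - p_item_missing) ** n_items

# With 5% missing data on each item, longer scales lose many more scale scores.
for n_items in (1, 3, 7):
    share = expected_complete_fraction(0.05, n_items)
    print(f"{n_items} item(s): about {share:.0%} of cases keep a scale score")
```

Under this assumption, a rate of missing data that looks harmless for a single item costs roughly three cases in ten on a 7-item scale once any missing item wipes out the scale score.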
Pooh could use regression imputation. In the case of missing items, imputation procedures now occur within the multiple measures of the same construct. As such, Pooh would develop a regression equation based on cases in which he had complete data. For example, let’s say 12 bears did not answer the first two items on the commitment scale. Pooh develops regression equations in which the five items that were universally answered are the independent variables and each of the two unanswered items, in turn, is the dependent variable. Pooh then uses the equations to impute the values for the unanswered items. Once he has imputed these items, he could add up all the items for a scale score (or otherwise score the items) and continue with any analyses he wanted to conduct. While this explanation is quite straightforward, the number of regression equations needed to implement this approach with a moderate sized data set can be large and unwieldy if proper software is not available.
Pooh (and other researchers) could also use even simpler ways to impute the missing item scores. Some researchers have simply taken the mean for all respondents and used that score to estimate the missing score. For example, we could simply take the mean response to item 1 across all bears and replace any missing data on item 1 with that mean. Other researchers suggest taking the mean of all items measuring the same construct within a bear and using that mean as the estimate. For example, one might take the mean score for a given bear on items 2-7 to estimate that bear’s item 1 response. Some people term this the person-mean or mean-person approach (or maybe mean-bear if no honey is available).
Pooh could also use a form of hot-deck imputation. Pooh could find the complete case with the least distance on items 2-7 from the case that needed data on item 1. Then Pooh “borrows” that case’s item 1 score and uses it to estimate the missing datum.
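To see why the bookkeeping grows quickly, one can count the distinct patterns of missing items in a data set. Each pattern calls for one equation per missing item, with that bear’s answered items as the predictors. The responses below are invented, and `missing_pattern` is a hypothetical helper.

```python
from collections import Counter

# Invented responses to a 7-item scale; None marks an unanswered item.
responses = [
    (None, None, 4, 5, 4, 3, 4),
    (None, None, 3, 3, 2, 3, 3),
    (5, 4, 4, None, 5, 4, 4),
    (2, 3, None, 3, 2, None, 2),
    (4, 4, 4, 5, 4, 4, 4),
]

def missing_pattern(row):
    """Positions (0-based) of the unanswered items in one bear's responses."""
    return tuple(i for i, value in enumerate(row) if value is None)

# Count how many incomplete bears share each pattern of missing items.
patterns = Counter(missing_pattern(r) for r in responses if None in r)

# One regression equation per missing item within each distinct pattern.
equations_needed = sum(len(pattern) for pattern in patterns)
print(f"{len(patterns)} patterns, {equations_needed} regression equations")
```

Even these five bears produce three patterns and five equations; a 7-item scale allows 126 patterns in which at least one item is missing and at least one is answered.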
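The two mean-based approaches differ only in which mean is borrowed, as the sketch below tries to show. The function names and the four bears’ responses are invented; a bear who skipped every item would need separate handling, since there is no person mean to compute.

```python
def item_mean_impute(rows):
    """Replace a missing item with that item's mean across all respondents."""
    n_items = len(rows[0])
    item_means = []
    for j in range(n_items):
        answered = [r[j] for r in rows if r[j] is not None]
        item_means.append(sum(answered) / len(answered))
    return [[item_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def person_mean_impute(rows):
    """Replace a missing item with the respondent's own mean on answered items."""
    filled = []
    for r in rows:
        answered = [v for v in r if v is not None]
        person_mean = sum(answered) / len(answered)
        filled.append([person_mean if v is None else v for v in r])
    return filled

# Invented responses of four bears to a 7-item commitment scale; None = missing.
scale = [
    [None, 5, 4, 5, 4, 5, 4],
    [2, 1, 2, 2, 1, 2, 2],
    [4, 3, None, 3, 4, 3, 4],
    [3, 3, 3, 4, 3, 3, 3],
]

print(item_mean_impute(scale)[0][0])     # item 1 mean across the other bears
print(person_mean_impute(scale)[0][0])   # the first bear's own mean on items 2-7
```

For the first bear, the item mean pulls the estimate toward the rest of the sample, while the person mean stays at that bear’s own consistently high level of commitment.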
Recommendations for Techniques to Tame the Woozle
As with missing data for entire instruments, missing a very small amount of data usually allows one a great deal of discretion in the choice of MDT. As long as the total N available for analysis does not drop a great deal, the choice of MDT is often not of great importance. However, note again that missing a relatively small amount of data at the item level can lead to a large amount of data loss when using deletion techniques.
It appears that the accuracy of estimating measures of covariance when items are missing is much greater than when entire instruments are missing. Past research at the instrument level found levels of bias were often in the neighborhood of .06 to .11 when estimating correlations with true values that averaged .36 and 10 to 20% of the data were missing (Switzer et al., 1998). Interestingly, bias for most MDTs with missing item level data was in the third decimal place (Roth et al., 1999). For example, regression imputation bias was .002 when 2 items could be missing from a seven item scale with an average inter-item correlation of .30. Across all statistics (correlations, multiple Rs, and regression weights) and patterns of missing data, one article suggests that the most promising technique is to substitute the person’s mean response for their missing data (Roth et al., 1999). It is simple to implement and quite robust to average inter-item correlations and the three studied deletion patterns.
We should also note some interesting findings regarding listwise deletion in multiple item scales. While the bias was fairly low, the dispersion (e.g., root mean square error) was much higher for listwise deletion than for any imputation procedure (e.g., mean substitution across respondents, mean substitution within respondents, regression imputation, and hot-deck imputation). For correlations, the root mean square error for most imputation techniques was often around .02 to .03 when one item could be missing from a three item scale and 20% of the data were missing on that item. The root mean square error for listwise deletion was usually .08, or about three to four times as large. This suggests more variability in results across studies. Further, there was a great loss of power as sample sizes dropped from 400 to 267 in many cases with one item of missing data. Thus, power was probably greatly reduced for such analyses. Overall, we discourage routine use of listwise deletion for item level missing data.
Conclusion
While we have only been able to provide a brief overview of some MDTs, we might suggest several things to try to overcome the fact that “heffalumps and woozles are quite confusil.” First, low levels of missing data (usually at 5% or less) are seldom problematic and do not require a great deal of thought by researchers. Second, missing data at the instrument level and the item level are two completely different “animals.” Missing data at the instrument level may be handled by listwise deletion, pairwise deletion, or regression imputation. Missing data at the item level may be handled with substitution of mean response of the person (or bear) and routine or default use of listwise deletion may be discouraged.
References
Cohen, J. & Cohen, P. (1983). Applied multiple regression/correlational analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Downey, R. G., & King, C. V. (1998). Missing data in Likert ratings: A comparison of replacement methods. Journal of General Psychology, 125, 175-189.
Graham, J. W., & Donaldson, S. W. (1993). Evaluating interventions with differential attrition: the importance of nonresponse mechanisms and use of follow-up data. Journal of Applied Psychology, 78, 119-128.
Kim, J. O., & Curry, J. (1977). The treatment of missing data in multivariate analyses. Sociological Methods and Research, 6, 215-241.
Roth, P. L. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560.
Roth, P. L., & Switzer, F. S. III. (1995). A Monte Carlo analysis of missing data techniques in an HRM setting. Journal of Management, 21, 1003-1023.
Roth, P. L., Switzer, F. S. III, & Switzer, D. (1999). Missing data in multi-item scales: A Monte Carlo analysis of missing data techniques. Organizational Research Methods, 2, 211-232.
Switzer, F. S. III, Roth, P. L., & Switzer, D. M. (1998). Systematic data loss in HRM settings: A Monte Carlo analysis. Journal of Management, 24, 763-779.