Category Archives: regression analysis

Introducing Mega-analysis

How to find truth in an ocean of correlations – with breakers, still waters, tidal waves, and undercurrents? In the old age of responsible research and publication, we would collect estimates reported in previous research, and compute a correlation across correlations. Those days are long gone.

In the age of rat race research and publication it became increasingly difficult to do a meta-analysis. It is a frustrating experience for anyone who has conducted one: endless searches on the Web of Science and Google Scholar to collect all published research, input the estimates in a database, find that a lot of fields are blank, email authors for zero-order correlations and other statistics they had failed to report in their publications and get very little response.

Meta-analysis is not only a frustrating experience, it is also a bad idea when results that authors do not like do not get published. A host of techniques have been developed to find and correct publication bias, but the problem that we do not know the results that do not get reported is not solved easily.

As we enter the age of open science,  we do not have to rely any longer on the far from perfect cooperation from colleagues who have moved to a different university, left academia, died, or think you’re trying to prove them wrong and destroy your career. We can simply download all the raw data and analyze them.

Enter mega-analysis: include all the data points relevant for a certain hypothesis, cluster them by original publication, date, country, or any potentially relevant property of the research design, and add the substantial predictors you find documented in the literature. The results reveal not only the underlying correlations between substantial variables, but also the differences between studies, periods, countries and design properties that affect these correlations.

The method itself is not new. In epidemiology, and Steinberg et al. (1997) labeled it ‘meta-analysis of individual patient data’. In human genetics, genome wide association studies (GWAS) by large international consortia are common examples of mega-analysis.

Mega-analysis includes the file-drawer of papers that never saw the light of day after they were put in. It also includes the universe of papers that have never been written because the results were unpublishable.

If meta-analysis gives you an estimate for the universe of published research, mega-analysis can be used to detect just how unique that universe is in the milky way. My prediction would be that correlations in published research are mostly further from zero than the same correlation in a mega-analysis.

Mega-analysis bears great promise for the social sciences. Samples for population surveys are large, which enables optimal learning from variations in sampling procedures, data collection mode, and questionnaire design. It is time for a Global Social Science Consortium that pools all of its data. As an illustration, I have started a project on the Open Science Framework that mega-analyzes generalized social trust. It is a public project: anyone can contribute. We have reached mark of 1 million observations.

The idea behind mega-analysis originated from two different projects. In the first project, Erik van Ingen and I analyzed the effects of volunteering on trust, to check if results from an analysis of the Giving in the Netherlands Panel Survey (Van Ingen & Bekkers, 2015) would replicate with data from other panel studies. We found essentially the same results in five panel studies, although subtle differences emerged in the quantative estimates. In the second project, with Arjen de Wit and colleagues from the Center for Philanthropic Studies at VU Amsterdam, we analyzed the effects of volunteering on well-being conducted as part of the EC-FP7 funded ITSSOIN study. We collected 845.733 survey responses from 154.970 different respondents in six panel studies, spanning 30 years (De Wit, Bekkers, Karamat Ali & Verkaik, 2015). We found that volunteering is associated with a 1% increase in well-being.

In these projects, the data from different studies were analyzed separately. I realized that we could learn much more if the data are pooled in one single analysis: a mega-analysis.

References

De Wit, A., Bekkers, R., Karamat Ali, D., & Verkaik, D. (2015). Welfare impacts of participation. Deliverable 3.3 of the project: “Impact of the Third Sector as Social Innovation” (ITSSOIN), European Commission – 7th Framework Programme, Brussels: European Commission, DG Research.

Van Ingen, E. & Bekkers, R. (2015). Trust Through Civic Engagement? Evidence From Five National Panel StudiesPolitical Psychology, 36 (3): 277-294.

Steinberg, K.K., Smith, S.J., Stroup, D.F., Olkin, I., Lee, N.C., Williamson, G.D. & Thacker, S.B. (1997). Comparison of Effect Estimates from a Meta-Analysis of Summary Data from Published Studies and from a Meta-Analysis Using Individual Patient Data for Ovarian Cancer Studies. American Journal of Epidemiology, 145: 917-925.

Leave a comment

Filed under data, methodology, open science, regression analysis, survey research, trends, trust, volunteering

Why a high R Square is not necessarily better

Often I encounter academics thinking that a high proportion of explained variance is the ideal outcome of a statistical analysis. The idea is that in regression analyses a high R Square is better than a low R Square. In my view, the emphasis on a high R2 should be reduced. A high R2 should not be a goal in itself. The reason is that a higher R2 can easily be obtained by using procedures that actually lower the external validity of coefficients.

It is possible to increase the proportion of variance explained in regression analyses in several ways that do not in fact our ability to ‘understand’ the behavior we are seeking to ‘explain’ or ‘predict’. One way to increase the R2 is to remove anomalous observations, such as ‘outliers’ or people who say they ‘don’t know’ and treat them like the average respondent. Replacing missing data by mean scores or using multiple imputation procedures often increases the Rsquare. I have used this procedure in several papers myself, including some of my dissertation chapters.

But in fact outliers can be true values. I have seen quite a few of them that destroyed correlations and lowered R squares while being valid observations. E.g., a widower donating a large amount of money to a charity after the death of his wife. A rare case of exceptional behavior for very specific reasons that seldom occur. In larger samples these outliers may become more frequent, affecting the R2 less strongly.

Also ‘Don’t Know’ respondents are often systematically different from the average respondent. Treating them as average respondents eliminates some of the real variance that would otherwise be hard to predict.

Finally, it is often possible to increase the proportion of variance explained by including more variables. This is particularly problematic if variables that are the result of the dependent variable are included as predictors. For instance if network size is added to the prediction of volunteering the R Square will increase. But a larger network not only increases volunteering; it is also a result of volunteering. Especially if the network questions refer to the present (do you know…) while the volunteering questions refer to the past (in the past year, have you…) it is dubious to ‘predict’ volunteering in the past by a measure of current network size.

As a reviewer, I give authors reporting an R2 exceeding 40% a treatment of high-level scrutiny for dubious decisions in data handling and inclusion of variables.

As a rule, R Squares tend to be higher at higher levels of aggregation, e.g. when analyzing cross-situational tendencies in behavior rather than specific behaviors in specific contexts; or when analyzing time-series data or macro-level data about countries rather than individuals. Why people do the things they do is often just very hard to predict, especially if you try to predict behavior in a specific case.

1 Comment

Filed under academic misconduct, data, methodology, regression analysis, survey research