Twenty Years of Generosity in the Netherlands

In the past two decades, philanthropy in the Netherlands has gained significant attention from the general public, policy makers, and academics. Research on philanthropy in the Netherlands has documented a substantial increase in amounts donated to charitable causes since data on giving became available in the mid-1990s (Bekkers, Gouwenberg & Schuyt, 2017). What has remained unclear, however, is how philanthropy has developed in relation to the growth of the economy at large and the growth of consumer expenditure. For the first time, we bring together all the data on philanthropy available from eleven editions of the Giving in the Netherlands survey among households (n = 17,033) to answer the research question: how can trends in generosity in the Netherlands in the past 20 years be explained?


The Giving in the Netherlands Panel Survey

In 2002, the Giving in the Netherlands survey among households was transformed from a cross-sectional to a longitudinal design (Bekkers, Boonstoppel & De Wit, 2017). One of the strengths of the resulting Giving in the Netherlands Panel Survey (GINPS) is the availability of data on prosocial values and attitudes towards charitable causes. The GIN Panel Survey has been used primarily to answer questions on the development of these values and attitudes in relation to changes in volunteering activities (Bekkers, 2012; Van Ingen & Bekkers, 2015; Bowman & Bekkers, 2009). Here we use the GINPS in a different way. First we describe trends in generosity, i.e. amounts donated as a proportion of income. Then we seek to explain these trends, focusing on prosocial values and attitudes towards charitable causes.


How generous are the Dutch?

Vis-à-vis the rich history of charity and philanthropy in the Netherlands (Van Leeuwen, 2012), the current state of giving is rather poor. On average, charitable donations per household in 2015 amounted to €180 per year, or 0.4% of household income. The median gift is €50 (De Wit & Bekkers, 2017). Over the past fifteen years, the trend in generosity has been downward: amounts donated as a proportion of income have declined slowly but steadily since 1999 (Bekkers, De Wit & Wiepking, 2017). By 2015, giving as a proportion of income had fallen by one-fifth from its 1999 peak (see Figure 1).


Figure 1: Household giving as a proportion of consumer expenditure (Source: Bekkers, De Wit & Wiepking, 2017)


Why has generosity of households in the Netherlands declined?

The first explanation is declining religiosity. Because giving is encouraged by religious communities, the decline of church affiliation and practice has reduced charitable giving, as it has in the US (Wilhelm, Rooney & Tempel, 2007). As religiosity has receded from Dutch society, the non-religious have become more numerous, and charitable giving has declined with them. Figure 2 shows a similar decline in generosity to religion (the red line) and to other organizations (the blue line).


Figure 2: Household giving to religion (red) and to other causes (blue) as a proportion of household income (Source: Bekkers, De Wit & Wiepking, 2017)


We also find that those who are still religious have become much more generous. Figure 3 shows that the amounts donated by Protestants (the green line) have almost doubled in the past 20 years. The amounts donated by Catholics (the red line) have also doubled, but are much lower. The non-religious have not increased their giving at all in the past 20 years. However, the increasing generosity of the religious has not been able to turn the tide.


Figure 3: Household giving by non-religious (blue), Catholics (red) and Protestants (green) in Euros (Source: Bekkers, De Wit & Wiepking, 2017)

The second explanation is that prosocial values have declined. Because generosity depends on empathic concern and moral values such as the principle of care (Bekkers & Ottoni-Wilhelm, 2016), the loss of such prosocial values has reduced generosity. Prosocial values have lost support, and the loss of prosociality explains about 40% of the decline in generosity. The loss of prosocial values itself, however, is closely connected to the disappearance of religion. About two thirds of the decline in empathic concern and three quarters of the decline in altruistic values is explained by the reduction of religiosity.

In addition, we see that prosocial values have also declined among the religious. Figure 4 shows that altruistic values have declined not only for the non-religious (blue), but also for Catholics (red) and Protestants (green).


Figure 4: Altruistic values among the non-religious (blue), Catholics (red) and Protestants (green) (Source: Giving in the Netherlands Panel Survey, 2002-2014).

Figure 5 shows a similar development for generalized social trust.


Figure 5: Generalized social trust among the non-religious (blue), Catholics (red) and Protestants (green)  (Source: Giving in the Netherlands Panel Survey, 2002-2016).

Speaking of trust: as donations to charitable causes rely on a foundation of charitable confidence, it may be argued that the decline of charitable confidence is responsible for the decline in generosity (O’Neill, 2009). However, we find that the decline in generosity is not directly related to the decline in charitable confidence, once changes in religiosity and prosocial values are taken into account. This finding indicates that the decline in charitable confidence is a sign of a broader process of declining prosociality.


What do our findings imply?

What do these findings mean for theories and research on philanthropy and for the practice of fundraising?

First, our research clearly demonstrates the utility of including questions on prosocial values in surveys on philanthropy: they not only have predictive power for generosity and changes therein over time, but also explain the relation of religiosity with generosity.

Second, our findings illustrate the need to develop distinctive theories on generosity. Predictors of levels of giving measured in euros can be quite different from predictors of generosity as a proportion of income.

For the practice of fundraising, our research suggests that the strategies and propositions of charitable causes need modification. Traditionally, fundraising organizations have appealed to empathic concern for recipients and prosocial values such as duty. As these have become less prevalent, propositions appealing to social impact with modest returns on investment may prove more effective.

In addition, fundraising campaigns have in the past been targeted primarily at loyal donors. This strategy has proven effective, and religious donors have shown resilience in their increasing financial commitment to charitable causes. But it is not a feasible long-term strategy, as the size of this group is shrinking. A new strategy is required to commit new generations of donors.




Bekkers, R. (2012). Trust and Volunteering: Selection or Causation? Evidence from a Four Year Panel Study. Political Behavior, 34 (2): 225-247.

Bekkers, R., Boonstoppel, E. & De Wit, A. (2017). Giving in the Netherlands Panel Survey – User Manual, Version 2.6. Center for Philanthropic Studies, VU Amsterdam.

Bekkers, R. & Bowman, W. (2009). The Relationship Between Confidence in Charitable Organizations and Volunteering Revisited. Nonprofit and Voluntary Sector Quarterly, 38 (5): 884-897.

Bekkers, R., De Wit, A. & Wiepking, P. (2017). Jubileumspecial: Twintig jaar Geven in Nederland. In: Bekkers, R. Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.). Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

Bekkers, R. & Ottoni-Wilhelm, M. (2016). Principle of Care and Giving to Help People in Need. European Journal of Personality, 30(3): 240-257.

Bekkers, R., Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.) (2017). Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

De Wit, A. & Bekkers, R. (2017). Geven door huishoudens. In: Bekkers, R., Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.). Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

O’Neill, M. (2009). Public Confidence in Charitable Nonprofits. Nonprofit and Voluntary Sector Quarterly, 38: 237–269.

Van Ingen, E. & Bekkers, R. (2015). Trust Through Civic Engagement? Evidence From Five National Panel Studies. Political Psychology, 36 (3): 277-294.

Van Leeuwen, M. (2012). Giving in early modern history: philanthropy in Amsterdam in the Golden Age. Continuity & Change, 27 (2): 301-343.

Wilhelm, M.O., Rooney, P.M. & Tempel, E.R. (2007). Changes in religious giving reflect changes in involvement: age and cohort effects in religious giving, secular giving, and attendance. Journal for the Scientific Study of Religion, 46 (2): 217-232.



Hunting Game: Targeting the Big Five

Do not use the personality items included in the World Values Survey. That is the recommendation of Steven Ludeke and Erik Gahner Larsen in a recent paper published in the journal Personality and Individual Differences. The journal is owned by Elsevier, so the official publication is paywalled. Still, I am writing about it because the message of the paper is extremely important. Ludeke and Gahner Larsen phrase their recommendation a little more subtly: “we suggest it is thus hard to justify the use of this data in future research.”

What went wrong here? Join me in a hunting game, targeting the Big Five.

The World Values Survey (WVS) is the largest non-commercial survey in the world. It is frequently used in social science research. The most recent edition contained a short, 10-item measure of personality characteristics (BFI-10), validated in a well-cited paper by Rammstedt and John in the Journal of Research in Personality. The inclusion of the BFI-10 enables researchers to study how the Big Five personality traits are related to political participation, happiness, education, and health, among many other things.

So what is wrong with the personality data in the WVS? Ludeke and Gahner Larsen found that the pairs of items designed to measure the five personality traits Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism are not correlated as expected. To measure openness, for instance, the survey asked participants to indicate agreement with the statements “I see myself as someone who: has few artistic interests” and “I see myself as someone who: has an active imagination”. One would expect a negative relation between the responses to the two statements. However, the correlation between the two items across all countries is positive, r = .164. This correlation is not strong, but it is in the wrong direction. Similar discrepancies were found between items designed to measure the other four dimensions of personality.
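The check at the heart of this finding is nothing more than a Pearson correlation between the two openness items. A minimal sketch in Python, on made-up 1-5 agreement scores rather than the actual WVS data:

```python
# Sanity check for a reverse-keyed item pair: before recoding, the two
# openness items should correlate negatively. The response data below
# are invented for illustration; they are not WVS records.
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical 1-5 agreement scores for ten respondents.
o1r_few_artistic = [4, 5, 2, 4, 3, 5, 1, 4, 5, 2]
o2_imagination   = [2, 1, 4, 2, 3, 2, 5, 2, 1, 4]

r = pearson_r(o1r_few_artistic, o2_imagination)
print(round(r, 3))  # clearly negative, as the BFI-10 design expects
```

In the WVS data this correlation comes out positive (r = .164), which is exactly the anomaly Ludeke and Gahner Larsen report.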

The BFI-10 included in the WVS is this set of statements (an r indicates a reverse-scored item):

I see myself as someone who:

  • is reserved (E1r)
  • is generally trusting (A1)
  • tends to be lazy (C1r)
  • is relaxed, handles stress well (N1r)
  • has few artistic interests (O1r)
  • is outgoing, sociable (E2)
  • tends to find fault with others (A2r)
  • does a thorough job (C2)
  • gets nervous easily (N2)
  • has an active imagination (O2)
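Before any analysis, the five reverse-keyed items (marked r) have to be recoded so that a high score always indicates more of the trait. A sketch of that step, assuming a 1-5 agreement scale (the scale range and variable names are my illustration, not the WVS codebook):

```python
# Reverse-scoring on an assumed 1-5 Likert scale: 1<->5, 2<->4, 3 stays
# 3. After recoding, items keyed in opposite directions should
# correlate positively if they measure the same trait.

def reverse_score(score, scale_min=1, scale_max=5):
    """Flip a reverse-keyed item score."""
    return scale_max + scale_min - score

respondent = {"E1r": 2, "E2": 4}  # one hypothetical respondent
extraversion = (reverse_score(respondent["E1r"]) + respondent["E2"]) / 2
print(extraversion)  # 4.0
```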

In a factor analysis of the 10 items, we would expect to find the five dimensions. However, that is not the result of an exploratory factor analysis applying the conventional criterion of an eigenvalue > 1. In this analysis and all following analyses, negative items are reverse-scored. Including all countries, a three-factor solution emerges that is very difficult to interpret. Multiple items show high loadings on multiple factors. Removing these one by one, as is usually done in inventories with large numbers of items, we are left with a two-factor solution. If a five-factor solution is forced, we obtain the following component matrix. This is a mess.



Table: Component matrix of the forced five-factor solution, all countries. Items load substantially on multiple components; N1 relaxed (r), for instance, loads -.46, -.38, .24 and .53 on four different components.


So what is wrong with these data?

Upon closer inspection, Ludeke and Gahner Larsen found that the correlations were markedly different across countries. Bahrain is a clear outlier. The weakly positive correlation between the two openness items is due in part to the inclusion of data from Bahrain. Without this country, the correlation is only .135. Still positive, but not as strong. The data for Bahrain are not only strange for openness, but also for the other factors. In the table below I have computed the correlations among recoded items for the five dimensions.

Without Bahrain, the correlations are still strange, but a little less strange.


Table: Correlations between the item pairs for each of the five dimensions, computed with and without Bahrain.


What is wrong with the data for Bahrain? The patterns of responses for cases from Bahrain, it turns out, are surprisingly often a series of ten identical values, such as 1111111111 or 5555555555. I routinely check data from surveys for such patterns. While it is impossible to prove this, serial response patterns suggest fabrication of data. Participants and/or interviewers rushing through the questions may produce such patterns. Almost half of all the cases from Bahrain follow such a pattern. Other countries with a relatively high proportion of serial pattern responses are South Africa, Singapore, and China. The two countries for which the BFI-10 behaves close to what previous research has reported, the Netherlands and Germany, have a very low occurrence of serial pattern responses.
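The routine check described above is easy to implement; a minimal sketch (the case data here are invented, not actual WVS records):

```python
# Flag cases whose ten BFI-10 responses are a single repeated value
# (straight-lining), such as 1111111111 or 5555555555.

def is_serial_pattern(item_scores):
    """True if every item received the same, non-missing response."""
    scores = [s for s in item_scores if s is not None]
    return len(scores) == len(item_scores) and len(set(scores)) == 1

cases = [
    [1] * 10,                        # straight-liner
    [5] * 10,                        # straight-liner
    [2, 4, 1, 5, 3, 2, 4, 1, 5, 3],  # plausible respondent
]
flagged = sum(is_serial_pattern(c) for c in cases)
print(f"{flagged} of {len(cases)} cases flagged")  # 2 of 3 cases flagged
```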

Table: Number of serial pattern responses by country.
Even without the data for Bahrain and the serial responses from all other countries, however, the factor structure is err…not what one would expect. Still a mess.



Table: Component matrix of the forced five-factor solution, excluding Bahrain and all serial pattern responses. Cross-loadings remain substantial; A2 fault with others (r), for instance, loads -.41 and .63 on two different components.


Only for Germany and the Netherlands is the factor structure somewhat in line with previous research. Here is the solution for the two countries combined. In both countries, the two statements for agreeableness do not correlate as expected. Also, the second statement for conscientiousness (thorough) has a cross-loading with one of the agreeableness items (trusting).



Table: Component matrix of the five-factor solution for Germany and the Netherlands combined.


This leaves us with three possibilities.

One possibility was raised by Christopher Soto on Twitter: acquiescence bias could be driving the results. In a study using data from another multi-country survey, the International Social Survey Program (ISSP), Rammstedt, Kemper & Borg subtracted each respondent’s mean response across all BFI-10 items from his or her score on each single item. Doing this, however, does not clear the sky. Looking again at the correlations for the pairs of items measuring the same constructs, we see that they are not ‘better’ in the second row. If anything, they are less positive.
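The correction Rammstedt, Kemper & Borg applied amounts to simple within-person centering. A sketch on made-up scores (not ISSP or WVS data):

```python
# Acquiescence correction: subtract each respondent's mean agreement
# across all items from every single item score, removing a general
# yes-saying tendency before computing correlations.

def center_within_person(item_scores):
    """Item scores minus the respondent's own mean."""
    mean = sum(item_scores) / len(item_scores)
    return [s - mean for s in item_scores]

# A hypothetical yes-sayer who agrees with nearly everything.
yes_sayer = [5, 5, 4, 5, 5, 4, 5, 5, 5, 5]
corrected = center_within_person(yes_sayer)
print([round(s, 1) for s in corrected])
```

After centering, the corrected scores of every respondent sum to zero, so what remains is only the relative pattern of answers.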






Table: Correlations between the item pairs measuring the same trait, for raw scores (first row) and acquiescence-corrected scores (second row).

Also the factor structure of the attenuated scores is not anything like the ‘regular’ five-factor structure. Still a mess.



Table: Component matrix of the five-factor solution for the acquiescence-corrected scores.


The second possibility is that things went wrong in the translation of the questionnaire. The same adjectives or statements may mean different things in different countries or languages, which makes them useless as operationalizations of the same underlying construct. It will require a detailed study of the translations to see if anything went wrong. The questionnaires are available at the World Values Survey website. The Dutch questionnaire is good. I looked at a few other languages. The Spanish questionnaire for Ecuador also seems right. “Me veo como alguien que…… es confiable” is quite close to “I see myself as someone who is… generally trusting”. My Spanish is not very good though. Rene Gempp wrote on Twitter that the BFI-10 is a Likert-type scale, but the Spanish translation asks about the frequency, and one of the options, “para nada frecuentemente” is *very* confusing in Spanish.

I am not sure about your fluency in Kinyarwanda, the language spoken in Rwanda, but the backtranslation of the questionnaire in English does not give me much confidence. Apparently, “…wizera muri rusange” is the translation of “is generally trusting”. The backtranslation is “…believe in congregation”.


The third possibility is that personality structure may indeed be different in different countries. This would be the most problematic one.

Data from the 2010 AmericasBarometer Study, conducted by the Latin American Public Opinion Project (LAPOP), support this interpretation. The survey included a different short form of the Big Five, the TIPI, developed by Gosling, Rentfrow, and Swann. A recent study by Weinschenk published in Social Science Quarterly shows that personality scores based on the TIPI are hardly related to turnout in elections in the Americas. This result may be logical in countries where voting is mandatory, such as Brazil. But the more disconcerting methodological problem is that the Big Five are not reliably measured with pairs of statements in most of the countries included in the survey. Here are the correlations between the pairs of items for each of the five dimensions, taken from the supplementary online materials of the Weinschenk paper.


The graphs show that the TIPI items only work well in the US and Canada – the two ‘WEIRD’ countries in the study. In Brazil, to take one example, the correlations are <.10 for extraversion, agreeableness and conscientiousness, and lower than .25 for emotional stability and openness.

Back to the WVS case, which raises important questions about the peer review process. Two journal articles based on the WVS (here and here) were able to pass peer review because neither the reviewers nor the editors asked questions about the reliability of the items being used. Neither, apparently, did the authors check. Obviously, researchers should check the reliability of measures they use in an analysis. In case authors fail to do so, reviewers and editors should ask. Weinschenk reported the low correlations in the online supplementary materials, but did not report reliability coefficients in the paper.

The good thing is that because the WVS is in the public domain, these problems came to light relatively quickly. Of course, they could have been avoided if the WVS had scrutinized the reliability of the measure before putting the data online, if the authors of the papers using the data had checked the reliability of the items, or if the reviewers and editors had asked the right questions. Another good thing is that the people (volunteers?) behind the WVS twitter account have been frank in tweeting about the problems found in the data.

Summing up:

  1. We still do not know why the BFI-10 measure of the Big Five personality does not perform as in previous research.
  2. It is probably not due to acquiescence bias. Translations may be problematic for some countries.
  3. Do not use the WVS BFI-10 data from countries other than Germany and the Netherlands.
  4. Treat the WVS data from Bahrain with great caution, and to be on the safe side, just exclude it from your analyses.
  5. The reliability of short Big Five measures is very low in non-WEIRD countries.

The code for the analyses reported in this blog is posted at the Open Science Framework.

Update 22 March 2017. The factor loadings in the table with the results of the analysis of attenuated scores has been updated. The table displayed previously was based on a division of the original scores by the total agreement scores. Rammstedt et al. subtracted the original scores from the total agreement scores. The results of the new analysis are close to the previous one and still confusing. The code on the OSF has been updated. Also a clarification was added that the negative items used in the factor analyses were all recoded such that they scored positively (HT to Christopher Soto).



Five Reasons Why Social Science is So Hard 

1. No Laws 

All we have is probabilities. 

2. All Experts 

The knowledge we have is continuously contested. The objects of study think they know why they do what they do. 

3. Zillions of Variables 

Everything is connected, and potentially a cause – like a bowl of well-tossed spaghetti. 

4. Many Levels of Action 

Nations, organizations, networks, individuals, time all have different dynamics. 

5. Imprecise Measures 

Few instruments have near perfect validity and reliability. 


Social science is not as easy as rocket science. It is way more complicated.


Tools for the Evaluation of the Quality of Experimental Research

pdf of this post

Experiments can have important advantages over other research designs. The most important advantage of experiments concerns internal validity. Random assignment to treatment reduces the attribution problem and increases the possibilities for causal inference. An additional advantage is that control over participants reduces the heterogeneity of the treatment effects observed.

The extent to which these advantages are realized in the data depends on the design and execution of the experiment. Experiments have a higher quality if the sample size is larger and the theoretical concepts are measured more reliably and with a higher validity. The sufficiency of the sample size can be checked with a power analysis. Most effect sizes in the social sciences are small (d = 0.2); detecting such an effect at conventional significance levels (p < .05) with 95% power requires a sample of 1300 participants (see appendix). Even for a stronger effect size (d = 0.4), more than 300 participants are required. The reliability of normative scale measures can be judged with Cronbach’s alpha. A rule of thumb for unidimensional scales is that alpha should be at least .63 for a scale consisting of 4 items, .68 for 5 items, .72 for 6 items, .75 for 7 items, and so on. The validity of measures should be justified theoretically and can be checked with a manipulation check, which should reveal a sizeable and significant association with the treatment variables.
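Both rules of thumb above can be reproduced with a few lines of standard-library Python. This is a sketch: the sample-size formula is the usual two-group normal approximation for an independent-samples comparison, and the alpha thresholds follow from the Spearman-Brown prediction under an assumed mean inter-item correlation of about .30; an exact power analysis may differ by a few participants.

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(d, alpha=0.05, power=0.95):
    """Per-group n to detect effect size d with a two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

def alpha_from_mean_r(k, mean_r=0.30):
    """Spearman-Brown prediction of Cronbach's alpha for k items."""
    return k * mean_r / (1 + (k - 1) * mean_r)

print(2 * required_n_per_group(0.2))   # total N for d = 0.2: 1300
print(2 * required_n_per_group(0.4))   # total N for d = 0.4: 326
print(round(alpha_from_mean_r(4), 2))  # 0.63 for a 4-item scale
```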

The advantages of experiments are reduced if assignment to treatment is non-random and treatment effects are confounded. In addition, a variety of other problems may endanger internal validity. Shadish, Cook & Campbell (2002) provide a useful list of such problems.

Also it should be noted that experiments can have important disadvantages. The most important disadvantage is that the external validity of the findings is limited to the participants in the setting in which their behavior was observed. This disadvantage can be avoided by creating more realistic decision situations, for instance in natural field experiments, and by recruiting (non-‘WEIRD’) samples of participants that are more representative of the target population. As Henrich, Heine & Norenzayan (2010) noted, results based on samples of participants in Western, Educated, Industrialized, Rich and Democratic (WEIRD) countries have limited validity in the discovery of universal laws of human cognition, emotion or behavior.

Recently, experimental research paradigms have received fierce criticism. Results of research often cannot be reproduced (Open Science Collaboration, 2015), and publication bias is ubiquitous (Ioannidis, 2005). It has become clear that there is a lot of undisclosed flexibility in all phases of the empirical cycle. While these problems have been discussed widely in communities of researchers conducting experiments, they are by no means limited to one particular methodology or mode of data collection. It is likely that they also occur in communities of researchers using survey or interview data.

In the positivist paradigm that dominates experimental research, the empirical cycle starts with the formulation of a research question. To answer the question, hypotheses are formulated based on established theories and previous research findings. Then the research is designed, data are collected, a predetermined analysis plan is executed, results are interpreted, the research report is written and submitted for peer review. After the usual round(s) of revisions, the findings are incorporated in the body of knowledge.

The validity and reliability of results from experiments can be compromised in two ways. The first is by juggling with the order of phases in the empirical cycle. Researchers can decide to amend their research questions and hypotheses after they have seen the results of their analyses. Kerr (1998) labeled the practice of reformulating hypotheses HARKing: Hypothesizing After the Results are Known. Amending hypotheses is not a problem when the goal of the research is to develop theories to be tested later, as in grounded theory or exploratory analyses (e.g., data mining). But in hypothesis-testing research, HARKing is a problem, because it increases the likelihood of publishing false positives. Chance findings are interpreted post hoc as confirmations of hypotheses that a priori are rather unlikely to be true. When these findings are published, they are unlikely to be reproducible by other researchers, creating research waste and, worse, reducing the reliability of published knowledge.

The second way the validity and reliability of results from experiments can be compromised is by misconduct and sloppy science within various stages of the empirical cycle (Simmons, Nelson & Simonsohn, 2011). The data collection and analysis phase as well as the reporting phase are most vulnerable to distortion by fraud, p-hacking and other questionable research practices (QRPs).

  • In the data collection phase, observations that (if kept) would lead to undesired conclusions or non-significant results can be altered or omitted. Also, fake observations can be added (fabricated).
  • In the analysis of data, researchers can try alternative specifications of the variables, scale constructions, and regression models, searching for those that ‘work’ and choosing those that reach the desired conclusion.
  • In the reporting phase, things go wrong when the search for alternative specifications and the sensitivity of the results with respect to decisions in the data analysis phase is not disclosed.
  • In the peer review process, there can be pressure from editors and reviewers to cut reports of non-significant results, or to collect additional data supporting the hypotheses and the significant results reported in the literature.

The results of these QRPs are that null findings are less likely to be published, that published research is biased towards positive findings confirming the hypotheses, that published findings are not reproducible, and that when a replication attempt is made, the published findings turn out to be less significant, less often positive, and of a lower effect size (Open Science Collaboration, 2015).
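A toy simulation, under deliberately simplified assumptions, illustrates why this flexibility inflates false positives: a “researcher” measures two independent outcomes under a true null and reports whichever test gives the smaller p-value. The nominal 5% error rate then rises toward 1 - 0.95² ≈ 9.75%.

```python
# Simulate selective reporting across two outcome variables under a
# true null effect, using a two-sided z-test with known sd = 1.
import random
from math import sqrt
from statistics import NormalDist

def p_value(sample):
    """Two-sided z-test of mean = 0, known sd = 1."""
    z = (sum(sample) / len(sample)) * sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
n_sims, n_obs, hits = 4000, 20, 0
for _ in range(n_sims):
    outcome_a = [random.gauss(0, 1) for _ in range(n_obs)]
    outcome_b = [random.gauss(0, 1) for _ in range(n_obs)]
    if min(p_value(outcome_a), p_value(outcome_b)) < 0.05:
        hits += 1  # a "significant" finding despite a true null
print(f"false-positive rate: {hits / n_sims:.3f}")  # near 0.10, not 0.05
```

With more outcomes, covariate sets, and exclusion rules to choose from, the inflation gets correspondingly worse.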

Alarm bells, red flags and other warning signs

Some of the forms of misconduct mentioned above are very difficult to detect for reviewers and editors. When observations are fabricated or omitted from the analysis, only inside information, very sophisticated data detectives and stupidity of the authors can help us. Also many other forms of misconduct are difficult to prove. While smoking guns are rare, we can look for clues. I have developed a checklist of warning signs and good practices that editors and reviewers can use to screen submissions (see below). The checklist uses terminology that is not specific to experiments, but applies to all forms of data. While a high number of warning signs in itself does not prove anything, it should alert reviewers and editors. There is no norm for the number of flags. The table below only mentions the warning signs; the paper version of this blog post also shows a column with the positive poles. Those who would like to count good practices and reward authors for a higher number can count gold stars rather than red flags. The checklist was developed independently of the checklist that Wicherts et al. (2016) recently published.

Warning signs

  • The power of the analysis is too low.
  • The results are too good to be true.
  • All hypotheses are confirmed.
  • P-values are just below critical thresholds (e.g., p < .05).
  • A groundbreaking result is reported but not replicated in another sample.
  • The data and code are not made available upon request.
  • The data are not made available upon article submission.
  • The code is not made available upon article submission.
  • Materials (manipulations, survey questions) are described superficially.
  • Descriptive statistics are not reported.
  • The hypotheses are tested in analyses with covariates and results without covariates are not disclosed.
  • The research is not preregistered.
  • No details of an IRB procedure are given.
  • Participant recruitment procedures are not described.
  • Exact details of time and location of the data collection are not described.
  • A power analysis is lacking.
  • Unusual / non-validated measures are used without justification.
  • Different dependent variables are analyzed in different studies within the same article without justification.
  • Variables are (log)transformed or recoded in unusual categories without justification.
  • Numbers of observations mentioned at different places in the article are inconsistent. Loss or addition of observations is not justified.
  • A one-sided test is reported when a two-sided test would be appropriate.
  • Test-statistics (p-values, F-values) reported are incorrect.

With the growing number of retractions of articles reporting on experimental research published in scholarly journals, awareness of the fallibility of peer review as a quality control mechanism has increased. Communities of researchers employing experimental designs have formulated solutions to these problems. In the review and publication stage, the following solutions have been proposed.

  • Access to data and code. An increasing number of science funders require grantees to provide open access to the data and the code that they have collected. Likewise, authors are required to provide access to data and code at a growing number of journals, such as Science, Nature, and the American Journal of Political Science. Platforms such as Dataverse, the Open Science Framework and Github facilitate sharing of data and code. Some journals do not require access to data and code, but provide Open Science badges for articles that do provide access.
  • Pledges, such as the ‘21 word solution’, a statement designed by Simmons, Nelson and Simonsohn (2012) that authors can include in their paper to ensure they have not fudged the data: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”
  • Full disclosure of methodological details of research submitted for publication, for instance through org, is now required by major journals in psychology.
  • Apps such as Statcheck, p-curve, p-checker, and r-index can help editors and reviewers detect fishy business. They also have the potential to improve research hygiene when researchers start using these apps to check their own work before they submit it for review.
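The core of such a check is simple arithmetic: a reported p-value either matches its test statistic or it does not. Below is a minimal sketch in Python of the kind of consistency check Statcheck automates, here for a standard-normal (z) test statistic; the function names and tolerance are illustrative assumptions, not Statcheck's actual API:

```python
from math import erfc, sqrt

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard-normal test statistic z."""
    return erfc(abs(z) / sqrt(2.0))

def consistent(reported_p: float, z: float, tol: float = 0.001) -> bool:
    """Check whether a reported p-value matches its test statistic."""
    return abs(two_sided_p_from_z(z) - reported_p) < tol

print(consistent(0.05, 1.96))  # True: p = .05 matches z = 1.96
print(consistent(0.03, 1.96))  # False: recomputed p is ~.05, a red flag
```

The same logic extends to t- and F-statistics, where a mismatch between the statistic, the degrees of freedom, and the reported p-value is exactly the kind of red flag listed above.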

As these solutions become more widely used, we should see the quality of research go up: the number of red flags should decrease and the number of gold stars should increase. This requires not only that reviewers and editors use the checklist, but most importantly that researchers themselves use it as well.

The solutions above should be supplemented by better research practices before researchers submit their papers for review. In particular, two measures are worth mentioning:

  • Preregistration of research, for instance on org. An increasing number of journals in psychology require research to be preregistered. Some journals guarantee publication of research regardless of its results after a round of peer review of the research design.
  • Increasing the statistical power of research is one of the most promising strategies to increase the quality of experimental research (Bakker, Van Dijk & Wicherts, 2012). In many fields and for many decades, published research has been underpowered, using samples of participants that are not large enough to detect the reported effect sizes. Using larger samples reduces the likelihood of both false positives and false negatives.
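As a rough illustration of what "large enough" means, the required sample size can be approximated with the standard normal approximation for a two-sample comparison. The function below is a sketch, not a substitute for a proper power analysis tool; the defaults (alpha = .05, power = .80) are conventional assumptions:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for a two-sample comparison
    of standardized effect size d (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.4))  # 99 participants per group
print(n_per_group(0.2))  # 393: halving d roughly quadruples the sample
```

The quadratic dependence on the effect size is why small-effect research with small samples is so often underpowered.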

A variety of institutional designs have been proposed to encourage the use of the solutions mentioned above: removing career incentives that reward questionable research practices in hiring and promotion decisions, rewarding researchers for good conduct through badges, adopting voluntary codes of conduct, and socializing students and senior staff through teaching and workshops. Research funders, journals, editors, authors, reviewers, universities, senior researchers and students all have a responsibility in these developments.


Bakker, M., Van Dijk, A. & Wicherts, J. (2012). The Rules of the Game Called Psychological Science. Perspectives on Psychological Science, 7(6): 543–554.

Henrich, J., Heine, S.J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33: 61–135.

Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8): e124.

Kerr, N.L. (1998). HARKing: Hypothesizing After Results are Known. Personality and Social Psychology Review, 2: 196-217.

Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science, 349(6251): aac4716.

Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22: 1359–1366.

Simmons, J.P., Nelson, L.D. & Simonsohn, U. (2012). A 21 Word Solution. Available at SSRN.

Wicherts, J.M., Veldkamp, C.L., Augusteijn, H.E., Bakker, M., Van Aert, R.C. & Van Assen, M.A.L.M. (2016). Researcher degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7: 1832.


Filed under academic misconduct, experiments, fraud, incentives, open science, psychology, survey research

Introducing Mega-analysis

How to find truth in an ocean of correlations – with breakers, still waters, tidal waves, and undercurrents? In the old age of responsible research and publication, we would collect estimates reported in previous research, and compute a correlation across correlations. Those days are long gone.

In the age of rat race research and publication it became increasingly difficult to do a meta-analysis. Anyone who has conducted one knows how frustrating the experience is: endless searches on the Web of Science and Google Scholar to collect all published research, entering the estimates in a database, finding that a lot of fields are blank, emailing authors for zero-order correlations and other statistics they failed to report in their publications, and getting very little response.

Meta-analysis is not only a frustrating experience, it is also a bad idea when results that authors do not like do not get published. A host of techniques has been developed to detect and correct publication bias, but the underlying problem, that we do not know the results that never get reported, is not easily solved.

As we enter the age of open science, we no longer have to rely on the far from perfect cooperation of colleagues who have moved to a different university, left academia, died, or think you’re trying to prove them wrong and destroy their career. We can simply download all the raw data and analyze them.

Enter mega-analysis: include all the data points relevant for a certain hypothesis, cluster them by original publication, date, country, or any potentially relevant property of the research design, and add the substantial predictors you find documented in the literature. The results reveal not only the underlying correlations between substantial variables, but also the differences between studies, periods, countries and design properties that affect these correlations.
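A minimal sketch of the contrast between the two approaches, using hypothetical raw data from three small studies (all study names and numbers below are made up for illustration): a meta-analysis would aggregate the per-study correlations, while a mega-analysis pools the raw data points:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical raw data from three studies measuring the same two variables.
studies = {
    "study_A": ([1, 2, 3, 4], [1.1, 2.3, 2.9, 4.2]),
    "study_B": ([2, 3, 5, 7], [1.8, 3.4, 4.6, 7.1]),
    "study_C": ([1, 4, 6, 8], [2.0, 3.9, 6.2, 7.8]),
}

# Per-study estimates: what a meta-analysis would aggregate.
for name, (xs, ys) in studies.items():
    print(name, round(pearson(xs, ys), 3))

# One pooled estimate over all raw data points: the mega-analysis.
all_x = [x for xs, _ in studies.values() for x in xs]
all_y = [y for _, ys in studies.values() for y in ys]
print("pooled", round(pearson(all_x, all_y), 3))
```

In a real mega-analysis one would not pool naively as done here, but add study indicators (fixed or random effects) to separate within-study from between-study variation, as the paragraph above describes.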

The method itself is not new. In epidemiology, Steinberg et al. (1997) labeled it ‘meta-analysis of individual patient data’. In human genetics, genome-wide association studies (GWAS) by large international consortia are common examples of mega-analysis.

Mega-analysis includes the file-drawer of papers that never saw the light of day after they were put in. It also includes the universe of papers that have never been written because the results were unpublishable.

If meta-analysis gives you an estimate for the universe of published research, mega-analysis can be used to detect just how unique that universe is in the Milky Way. My prediction would be that correlations in published research are mostly further from zero than the same correlations in a mega-analysis.
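This prediction is easy to illustrate with a simulation: if only ‘significant’ estimates get published, the published record overstates the true effect. A sketch under assumed parameters (a true effect of 0.2, and 5,000 small studies of 20 observations each; all numbers are made up):

```python
import random

random.seed(42)

TRUE_EFFECT = 0.2
N = 20               # observations per study
SE = 1 / N ** 0.5    # standard error of a study's estimate

def run_study():
    """One small study: the estimated effect is the mean of N noisy draws."""
    return sum(random.gauss(TRUE_EFFECT, 1) for _ in range(N)) / N

estimates = [run_study() for _ in range(5000)]
# Publication filter: only estimates that reach |z| > 1.96 see print.
published = [e for e in estimates if abs(e) / SE > 1.96]

mean_all = sum(estimates) / len(estimates)
mean_pub = sum(published) / len(published)
print(f"true effect:               {TRUE_EFFECT}")
print(f"mean of all studies:       {mean_all:.3f}")  # close to the truth
print(f"mean of published studies: {mean_pub:.3f}")  # clearly inflated
```

The published estimates are further from zero than the full set, which is exactly the gap a mega-analysis of all raw data would reveal.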

Mega-analysis bears great promise for the social sciences. Samples for population surveys are large, which enables optimal learning from variations in sampling procedures, data collection mode, and questionnaire design. It is time for a Global Social Science Consortium that pools all of its data. As an illustration, I have started a project on the Open Science Framework that mega-analyzes generalized social trust. It is a public project: anyone can contribute. We have reached the mark of 1 million observations.

The idea behind mega-analysis originated from two different projects. In the first project, Erik van Ingen and I analyzed the effects of volunteering on trust, to check whether results from an analysis of the Giving in the Netherlands Panel Survey (Van Ingen & Bekkers, 2015) would replicate with data from other panel studies. We found essentially the same results in five panel studies, although subtle differences emerged in the quantitative estimates. In the second project, with Arjen de Wit and colleagues from the Center for Philanthropic Studies at VU Amsterdam, we analyzed the effects of volunteering on well-being as part of the EC-FP7 funded ITSSOIN study. We collected 845,733 survey responses from 154,970 different respondents in six panel studies, spanning 30 years (De Wit, Bekkers, Karamat Ali & Verkaik, 2015). We found that volunteering is associated with a 1% increase in well-being.

In these projects, the data from different studies were analyzed separately. I realized that we could learn much more if the data are pooled in one single analysis: a mega-analysis.


De Wit, A., Bekkers, R., Karamat Ali, D., & Verkaik, D. (2015). Welfare impacts of participation. Deliverable 3.3 of the project: “Impact of the Third Sector as Social Innovation” (ITSSOIN), European Commission – 7th Framework Programme, Brussels: European Commission, DG Research.

Van Ingen, E. & Bekkers, R. (2015). Trust Through Civic Engagement? Evidence From Five National Panel Studies. Political Psychology, 36(3): 277-294.

Steinberg, K.K., Smith, S.J., Stroup, D.F., Olkin, I., Lee, N.C., Williamson, G.D. & Thacker, S.B. (1997). Comparison of Effect Estimates from a Meta-Analysis of Summary Data from Published Studies and from a Meta-Analysis Using Individual Patient Data for Ovarian Cancer Studies. American Journal of Epidemiology, 145: 917-925.


Filed under data, methodology, open science, regression analysis, survey research, trends, trust, volunteering

Found: student assistant for the Giving in the Netherlands study

The Center for Philanthropic Studies at the Faculty of Social Sciences of Vrije Universiteit Amsterdam is the center of expertise for research on philanthropy in the Netherlands. The group studies questions such as: Why do people voluntarily give money to charitable causes? Why do people volunteer? How much money circulates in the philanthropic sector? For the Giving in the Netherlands study, the group has hired Suzanne Felix as a research assistant.



Giving in the Netherlands is one of the group’s most important research projects. Since 1995, the giving behavior of households, individuals, foundations, corporations and charity lotteries has been mapped every two years and aggregated into a macro-economic overview. The Center for Philanthropic Studies publishes the results of the study biennially in the book ‘Geven in Nederland’. Felix will contribute to the research on bequests and on gifts by endowed foundations and households.

Update: 3 September 2016


Filed under Uncategorized

Brief guide to understand fMRI studies

RQ: Which regions of the brain are active when task X is performed?

Results: Activity in some regions Y is higher than in others.


Filed under Uncategorized