Category Archives: survey research

The force of everyday philanthropy

Public debates on philanthropy link charitable giving to wealth. In the media we hear a lot about the giving behavior of billionaires – about the Giving Pledge, the charitable foundations of the wealthy, how the causes they support align with their business interests, and how they relate to government programs. Yes – the billions of tech giants go a long way. Imagine a world without support from foundations created by the wealthy. But we hear a lot less about the everyday philanthropy of people like you and me. The media rarely report on everyday acts of generosity. Yet the force of philanthropy lies not only in its focus and mass, but also in its breadth and popularity.

It is one of the common remarks I hear when family, friends and colleagues return from holidays in ‘developing countries’ like Moldova, Myanmar or Morocco: “the people there have nothing, but they are so kind and generous!” The kindness and generosity we witness as tourists are manifestations of prosociality, the very same spirit that is the ultimate foundation of everyday philanthropy. Within our own nations, too, we find that most people give to charity. Why are people in Europe so strongly engaged in philanthropy?

The answer is trust

In Europe we are much more likely to think that most people can be trusted than in other parts of the world. It is this faith in humanity that is crucial for philanthropy. We can see this in a comparison of countries within Europe. The figure combines data from the World Giving Index reports of CAF from 2010-2017 on the proportion of the population giving to charity with data from the Global Trust Research Consortium on generalized social trust. The figure shows that citizens of more trusting countries in Europe are much more likely to give to charities (you can get the data here, and the code is here). The correlation is .52, which is strong.
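For readers who want to reproduce the country-level correlation, here is a minimal sketch of the computation. The file and column names are hypothetical placeholders; the actual data and code are linked above.

```python
# Sketch of the country-level correlation between giving and trust.
# File and column names are hypothetical; see the links above for the real data and code.
import pandas as pd

giving = pd.read_csv("caf_world_giving_index_2010_2017.csv")  # columns: country, pct_giving
trust = pd.read_csv("generalized_social_trust.csv")           # columns: country, pct_trusting

eu = giving.merge(trust, on="country", how="inner")

# Pearson correlation between the share giving to charity and generalized trust
r = eu["pct_giving"].corr(eu["pct_trusting"])
print(f"correlation between giving and trust: {r:.2f}")       # reported above as .52
```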

[Figure: share of the population giving to charity by generalized social trust, European countries]

Egalité et fraternité

One of the reasons why citizens in more trusting countries are more likely to give to charity is that trust is lower in more unequal countries. Combining the data on trust with OECD data on income inequality (Gini) reveals a substantial negative correlation of -.37. The larger the differences in income and wealth in a country, the lower the level of trust that people have in each other. As the wealth of the rich increases, the poor grow more envious, and the rich feel a stronger urge to protect their wealth. In such a context, conspiracy theories thrive and institutions that should be impartial and fair to all are trusted less. The criticism that wealthy donors face also stems from this foundation: those concerned with equality and fairness fear the elite power of philanthropy. Et voilà: here is the case for why it is in the best interest of foundations to reduce inequality.



Filed under data, Europe, household giving, philanthropy, survey research, trust, wealth

What is normal?

Does the average Dutch household really give 559 euros per year to charity, as Arnon Grunberg wrote yesterday on the front page of de Volkskrant?

No, that is unlikely. Grunberg was referring to a figure mentioned in the HUMAN television program ‘Hoe normaal ben jij?’ (‘How normal are you?’).

The figure is wrong for two reasons.

1. The amount is much higher than what other research on philanthropy shows. The HUMAN figure comes from a survey that is probably not representative of all Dutch citizens. HUMAN provides no information about how the poll was conducted, but it is most likely a convenience sample: anyone can participate on the website. Those who do are almost never representative of the Dutch population.

The standard study of philanthropy, Geven in Nederland (Giving in the Netherlands, GIN), has been conducted by the Vrije Universiteit Amsterdam since 1995. It provides a transparent, representative picture. According to the latest edition of GIN, from 2017, households give 341 euros on average.

2. The figure is an average, and the average is not ‘normal’. If you compute the arithmetic mean over all Dutch households, you do not see what the typical Dutch household gives. Half of all Dutch households give less than 60 euros, according to GIN. The mean is strongly influenced by a small number of households that give very large amounts. You can use the chart to see how normal you are: if you give between €150 and €200 per year, you belong to the third quartile, the group of roughly a quarter of the population that gives more than half of all Dutch households. The top quarter of donors often gives more than €1,000.
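To see why the mean and the median diverge so strongly for giving, here is a toy illustration with synthetic numbers (not the GIN data): a right-skewed distribution pushes the mean far above what the typical household gives.

```python
# Toy illustration (synthetic numbers, not the GIN data) of a right-skewed giving distribution
import numpy as np

rng = np.random.default_rng(42)
donations = rng.lognormal(mean=4.0, sigma=1.3, size=10_000)  # many small gifts, a few very large ones

print(f"mean:   {donations.mean():6.0f}")                    # pulled up by a handful of big donors
print(f"median: {np.median(donations):6.0f}")                # closer to what the typical household gives
print("quartile boundaries:", np.percentile(donations, [25, 50, 75]).round())
```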

[Figure: quartiles of annual household giving in the Netherlands, Giving in the Netherlands 2017]

 


Filed under Center for Philanthropic Studies, household giving, survey research, Uncategorized

Research internship @VU Amsterdam

Social influences on prosocial behaviors and their consequences

While self-interest and prosocial behavior are often pitted against each other, it is clear that much charitable giving and volunteering for good causes is motivated by non-altruistic concerns (Bekkers & Wiepking, 2011). Helping others by giving and volunteering feels good (Dunn, Aknin & Norton, 2008). What is the contribution of such helping behaviors to happiness?

The effect of helping behavior on happiness is easily overestimated using cross-sectional data (Aknin et al., 2013). Experiments provide the best way to eliminate selection bias from causal estimates. Monozygotic twins provide a nice natural experiment for investigating unique environmental influences on prosocial behavior and its consequences for happiness, health, and trust: any differences within twin pairs cannot be due to additive genetic effects or shared environmental effects. Previous research has investigated environmental influences of education and religion on giving and volunteering (Bekkers, Posthuma and Van Lange, 2017), but no study has investigated the effects of helping behavior on important outcomes such as trust, health, and happiness.
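As an illustration of the discordant twin design described above, here is a minimal sketch. The data file and variable names (pair_id, twin, volunteering, happiness) are hypothetical placeholders, not the actual MIDUS or TwinLife variables.

```python
# Sketch of a monozygotic twin difference design: within-pair differences remove
# additive genetic and shared environmental influences by construction.
# File and variable names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

twins = pd.read_csv("mz_twins.csv")  # columns: pair_id, twin (1 or 2), volunteering, happiness

wide = twins.pivot(index="pair_id", columns="twin", values=["volunteering", "happiness"])
diffs = pd.DataFrame({
    "d_volunteering": wide[("volunteering", 1)] - wide[("volunteering", 2)],
    "d_happiness": wide[("happiness", 1)] - wide[("happiness", 2)],
}).dropna()

# Regress the within-pair difference in happiness on the difference in volunteering
model = smf.ols("d_happiness ~ d_volunteering", data=diffs).fit()
print(model.summary())
```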

The Midlife in the United States (MIDUS) and the German Twinlife surveys provide rich datasets including measures of health, life satisfaction, and social integration, in addition to demographic and socioeconomic characteristics and measures of helping behavior through nonprofit organizations (giving and volunteering) and in informal social relationships (providing financial and practical assistance to friends and family).

In the absence of natural experiments, longitudinal panel data are required to ascertain the chronology of acts of giving and their correlates. The same holds for the alleged effects of volunteering on trust (Van Ingen & Bekkers, 2015) and health (De Wit, Bekkers, Karamat Ali, & Verkaik, 2015). Since the mid-1990s, a growing number of panel studies have collected data on volunteering and charitable giving and their alleged consequences, such as the German Socio-Economic Panel (GSOEP), the British Household Panel Survey (BHPS) / Understanding Society, the Swiss Household Panel (SHP), the Household, Income and Labour Dynamics in Australia (HILDA) survey, the General Social Survey (GSS) in the US, and, in the Netherlands, the Longitudinal Internet Studies for the Social sciences (LISS) and the Giving in the Netherlands Panel Survey (GINPS).

Under my supervision, students can write a paper on social influences of education, religion and/or helping behavior in the form of volunteering, giving, and informal financial and social support on outcomes such as health, life satisfaction, and trust, using either longitudinal panel survey data or data on twins. Students who are interested in writing such a paper are invited to present their research questions and research design via e-mail to r.bekkers@vu.nl.

René Bekkers, Center for Philanthropic Studies, Faculty of Social Sciences, Vrije Universiteit Amsterdam

References

Aknin, L. B., Barrington-Leigh, C. P., Dunn, E. W., Helliwell, J. F., Burns, J., Biswas-Diener, R., … Norton, M. I. (2013). Prosocial spending and well-being: Cross-cultural evidence for a psychological universal. Journal of Personality and Social Psychology, 104(4), 635–652. https://doi.org/10.1037/a0031578

Bekkers, R., Posthuma, D. & Van Lange, P.A.M. (2017). The Pursuit of Differences in Prosociality Among Identical Twins: Religion Matters, Education Does Not. https://osf.io/ujhpm/ 

Bekkers, R., & Wiepking, P. (2011). A Literature Review of Empirical Studies of Philanthropy: Eight Mechanisms That Drive Charitable Giving. Nonprofit and Voluntary Sector Quarterly, 40(5): 924-973. https://doi.org/10.1177/0899764010380927

De Wit, A., Bekkers, R., Karamat Ali, D., & Verkaik, D. (2015). Welfare impacts of participation. Deliverable 3.3 of the project: “Impact of the Third Sector as Social Innovation” (ITSSOIN), European Commission – 7th Framework Programme, Brussels: European Commission, DG Research. http://itssoin.eu/site/wp-content/uploads/2015/09/ITSSOIN_D3_3_The-Impact-of-Participation.pdf

Dunn, E. W., Aknin, L. B., & Norton, M. I. (2008). Spending Money on Others Promotes Happiness. Science, 319(5870): 1687–1688. https://doi.org/10.1126/science.1150952

Van Ingen, E. & Bekkers, R. (2015). Trust Through Civic Engagement? Evidence From Five National Panel Studies. Political Psychology, 36 (3): 277-294. https://renebekkers.files.wordpress.com/2015/05/vaningen_bekkers_15.pdf


Filed under altruism, Center for Philanthropic Studies, data, experiments, happiness, helping, household giving, Netherlands, philanthropy, psychology, regression analysis, survey research, trust, volunteering

Twenty Years of Generosity in the Netherlands

Paper – ARNOVA 2017 Presentation – Materials at Open Science Framework

In the past two decades, philanthropy in the Netherlands has gained significant attention from the general public, policy makers, and academics. Research on philanthropy in the Netherlands has documented a substantial increase in amounts donated to charitable causes since data on giving became available in the mid-1990s (Bekkers, Gouwenberg & Schuyt, 2017). What has remained unclear, however, is how philanthropy has developed relative to the growth of the economy at large and of consumer expenditure. For the first time, we bring together all the data on philanthropy available from eleven editions of the Giving in the Netherlands survey among households (n = 16,344) to answer the research question: how can trends in generosity in the Netherlands in the past 20 years be explained?

 

The Giving in the Netherlands Panel Survey

One of the strengths of the GINPS is the availability of data on prosocial values and attitudes towards charitable causes. In 2002, the Giving in the Netherlands survey among households was transformed from a cross-sectional to a longitudinal design (Bekkers, Boonstoppel & De Wit, 2017). The GIN Panel Survey has been used primarily to answer questions on the development of these values and attitudes in relation to changes in volunteering activities (Bekkers, 2012; Van Ingen & Bekkers, 2015; Bowman & Bekkers, 2009). Here we use the GINPS in a different way. First we describe trends in generosity, i.e. amounts donated as a proportion of income. Then we seek to explain these trends, focusing on prosocial values and attitudes towards charitable causes.

 

How generous are the Dutch?

Vis-à-vis the rich history of charity and philanthropy in the Netherlands (Van Leeuwen, 2012), the current state of giving is rather poor. On average, charitable donations per household in 2015 amounted to €180 per year, or 0.4% of household income. The median gift is €50 (De Wit & Bekkers, 2017). In the past fifteen years, the trend in generosity has been downward: the proportion of income donated has declined slowly but steadily since 1999. By 2015, giving as a proportion of income had declined by one-fifth from its 1999 peak (see Figure 1).


Figure 1: Household giving as a proportion of consumer expenditure (Source: Bekkers, De Wit & Wiepking, 2017)

 

Why has generosity of households in the Netherlands declined?

The first explanation is declining religiosity. Because giving is encouraged by religious communities, the decline of church affiliation and practice has reduced charitable giving, as it has in the US (Wilhelm, Rooney & Tempel, 2007): as the non-religious have become more numerous, total giving has fallen. The decline in religiosity explains about 40% of the decline in generosity we observe in the period 2001-2015. In Figure 2 we see a similar decline in giving to religion (the red line) as in giving to other organizations (the blue line).


Figure 2: Household giving to religion (red) and to other causes (blue) as a proportion of household income (Source: Bekkers, De Wit & Wiepking, 2017)

 

We also find that those who are still religious have become much more generous. Figure 3 shows that the amounts donated by Protestants (the green line) have almost doubled in the past 20 years. The amounts donated by Catholics (the red line) have also doubled, but are much lower. The non-religious have not increased their giving at all in the past 20 years. However, the increasing generosity of the religious has not been able to turn the tide.


Figure 3: Household giving by non-religious (blue), Catholics (red) and Protestants (green) in Euros (Source: Bekkers, De Wit & Wiepking, 2017)

The second explanation is that prosocial values have declined. Because generosity depends on empathic concern and moral values such as the principle of care (Bekkers & Ottoni-Wilhelm, 2016), the loss of such prosocial values has reduced generosity. Prosocial values have lost support, and the loss of prosociality explains about 15% of the decline in generosity. The loss of prosocial values itself, however, is closely connected to the disappearance of religion. About two thirds of the decline in empathic concern and three quarters of the decline in altruistic values are explained by the reduction of religiosity.

In addition, we see that prosocial values have also declined among the religious. Figure 4 shows that altruistic values have declined not only for the non-religious (blue), but also for Catholics (red) and Protestants (green).


Figure 4: Altruistic values among the non-religious (blue), Catholics (red) and Protestants (green) (Source: Giving in the Netherlands Panel Survey, 2002-2014).

Figure 5 shows a similar development for generalized social trust.


Figure 5: Generalized social trust among the non-religious (blue), Catholics (red) and Protestants (green)  (Source: Giving in the Netherlands Panel Survey, 2002-2016).

Speaking of trust: as donations to charitable causes rely on a foundation of charitable confidence, it may be argued that the decline of charitable confidence is responsible for the decline in generosity (O’Neill, 2009). However, we find that the decline in generosity is not strongly related to the decline in charitable confidence, once changes in religiosity and prosocial values are taken into account. This finding indicates that the decline in charitable confidence is a sign of a broader process of declining prosociality.

 

What do our findings imply?

What do these findings mean for theories and research on philanthropy and for the practice of fundraising?

First, our research clearly demonstrates the utility of including questions on prosocial values in surveys on philanthropy: they not only predict generosity and changes therein over time, but also help explain the relationship between religiosity and generosity.

Second, our findings illustrate the need to develop distinctive theories on generosity. Predictors of levels of giving measured in euros can be quite different from predictors of generosity as a proportion of income.

For the practice of fundraising, our research suggests that the strategies and propositions of charitable causes need modification. Traditionally, fundraising organizations have appealed to empathic concern for recipients and prosocial values such as duty. As these have become less prevalent, propositions appealing to social impact with modest returns on investment may prove more effective.

Fundraising campaigns in the past have also been targeted primarily at loyal donors. This strategy has proven effective, and religious donors have shown resilience in their increasing financial commitment to charitable causes. But it is not a feasible long-term strategy, as the size of this group is shrinking. A new strategy is required to engage new generations of donors.

 

 

References

Bekkers, R. (2012). Trust and Volunteering: Selection or Causation? Evidence from a Four Year Panel Study. Political Behavior, 32 (2): 225-247.

Bekkers, R., Boonstoppel, E. & De Wit, A. (2017). Giving in the Netherlands Panel Survey – User Manual, Version 2.6. Center for Philanthropic Studies, VU Amsterdam.

Bekkers, R. & Bowman, W. (2009). The Relationship Between Confidence in Charitable Organizations and Volunteering Revisited. Nonprofit and Voluntary Sector Quarterly, 38 (5): 884-897.

Bekkers, R., De Wit, A. & Wiepking, P. (2017). Jubileumspecial: Twintig jaar Geven in Nederland. In: Bekkers, R., Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.), Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

Bekkers, R. & Ottoni-Wilhelm, M. (2016). Principle of Care and Giving to Help People in Need. European Journal of Personality, 30(3): 240-257.

Bekkers, R., Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.) (2017). Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

De Wit, A. & Bekkers, R. (2017). Geven door huishoudens. In: Bekkers, R., Schuyt, T.N.M., & Gouwenberg, B.M. (Eds.). Geven in Nederland 2017: Giften, Sponsoring, Legaten en Vrijwilligerswerk. Amsterdam: Lenthe Publishers.

O’Neill, M. (2009). Public Confidence in Charitable Nonprofits. Nonprofit and Voluntary Sector Quarterly, 38: 237–269.

Van Ingen, E. & Bekkers, R. (2015). Trust Through Civic Engagement? Evidence From Five National Panel Studies. Political Psychology, 36 (3): 277-294.

Wilhelm, M.O., Rooney, P.M. and Tempel, E.R. (2007). Changes in religious giving reflect changes in involvement: age and cohort effects in religious giving, secular giving, and attendance. Journal for the Scientific Study of Religion, 46 (2): 217–32.

Van Leeuwen, M. (2012). Giving in early modern history: philanthropy in Amsterdam in the Golden Age. Continuity & Change, 27(2): 301-343.


Filed under Center for Philanthropic Studies, data, household giving, Netherlands, survey research, trends

Hunting Game: Targeting the Big Five

Do not use the personality items included in the World Values Survey. That is the recommendation of Steven Ludeke and Erik Gahner Larsen in a recent paper published in the journal Personality and Individual Differences. The journal is owned by Elsevier, so the official publication is paywalled. Still, I am writing about it because the message of the paper is extremely important. Ludeke and Gahner Larsen formulate their recommendation a little more subtly: “we suggest it is thus hard to justify the use of this data in future research.”

What went wrong here? Join me in a hunting game, targeting the Big Five.

The World Values Survey (WVS) is the largest non-commercial survey in the world, and it is frequently used in social science research. The most recent edition contained a short, 10-item measure of personality characteristics (the BFI-10), validated in a well-cited paper by Rammstedt and John in the Journal of Research in Personality. The inclusion of the BFI-10 enables researchers to study how the Big Five personality traits are related to political participation, happiness, education, and health, among many other things.

So what is wrong with the personality data in the WVS? Ludeke and Gahner Larsen found that the pairs of adjectives designed to measure the five personality traits – Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism – are not correlated as expected. To measure openness, for instance, the survey asked participants to indicate agreement with the statements “I see myself as someone who: has few artistic interests” and “I see myself as someone who: has an active imagination”. One would expect a negative relation between the responses to these two statements. However, the correlation between the two items across all countries is positive, r = .164. This correlation is not strong, but it is in the wrong direction. Similar discrepancies were found between items designed to measure the four other dimensions of personality.

The BFI-10 included in the WVS is this set of statements (an r indicates a reverse-scored item):

I see myself as someone who:

  • is reserved (E1r)
  • is generally trusting (A1)
  • tends to be lazy (C1r)
  • is relaxed, handles stress well (N1r)
  • has few artistic interests (O1r)
  • is outgoing, sociable (E2)
  • tends to find fault with others (A2r)
  • does a thorough job (C2)
  • gets nervous easily (N2)
  • has an active imagination (O2)
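Before looking at the factor structure, here is a sketch of how the item-pair correlations and a forced five-component solution can be computed. The data file and column names are hypothetical stand-ins for the WVS variables, and the solution is unrotated, so it will not exactly reproduce the matrices shown below.

```python
# Sketch: reverse-score the negatively keyed BFI-10 items, check the within-trait
# pair correlations, and force a five-component solution (unrotated).
# File and column names are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA

wvs = pd.read_csv("wvs_wave6_bfi10.csv").dropna()
items = ["E1r", "A1", "C1r", "N1r", "O1r", "E2", "A2r", "C2", "N2", "O2"]

# Reverse-score the negatively keyed items (1-5 agreement scale) so all items point the same way
for item in ["E1r", "C1r", "N1r", "O1r", "A2r"]:
    wvs[item] = 6 - wvs[item]

pairs = {"O": ("O1r", "O2"), "C": ("C1r", "C2"), "E": ("E1r", "E2"),
         "A": ("A1", "A2r"), "N": ("N1r", "N2")}
for trait, (a, b) in pairs.items():
    print(trait, round(wvs[a].corr(wvs[b]), 3))   # should be clearly positive after recoding

pca = PCA(n_components=5).fit(wvs[items])
print(pd.DataFrame(pca.components_.T, index=items).round(3))
```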

In a factor analysis of the 10 items, we would expect to find the five dimensions. However, that is not the result of an exploratory factor analysis applying the conventional criterion of an eigenvalue > 1. In this analysis and all following analyses, negatively keyed items are reverse-scored. Including all countries, a three-factor solution emerges that is very difficult to interpret. Multiple items show high loadings on multiple factors. Removing these one by one, as is usually done in inventories with large numbers of items, we are left with a two-factor solution. If a five-factor solution is forced, we obtain the following component matrix. This is a mess.

| Item | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 |
|---|---|---|---|---|---|
| O1 not artistic (r) | -.116 | -.054 | .105 | -.049 | .961 |
| O2 active imagination | .687 | .162 | -.031 | .197 | -.140 |
| C1 lazy (r) | .249 | -.004 | .836 | -.045 | .159 |
| C2 thorough | .640 | .425 | .231 | .078 | .071 |
| E1 reserved (r) | -.110 | -.825 | -.022 | -.183 | -.047 |
| E2 outgoing | .781 | .097 | -.004 | -.105 | -.068 |
| A1 trusting | .210 | .722 | .003 | -.160 | -.137 |
| A2 fault with others (r) | -.430 | .079 | .614 | -.259 | -.051 |
| N1 relaxed (r) | -.461 | -.377 | .235 | .534 | .144 |
| N2 nervous | .188 | .133 | -.291 | .770 | -.112 |

So what is wrong with these data?

Upon closer inspection, Ludeke and Gahner Larsen found that the correlations were markedly different across countries. Bahrain is a clear outlier. The weakly positive correlation between the two openness items is due in part to the inclusion of data from Bahrain. Without this country, the correlation is only .135. Still positive, but not as strong. The data for Bahrain are not only strange for openness, but also for the other factors. In the table below I have computed the correlations between the recoded item pairs for each of the five dimensions.

Without Bahrain, the correlations are still strange, but a little less strange.

| Item-pair correlation | O | C | E | A | N |
|---|---|---|---|---|---|
| With Bahrain | -.164 | .238 | -.207 | -.036 | .008 |
| Without Bahrain | -.135 | .275 | -.181 | -.009 | .044 |

What is wrong with the data for Bahrain? The response patterns of cases from Bahrain, it turns out, surprisingly often consist of a series of ten identical values, such as 1111111111 or 5555555555. I routinely check survey data for such patterns. While it is impossible to prove, serial response patterns suggest fabrication of data. Participants and/or interviewers skipping through questions may also produce such patterns. Almost half of all cases from Bahrain follow such a pattern. Other countries with a relatively high proportion of serial pattern responses are South Africa, Singapore, and China. The two countries for which the BFI-10 behaves closest to what previous research has reported, the Netherlands and Germany, have a very low occurrence of serial pattern responses.
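The straight-lining check described above takes only a few lines; a minimal sketch, again with hypothetical file and column names:

```python
# Sketch: flag respondents whose ten BFI-10 answers are all identical ("straight-lining")
# and tabulate the number and share of such cases per country. Column names are hypothetical.
import pandas as pd

wvs = pd.read_csv("wvs_wave6_bfi10.csv")
items = ["E1r", "A1", "C1r", "N1r", "O1r", "E2", "A2r", "C2", "N2", "O2"]

wvs["straightline"] = wvs[items].nunique(axis=1) == 1  # True when all ten answers are the same

summary = (wvs.groupby("country")["straightline"]
              .agg(n="sum", share="mean")
              .sort_values("share", ascending=False))
print(summary.head(6))
```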

| Country | Number of serial pattern responses | % |
|---|---|---|
| Bahrain | 598 | 49.83% |
| South Africa | 250 | 7.08% |
| Singapore | 108 | 5.48% |
| China | 52 | 2.26% |
| Netherlands | 8 | 0.42% |
| Germany | 2 | 0.10% |

Even without the data for Bahrain and the serial responses from all other countries, however, the factor structure is err…not what one would expect. Still a mess.

| Item | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 |
|---|---|---|---|---|---|
| O1 not artistic (r) | -.094 | -.040 | .086 | -.031 | .968 |
| O2 active imagination | .691 | .150 | -.046 | .158 | -.130 |
| C1 lazy (r) | .297 | .023 | .815 | -.017 | .146 |
| C2 thorough | .637 | .410 | .241 | .050 | .088 |
| E1 reserved (r) | -.098 | -.828 | -.033 | -.158 | -.058 |
| E2 outgoing | .771 | .070 | -.001 | -.140 | -.052 |
| A1 trusting | .192 | .710 | .022 | -.190 | -.133 |
| A2 fault with others (r) | -.405 | .080 | .628 | -.230 | -.048 |
| N1 relaxed (r) | -.421 | -.352 | .218 | .592 | .123 |
| N2 nervous | .192 | .133 | -.315 | .750 | -.104 |

Only for Germany and the Netherlands is the factor structure somewhat in line with previous research. Here is the solution for the two countries combined. In both countries, the two statements for agreeableness do not correlate as expected. The second statement for conscientiousness (thorough) also has a cross-loading with one of the agreeableness items (trusting).

| Item | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 |
|---|---|---|---|---|---|
| O1 not artistic (r) | -.047 | -.056 | .842 | .120 | -.089 |
| O2 active imagination | .208 | .050 | .729 | -.140 | .173 |
| C1 lazy (r) | .061 | -.083 | -.040 | .865 | -.087 |
| C2 thorough | -.064 | .053 | .057 | .627 | .440 |
| E1 reserved (r) | .715 | -.113 | .130 | .032 | -.219 |
| E2 outgoing | .732 | -.166 | .126 | .166 | .210 |
| A1 trusting | -.008 | -.100 | .042 | .049 | .853 |
| A2 fault with others (r) | -.657 | -.272 | .090 | .177 | -.001 |
| N1 relaxed (r) | .012 | .804 | -.002 | .116 | -.259 |
| N2 nervous | -.052 | .835 | -.006 | -.160 | .117 |

This leaves us with three possibilities.

One possibility was raised by Christopher Soto on Twitter: acquiescence bias could be driving the results. In a study using data from another multi-country survey, the International Social Survey Program (ISSP), Rammstedt, Kemper & Borg subtracted each respondent’s mean response across all BFI-10 items from his or her score on each single item. Doing this, however, does not clear the sky. Looking again at the correlations for the pairs of items measuring the same constructs, we see that they are not ‘better’ in the second row. On the contrary, they are less positive.
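A sketch of this acquiescence correction, following the description above (subtracting each respondent’s mean agreement across the ten items from every single item); the file and column names are hypothetical.

```python
# Sketch of the acquiescence correction: subtract each respondent's mean response
# across the ten BFI-10 items from his or her score on each single item.
# Column names are hypothetical.
import pandas as pd

wvs = pd.read_csv("wvs_wave6_bfi10.csv").dropna()
items = ["E1r", "A1", "C1r", "N1r", "O1r", "E2", "A2r", "C2", "N2", "O2"]

adjusted = wvs[items].sub(wvs[items].mean(axis=1), axis=0)

# Re-inspect a within-trait pair correlation after the adjustment (the two openness items)
print(round(adjusted["O1r"].corr(adjusted["O2"]), 3))
```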

| Item-pair correlation | O | C | E | A | N |
|---|---|---|---|---|---|
| Unadjusted | -.122 | .286 | -.166 | .001 | .053 |
| Attenuated | -.310 | .078 | -.235 | -.107 | .049 |

The factor structure of the attenuated scores is also nothing like the ‘regular’ five-factor structure. Still a mess.

| Item | Component 1 | Component 2 | Component 3 | Component 4 | Component 5 |
|---|---|---|---|---|---|
| O1a | -.192 | -.025 | .096 | -.117 | -.957 |
| O2a | .509 | .190 | -.319 | .034 | .269 |
| C1a | -.133 | .469 | .617 | -.460 | .174 |
| C2a | .351 | .681 | -.005 | .050 | .071 |
| E1a | -.043 | -.846 | .080 | -.250 | .017 |
| E2a | .823 | .029 | .034 | .045 | .114 |
| A1a | .086 | .285 | .026 | .821 | .148 |
| A2a | -.497 | -.246 | .555 | .274 | .067 |
| N1a | -.598 | -.345 | -.223 | -.345 | -.047 |
| N2a | -.123 | .043 | -.854 | -.031 | .178 |

The second possibility is that things went wrong in the translation of the questionnaire. The same adjectives or statements may mean different things in different countries or languages, which makes them useless as operationalizations of the same underlying construct. It will require a detailed study of the translations to see whether anything went wrong. The questionnaires are available at the World Values Survey website. The Dutch questionnaire is good. I looked at a few other languages. The Spanish questionnaire for Ecuador also seems right: “Me veo como alguien que…… es confiable” is quite close to “I see myself as someone who is… generally trusting”, though my Spanish is not very good. Rene Gempp noted on Twitter that the BFI-10 is a Likert-type scale, but the Spanish translation asks about frequency, and one of the response options, “para nada frecuentemente”, is *very* confusing in Spanish.

I am not sure about your fluency in Kinyarwanda, the language spoken in Rwanda, but the backtranslation of the questionnaire into English does not give me much confidence. Apparently, “…wizera muri rusange” is the translation of “is generally trusting”. The backtranslation is “…believe in congregation”.

[Screenshot: backtranslation of the Kinyarwanda BFI-10 items]

The third possibility is that personality structure may indeed be different in different countries. This would be the most problematic one.

Data from the 2010 AmericasBarometer study, conducted by the Latin American Public Opinion Project (LAPOP), support this interpretation. The survey included a different short form of the Big Five, the TIPI, developed by Gosling, Rentfrow, and Swann. A recent study by Weinschenk published in Social Science Quarterly shows that personality scores based on the TIPI are hardly related to turnout in elections in the Americas. This result may be logical in countries where voting is mandatory, such as Brazil. But the more disconcerting methodological problem is that the Big Five are not reliably measured with pairs of statements in most of the countries included in the survey. Here are the correlations between the pairs of items for each of the five dimensions, taken from the supplementary online materials of the Weinschenk paper.

[Figure: correlations between TIPI item pairs by country, AmericasBarometer 2010]

The graphs show that the TIPI items only work well in the US and Canada – the two ‘WEIRD’ countries in the study. In Brazil, to take one example, the correlations are <.10 for extraversion, agreeableness and conscientiousness, and lower than .25 for emotional stability and openness.

Back to the WVS case, which raises important questions about the peer review process. Two journal articles based on the WVS personality data (here and here) passed peer review because neither the reviewers nor the editors asked questions about the reliability of the items being used. Neither did the authors check, apparently. Obviously, researchers should check the reliability of the measures they use in an analysis. When authors fail to do so, reviewers and editors should ask. Weinschenk reported the low correlations in the online supplementary materials, but did not report reliability coefficients in the paper.

The good thing is that because the WVS is in the public domain, these problems came to light relatively quickly. Of course, they could have been avoided if the WVS had scrutinized the reliability of the measure before putting the data online, if the authors of the papers using the data had checked the reliability of the items, or if the reviewers and editors had asked the right questions. Another good thing is that the people (volunteers?) behind the WVS twitter account have been frank in tweeting about the problems found in the data.

Summing up:

  1. We still do not know why the BFI-10 measure of the Big Five personality traits does not perform as it did in previous research.
  2. It is probably not due to acquiescence bias. Translations may be problematic for some countries.
  3. Do not use the WVS BFI-10 data from countries other than Germany and the Netherlands.
  4. Treat the WVS data from Bahrain with great caution, and to be on the safe side, just exclude them from your analyses.
  5. The reliability of short Big Five measures is very low in non-WEIRD countries.

The code for the analyses reported in this blog is posted at the Open Science Framework.

Update 22 March 2017. The factor loadings in the table with the results of the analysis of attenuated scores have been updated. The table displayed previously was based on a division of the original scores by the total agreement scores; Rammstedt et al. subtracted the original scores from the total agreement scores. The results of the new analysis are close to the previous ones and still confusing. The code on the OSF has been updated. A clarification was also added that the negatively keyed items used in the factor analyses were all recoded so that they score positively (HT to Christopher Soto).


Filed under personality, survey research

Five Reasons Why Social Science is So Hard 

1. No Laws

All we have is probabilities.

2. All Experts

The knowledge we have is continuously contested. The objects of study think they know why they do what they do.

3. Zillions of Variables

Everything is connected, and potentially a cause – like a bowl of well-tossed spaghetti.

4. Many Levels of Action

Nations, organizations, networks, individuals, time all have different dynamics.

5. Imprecise Measures

Few instruments have near perfect validity and reliability.

Conclusion

Social science is not as easy as rocket science. It is way more complicated.


Filed under survey research

Tools for the Evaluation of the Quality of Experimental Research

pdf of this post

Experiments can have important advantages over other research designs. The most important advantage of experiments concerns internal validity. Random assignment to treatment reduces the attribution problem and increases the possibilities for causal inference. An additional advantage is that control over participants reduces the heterogeneity of observed treatment effects.

The extent to which these advantages are realized in the data depends on the design and execution of the experiment. Experiments are of higher quality if the sample size is larger and the theoretical concepts are measured more reliably and with higher validity. The sufficiency of the sample size can be checked with a power analysis. For most effect sizes in the social sciences, which are small (d = 0.2), a sample of 1,300 participants is required to detect them at conventional significance levels (p < .05) with 95% power (see appendix). Even for a stronger effect size (d = 0.4), more than 300 participants are required. The reliability of normative scale measures can be judged with Cronbach’s alpha. A rule of thumb for unidimensional scales is that alpha should be at least .63 for a scale consisting of 4 items, .68 for 5 items, .72 for 6 items, .75 for 7 items, and so on. The validity of measures should be justified theoretically and can be checked with a manipulation check, which should reveal a sizeable and significant association with the treatment variables.
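The sample sizes mentioned above can be checked with a standard power calculation, and the alpha rule of thumb is consistent with an average inter-item correlation of roughly .30 via the Spearman-Brown formula. A minimal sketch:

```python
# Power analysis for a two-group comparison (two-sided alpha = .05, power = .95)
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
for d in (0.2, 0.4):
    n_per_group = power.solve_power(effect_size=d, alpha=0.05, power=0.95)
    print(f"d = {d}: about {2 * n_per_group:.0f} participants in total")
# d = 0.2 -> roughly 1,300 participants; d = 0.4 -> a little over 300

# The alpha rule of thumb corresponds to an average inter-item correlation of about .30
# (Spearman-Brown formula for standardized alpha)
r_bar = 0.30
for k in (4, 5, 6, 7):
    alpha = k * r_bar / (1 + (k - 1) * r_bar)
    print(f"{k} items: alpha = {alpha:.2f}")
```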

The advantages of experiments are reduced if assignment to treatment is non-random and treatment effects are confounded. In addition, a variety of other problems may endanger internal validity. Shadish, Cook & Campbell (2002) provide a useful list of such problems.

Also it should be noted that experiments can have important disadvantages. The most important disadvantage is that the external validity of the findings is limited to the participants in the setting in which their behavior was observed. This disadvantage can be avoided by creating more realistic decision situations, for instance in natural field experiments, and by recruiting (non-‘WEIRD’) samples of participants that are more representative of the target population. As Henrich, Heine & Norenzayan (2010) noted, results based on samples of participants in Western, Educated, Industrialized, Rich and Democratic (WEIRD) countries have limited validity in the discovery of universal laws of human cognition, emotion or behavior.

Recently, experimental research paradigms have received fierce criticism. Results of research often cannot be reproduced (Open Science Collaboration, 2015), and publication bias is ubiquitous (Ioannidis, 2005). It has become clear that there is a lot of undisclosed flexibility in all phases of the empirical cycle. While these problems have been discussed widely in communities of researchers conducting experiments, they are by no means limited to one particular methodology or mode of data collection. It is likely that they also occur in communities of researchers using survey or interview data.

In the positivist paradigm that dominates experimental research, the empirical cycle starts with the formulation of a research question. To answer the question, hypotheses are formulated based on established theories and previous research findings. Then the research is designed, data are collected, a predetermined analysis plan is executed, results are interpreted, the research report is written and submitted for peer review. After the usual round(s) of revisions, the findings are incorporated in the body of knowledge.

The validity and reliability of results from experiments can be compromised in two ways. The first is by juggling with the order of phases in the empirical cycle. Researchers can decide to amend their research questions and hypotheses after they have seen the results of their analyses. Kerr (1998) labeled the practice of reformulating hypotheses HARKing: Hypothesizing After Results are Known. Amending hypotheses is not a problem when the goal of the research is to develop theories to be tested later, as in grounded theory or exploratory analyses (e.g., data mining). But in hypothesis-testing research, HARKing is a problem, because it increases the likelihood of publishing false positives. Chance findings are interpreted post hoc as confirmations of hypotheses that are a priori rather unlikely to be true. When these findings are published, they are unlikely to be reproducible by other researchers, creating research waste and, worse, reducing the reliability of published knowledge.

The second way the validity and reliability of results from experiments can be compromised is by misconduct and sloppy science within various stages of the empirical cycle (Simmons, Nelson & Simonsohn, 2011). The data collection and analysis phase as well as the reporting phase are most vulnerable to distortion by fraud, p-hacking and other questionable research practices (QRPs).

  • In the data collection phase, observations that (if kept) would lead to undesired conclusions or non-significant results can be altered or omitted. Also, fake observations can be added (fabricated).
  • In the analysis of data researchers can try alternative specifications of the variables, scale constructions, and regression models, searching for those that ‘work’ and choosing those that reach the desired conclusion.
  • In the reporting phase, things go wrong when the search for alternative specifications and the sensitivity of the results to decisions in the data analysis phase are not disclosed.
  • In the peer review process, there can be pressure from editors and reviewers to cut reports of non-significant results, or to collect additional data supporting the hypotheses and the significant results reported in the literature.

The result of these forms of QRPs is that null findings are less likely to be published and that published research is biased towards positive findings confirming the hypotheses. Published findings are not reproducible, and when a replication attempt is made, they turn out to be less significant, less often positive, and of lower effect size (Open Science Collaboration, 2015).

Alarm bells, red flags and other warning signs

Some of the forms of misconduct mentioned above are very difficult for reviewers and editors to detect. When observations are fabricated or omitted from the analysis, only inside information, very sophisticated data detectives, or the stupidity of the authors can help us. Many other forms of misconduct are also difficult to prove. While smoking guns are rare, we can look for clues. I have developed a checklist of warning signs and good practices that editors and reviewers can use to screen submissions (see below). The checklist uses terminology that is not specific to experiments, but applies to all forms of data. While a high number of warning signs in itself does not prove anything, it should alert reviewers and editors. There is no norm for the number of flags. The table below only mentions the warning signs; the paper version of this blog post also shows a column with the positive poles. Those who would like to count good practices and reward authors for a higher number can count gold stars rather than red flags. The checklist was developed independently of the checklist that Wicherts et al. (2016) recently published.

Warning signs

  • The power of the analysis is too low.
  • The results are too good to be true.
  • All hypotheses are confirmed.
  • P-values are just below critical thresholds (e.g., p < .05).
  • A groundbreaking result is reported but not replicated in another sample.
  • The data and code are not made available upon request.
  • The data are not made available upon article submission.
  • The code is not made available upon article submission.
  • Materials (manipulations, survey questions) are described superficially.
  • Descriptive statistics are not reported.
  • The hypotheses are tested in analyses with covariates and results without covariates are not disclosed.
  • The research is not preregistered.
  • No details of an IRB procedure are given.
  • Participant recruitment procedures are not described.
  • Exact details of time and location of the data collection are not described.
  • A power analysis is lacking.
  • Unusual / non-validated measures are used without justification.
  • Different dependent variables are analyzed in different studies within the same article without justification.
  • Variables are (log)transformed or recoded in unusual categories without justification.
  • Numbers of observations mentioned at different places in the article are inconsistent. Loss or addition of observations is not justified.
  • A one-sided test is reported when a two-sided test would be appropriate.
  • Test-statistics (p-values, F-values) reported are incorrect.

With the increasing number of retractions of articles reporting on experimental research published in scholarly journals, awareness of the fallibility of peer review as a quality control mechanism has increased. Communities of researchers employing experimental designs have formulated solutions to these problems. In the review and publication stage, the following solutions have been proposed.

  • Access to data and code. An increasing number of science funders require grantees to provide open access to the data and the code that they have collected. Likewise, authors are required to provide access to data and code at a growing number of journals, such as Science, Nature, and the American Journal of Political Science. Platforms such as Dataverse, the Open Science Framework and Github facilitate sharing of data and code. Some journals do not require access to data and code, but provide Open Science badges for articles that do provide access.
  • Pledges, such as the ‘21 word solution’, a statement designed by Simmons, Nelson and Simonsohn (2012) that authors can include in their paper to ensure they have not fudged the data: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”
  • Full disclosure of methodological details of research submitted for publication, for instance through psychdisclosure.org is now required by major journals in psychology.
  • Apps such as Statcheck, p-curve, p-checker, and r-index can help editors and reviewers detect fishy business. They also have the potential to improve research hygiene when researchers start using these apps to check their own work before they submit it for review.
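To give an impression of the kind of consistency check that tools like Statcheck automate, here is a minimal sketch that recomputes a p-value from a reported t-statistic and degrees of freedom. The numbers are made up for illustration.

```python
# Sketch of a statcheck-style consistency check: recompute the p-value from a reported
# t-statistic and degrees of freedom and compare it with the reported p-value.
# The numbers below are hypothetical.
from scipy import stats

reported_t, df, reported_p = 2.10, 58, 0.02

recomputed_p = 2 * stats.t.sf(abs(reported_t), df)  # two-sided p-value
print(f"recomputed p = {recomputed_p:.3f}")

if abs(recomputed_p - reported_p) > 0.005:
    print("flag: the reported p-value is inconsistent with the reported test statistic")
```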

As these solutions become more commonly used we should see the quality of research go up. The number of red flags in research should decrease and the number of gold stars should increase. This requires not only that reviewers and editors use the checklist, but most importantly, that also researchers themselves use it.

The solutions above should be supplemented by better research practices before researchers submit their papers for review. In particular, two measures are worth mentioning:

  • Preregistration of research, for instance on Aspredicted.org. An increasing number of journals in psychology require research to be preregistered. Some journals guarantee publication of research regardless of its results after a round of peer review of the research design.
  • Increasing the statistical power of research is one of the most promising strategies to increase the quality of experimental research (Bakker, Van Dijk & Wicherts, 2012). In many fields and for many decades, published research has been underpowered, using samples of participants that are not large enough to detect the reported effect sizes. Using larger samples reduces the likelihood of both false positives and false negatives.

A variety of institutional designs have been proposed to encourage the use of the solutions mentioned above, including removing career incentives that reward questionable research practices in hiring and promotion decisions, rewarding researchers for good conduct through badges, the adoption of voluntary codes of conduct, and the socialization of students and senior staff through teaching and workshops. Research funders, journals, editors, authors, reviewers, universities, senior researchers and students all have a responsibility in these developments.

References

Bakker, M., Van Dijk, A. & Wicherts, J. (2012). The Rules of the Game Called Psychological Science. Perspectives on Psychological Science, 7(6): 543–554.

Henrich, J., Heine, S.J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33: 61 – 135.

Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8): e124. http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124

Kerr, N.L. (1998). HARKing: Hypothesizing After Results are Known. Personality and Social Psychology Review, 2: 196-217.

Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science, 349(6251): aac4716. http://www.sciencemag.org/content/349/6251/aac4716.full.html

Shadish, W.R., Cook, T.D., & Campbell, D.T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22: 1359–1366.

Simmons, J.P., Nelson, L.D. & Simonsohn, U. (2012). A 21 Word Solution. Available at SSRN: http://ssrn.com/abstract=2160588

Wicherts, J.M., Veldkamp, C.L., Augusteijn, H.E., Bakker, M., Van Aert, R.C & Van Assen, M.L.A.M. (2016). Researcher degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7: 1832. http://journal.frontiersin.org/article/10.3389/fpsyg.2016.01832/abstract


Filed under academic misconduct, experiments, fraud, incentives, open science, psychology, survey research