GDPR | Rene Bekkers

Scientific research is based on data. How should researchers treat and document the data they analyze? Research Data Management policies recommend that data should be “as open as possible, and as closed as necessary”. What does that mean in practice?

Only publish data you are allowed to publish. Certainly “as open as possible” does not mean that researchers should make public all data they have. Researchers cannot share sensitive data of living individuals, because doing so would violate the GDPR. To arrive at data researchers can share responsibly, they have to process the data in such a manner that individuals cannot be identified.

Only publish data others can understand. In addition, researchers should process the data in such a manner that other researchers can work with the data. Fellow researchers should be able to understand how the conclusions drawn from a study were based on the data. This requires researchers to document each of the steps they took to collect the data, and to prepare, process, and analyze them.

Varieties of data: seven sources of data at five stages of processing. In practice, the things researchers do to prepare and process raw data vary enormously from one study to the the other. Also the things that researchers end up with to draw conclusions from can be very different. In the matrix below, you find examples and suggestions for data management for seven common sources of data, at five stages of processing.

	1. Raw Data	2. Preparation	3. Processed Data	4. Data Analysis	5. Analyzed Data
Surveys	Answers to survey questions	Anonymization, aggregating, recoding, cleaning	Anonymized survey data in .sav, .dta, .xls(x) or .csv format	Analysis script in .sps, .do or .rmd format	Numbers, graphs, and tables
Interviews and focus groups	Interview recordings and notes	Anonymization, pseudonymization, aggregating, cleaning	Anonymized interview transcripts in .doc(x) or .txt format	Coding tree, open coding, axial coding	Interpretative deductions, quotes, tables
Observations	Physical measures, digital traces, log entries, photos, videos	Anonymization, pseudonymization, aggregating, recoding, cleaning	Anonymized files and descriptions in .doc(x), .txt, xls(x) or .csv format	Analysis script in .sps, .do or .rmd format; coding tree, open coding, axial coding	Numbers, graphs, tables, Interpretative deductions, vignettes
Participant observation	Field notes, photos, videos, collected objects, memories	Coding	Notes and descriptions in .doc or .txt format	Anonymization and reflection on positionality	Interpretative deductions, vignettes
Administrative data	Entries in official registers	Anonymization, aggregating, recoding, cleaning	Anonymized register data in .sav, .dta, .xls(x) or .csv format	Analysis script in .sps, .do or .rmd format	Numbers, graphs, and tables
News and social media data	News items, posts on social media platforms	Web scraping or API script compiling, aggregating, recoding, cleaning	Anonymized textual and visual media data	Analysis script in .sps, .do or .rmd format	Numbers, graphs, and tables
Materials	Original materials of human or natural origin, or fragments thereof	Copying, reproducing, treating, sampling, described in lab or field notebook	Copies, facsimile reproductions, images and database entries in .xls(x), .csv, json	Analysis script in .py or .rmd format	Numbers, graphs, and tables
Archiving, documentation and sharing requirements	Should be preserved and archived	Should be documented and shared	Should be documented and shared	Should be documented and shared	Should be documented and shared

Research Data Management suggestions and examples for seven sources of data at five stages of processing

Raw, Processed, and Analyzed Data. A common distinction in research data management is between Raw Data (the first column), Processed Data (the third column) and Analyzed Data (the fifth and final column). Raw Data are the original sources that researchers ultimately rely upon to do their work. Once they’ve worked with the Raw Data in one way or another, by filtering them, categorizing, copying, treating, preserving, or in any way modifying them, they become Processed Data. Often researchers engage in multiple rounds of processing before they analyze the data. It is also possible that researchers redo some of the data processing after they have conducted an initial analysis. The data analysis produces the Analyzed Data that end up in the research report in the form of an exemplary picture, a table of results, an infographic, a figure, or a literal quote from an interview.

Access to data can be restricted. At all stages, access to the data can be restricted to specific groups (e.g., scientists), for specific purposes (e.g., a review, audit) or specific conditions (e.g., physical sites or secure online environments). In most cases, raw data include personal information that researchers should never publicly share.

Transparency is crucial. Especially when the raw data cannot be shared, providing documentation of the data processing steps (the second column) allows fellow researchers to retrace the production of the analysis file from the raw data. Also researchers can produce synthetic datasets that contain the same properties as the original raw data, but do not contain data on actual individuals. To provide transparency in research decisions and enable others to trace and redo the steps taken between raw, processed and analyzed data, researchers should identify and document processed data (the third column) along with executable code files (the fourth column) that produce the results they report as analyzed data in their publications.

Example: interview data

Raw interview data. Suppose for example that a researcher conducts interviews with persons about stressful life events for a study on coping strategies. If all went well, researchers ask those to be interviewed for permission to record the conversation, after providing them with information about the purposes and topics of the interview, that participation is voluntary, and what rights the interviewed have with respect to the information they provide. In such a study, the Raw Data are the recording of the interview, on a device such as a tape, dictaphone, smartphone, laptop, or on a cloud server.

Processed interview data. After the interview, the researcher transcribes the conversation in an electronic document, which indicates who said what in chronological order. The transcription is Processed Data. Documenting how the transcription was made is essential: did the researcher type up the conversation, did someone else do it, or was it done by an automatic transcription service? Typically, as a first step, the recording of the interview is transcribed verbatim, including all “eh”s, “ah”s and “….” (silences) in the conversation. Next, the transcription is edited to hide the identity of persons interviewed and other persons and organizations mentioned in the interview. Researchers can hide the identities of persons and organizations by using other names (pseudonymization) or using numbers and letters (“interview 1 at organization A”) to represent those interviewed and mentioned. The pseudonymized data is also Processed Data (version 2). Usually such editing is not enough to effectively hide the identities of those interviewed or mentioned, because details in the conversations will be unique enough for others to still be able to identify them. In this case, a third round of data processing will be needed, omitting unique details altogether or presenting specific information at a higher level of aggregation: “a colleague at organization A” or “my aunt” can be represented as “a person I know”. For transparency it is important that researchers document the rules they applied to edit the transcription.

Analyzed interview data. Researchers use a wide variety of techniques to analyze interview data. A common technique is to assign codes or tags to (parts of) sentences, organize them into a hierarchical tree, and note the co-occurrence of certain codes. Typically, quotes from the interviews that illustrate patterns in the data are presented in the research report. These quotes are the Analyzed Data. Another common category of Analyzed Data in interviews is an overview of codes and their co-occurrences in the form of a table.

Data transparency

Regardless of the exact methods of analysis researchers use, it is important to uniquely identify which data they have started from to draw a certain conclusion. A persistent digital object identifier (DOI) is the best way to do so. The DOI does not change, and is unique, so that it is clear upon which data the conclusions of research are based. In the example of the interview data above, the raw data of the audio recordings are not publicly accessible, but have their own DOI that is different from the DOI of the processed data of the edited and de-identified interview transcripts that can be accessed for review purposes. The analyzed data of the coding tree also has its own DOI.

Varieties of data

The seven sources of data cover most types of data researchers use. The groups are very broad categories. There are many varieties of observations for example, including in-person observation by researchers (such as leadership behavior by managers in firms, parent-child interactions in a home setting, interruptions in parliamentary debates), and observation through measurement equipment (such as images of traffic violations on cameras, heart rate measurements by wristbands, DNA samples from swabs). The same holds for the other sources. Materials in behavioral genetics are very different from materials in cultural heritage studies. Furthermore, for each of these sources, researchers use a wide variety of techniques to collect, process, and analyze data. Though the table above does not include an exhaustive list of all varieties within each category, the seven categories are intended to be comprehensive. If you work with a category of data that is not included in the table, I’d love to hear from you!

Email Subscription

Meta

Recent Posts

Categories

Archives