Data

Common data terms

Data can be described using terms that indicate how it has been “collected”, “stored”, “organized”, or “processed”. As data is the core of BDM, it is worth briefly clarifying what these terms mean in the context of BDM.

Common terms

Process: Raw vs. Derived

Data that has not been processed is called raw data; data that is computed from other data, for example summary statistics, is called derived data.

For reproducible research, it is essential to share raw data along with the analysis code. This allows others to replicate all analyses within a research project from the raw data, rather than relying only on partial derived data.

Format: Native vs. Tidy

Different systems may record data in a variety of forms (e.g., structured, unstructured) and file formats (e.g., CSV, JSON, MAT); this is data in its native format. Tidy data is organized in a long-format table where each row is an observational record and each column is a variable or property of that record1. BDM recommends storing data in a tidy format.
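As a minimal illustration (using pandas and made-up column names), the sketch below reshapes a wide, native-format table into a tidy long-format table:

```python
# A minimal sketch of going from a wide, native-format table to a tidy
# long-format table; subject and item names are hypothetical.
import pandas as pd

# Native format: one row per subject, one column per test item (wide).
native = pd.DataFrame({
    "subject_id": ["s01", "s02"],
    "item_1": [3, 4],
    "item_2": [5, 2],
})

# Tidy format: one row per observation (subject x item), one column per variable.
tidy = native.melt(
    id_vars="subject_id",
    var_name="item_id",
    value_name="response",
)
print(tidy)
#   subject_id item_id  response
# 0        s01  item_1         3
# 1        s02  item_1         4
# 2        s01  item_2         5
# 3        s02  item_2         2
```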

Storage: Datalake vs. Warehouse vs. Lakehouse

A data lake refers to an unstructured collection of data, typically from many different sources. On the one hand, a data lake can be seen as maximally flexible (accepting all sorts of data, including tables in different formats, images, and texts) and holding the potential for limitless data analysis; on the other hand, it can be seen as a chaotic data dump or swamp.

A data warehouse organizes the incoming data in a consistent way (schema-on-write) before storing or querying it. In most cases, data from warehouses can be readily used for downstream analysis. Data from data lakes, however, requires significant background knowledge and effort to process.

There is, however, a newer architecture that combines the best of both worlds: the data lakehouse. A data lakehouse retains the flexibility of data lakes while providing an intermediate semi-structured layer (with schema-on-read) that gives quick and flexible access to the data in real time.

The term data mart is sometimes used to refer to a subset or a view of the data warehouse that is optimized for specific data use and users.

Organize: Database vs. Dataset

A database usually refers to a collection of data that is managed by a database management system (DBMS). A database typically implies some form of interaction between the data and the end users, who read data from or write data into the database.

A dataset also refers to a collection of data, but while a database is a “living” entity, a dataset is typically a static resource that can be downloaded and shared as a whole (e.g., the PISA 2012 dataset). A dataset can be defined by a data collection campaign (e.g., all the data collected within a research project) or a data analysis campaign (e.g., a dataset created from a larger medical database that focuses specifically on mental health issues during the COVID pandemic).

Confusing terms

It is sometimes easy to confuse the terms described above. For example, data in native format is often called raw data even though the data may in fact contain derived data (for example, the overall score a person achieved on a school test alongside all individual responses and their correctness). Data in its native format is often messy but could in some cases be tidy.

Note also that it is not always clear what is meant by “processing” and what still counts as “raw data”. In BDM, data is no longer considered “raw” when it has been processed in a way that reflects assumptions about the scientific values of the variables and the relationships between them. For example, filtering out trials with response times shorter than 150 milliseconds, because one believes that such short durations imply the responses are not valid, is a “biased” form of data processing; the resulting filtered data can no longer be considered “raw”. On the other hand, if for some reason a system logs the same events multiple times, removing the duplicates is an unbiased pre-processing step, and the resulting filtered data should still be considered “raw” data.

In general, the following operations may be performed on original native-format data without compromising the “raw” status of the resulting data (a minimal sketch of these operations follows the list):

  • selection of variables
  • renaming variables and enum values
  • changes of units (for example, converting milliseconds to seconds)
  • reordering of columns and rows
  • removing duplicates
  • joining or splitting tables
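The sketch below shows some of these operations with pandas; the file and column names are hypothetical. None of these steps encodes scientific assumptions, so the output can still be treated as raw data.

```python
# A minimal pandas sketch of raw-preserving operations; names are made up.
import pandas as pd

native = pd.read_csv("native_export.csv")          # hypothetical native-format export

raw = (
    native
    .loc[:, ["subj", "rt_ms", "resp"]]             # selection of variables
    .rename(columns={"subj": "subject_id",         # renaming variables
                     "rt_ms": "response_time",
                     "resp": "response"})
    .assign(response_time=lambda d: d["response_time"] / 1000)  # ms -> s (change of units)
    .drop_duplicates()                              # removing duplicated log entries
    .sort_values("subject_id")                      # reordering rows
)

# By contrast, a filter such as
#   raw[raw["response_time"] >= 0.150]
# reflects an assumption about which responses are valid, so its output
# would be derived data, no longer "raw".
```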

Which data to store and which to share?

More data is stored than shared. Researchers may store participants’ contact information or specific parameters of the hardware being used, but they may not share that information.

What information should be collected in the first place? What subset of that data should be shared? And should the native-format data be shared along with the tidy data extracted from it?

Regarding what data to store, the answer depends on the purpose of the study. In any case, participants need to be informed about what data is being collected and the data collection must follow relevant regulations. It is generally well advised to be parsimonious regarding which personal data to collect (see Datensparsamkeit).

In BDM, native data should not be shared if one has the ability to extract and share better-formatted data. There are three main reasons not to share native data:

  • native data may not be usable: it can be messy, come in diverse, sometimes proprietary, formats, and lack documentation.
  • native data may contain personal data the data sharer is not aware of (e.g., participants’ full name, IP address); by defining what data is to be shared and extracting only that data, such accidental privacy breaches can be avoided.
  • the extraction of usable data from the native data is outside the scope of responsibility of the data analyst. This last point deserves further explanation. For research to be reproducible, one may want to test all the steps from data collection up to the final results. While all the steps might be tested, they are not necessarily tested by the same person. For instance, a software company may run tests to check that the recorded timestamps are accurate, and a lab technician may run tests to calibrate the monitor and other hardware. The question of what data to share is thus related to which quality assurance requirements the data analyst is expected to fulfill. If both the data in its native format and the data extracted from it are handed over to the data analyst, it becomes the analyst’s responsibility to verify that the two are in fact in agreement: if there is an error in the pre-processing code, the analyst becomes responsible because they had access to it and could have spotted and corrected the error. For this reason, in BDM, the preparation of usable data from the original data in its native format is not the responsibility of the data analyst; it is the responsibility of the data engineer and of the entity sharing the data. Tests must of course be conducted to verify the validity and accuracy of that process, but, put simply, this is not the role of the person receiving the data.

In short: don’t share original unformatted data in its native format; instead, share well-documented, well-formatted raw data.

Versioning data

Datasets may change over time as new data is collected or errors are corrected; it is therefore necessary to version datasets. We do not recommend appending version labels to file names, as this raises many issues (e.g., do we append the updated version label to all files? Only the files that changed? Only the root directory?). Instead, we recommend using the data version control system DVC and encoding the version information in the metadata of the dataset. Indeed, schema.org and calver.org make it possible to encode such information, and popular data sharing platforms, such as Zenodo, display it.
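As an illustration, here is a minimal sketch of recording the version in the dataset’s metadata rather than in file names, using the version property of a schema.org Dataset description and a CalVer-style version string (the file name and field values are made up):

```python
# A minimal sketch: a schema.org Dataset description (JSON-LD) with a
# CalVer-style version string; file name and values are hypothetical.
import json

metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example response dataset",
    "version": "2024.03.1",   # CalVer: YYYY.MM.micro
    "description": "Tidy response tables collected within the project.",
}

with open("dataset_description.json", "w") as f:
    json.dump(metadata, f, indent=2)
```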

Tidy tables

“Tidy data allows one to start analyzing the data right away”

Tidy data paper, Hadley Wickham

There are many different ways of organizing data in tabular form. We recommend that data tables should be tidy as defined by the following criteria:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For more details, see Wickham (2014).
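As a small illustration of the third rule, the sketch below (with hypothetical column names) keeps subject-level attributes and trial-level responses, which are different observational units, in separate tables:

```python
# A minimal sketch of rule 3 ("each type of observational unit forms a table"):
# subject-level attributes and trial-level responses live in separate tables.
import pandas as pd

subjects = pd.DataFrame({
    "subject_id": ["s01", "s02"],
    "age": [24, 31],            # one row per subject
})

responses = pd.DataFrame({
    "subject_id": ["s01", "s01", "s02"],
    "trial_index": [1, 2, 1],   # one row per response (trial)
    "response_time": [0.512, 0.430, 0.611],
})

# Subject attributes are merged in only when an analysis needs them.
analysis_table = responses.merge(subjects, on="subject_id")
```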

Although these rules seem straightforward, there are use cases where it is not obvious what counts as tidy (questionnaire data, we believe, is such a use case; more on this later).

In addition to these tidy data rules, we recommend the following rules for ordering rows and columns within a table. When rows refer to events, they should be ordered chronologically (e.g., the first row of a response table would contain the first response of the first activity in the first session made by that subject). When observations refer not to events but to entities, they should be ordered either alphabetically by the name of the main entity (e.g., subject_index, instrument_name) or by the key measurement of the dataset (e.g., gdp in descending order in a gdp dataset).

Regarding the ordering of the columns, there is an order for categories of columns within the table, and an order for columns within a category. In general, for behavioral data, we keep on the left of the table the variables that are used for scoping (e.g., to find or filter particular observations; e.g., subject_index, session_index, trial_index, trial_datetime); next come variables describing a particular situation (e.g., stimulus_description, option_id), variables describing subjects’ responses (e.g., response_time), and variables describing the evaluation of the responses (this order may correspond to the order of events within a trial; see spec). Within these categories, variables should be grouped when they are semantically related, ordered from more abstract to more concrete, or in alphabetical order. For example, it would make sense to group all the variables that describe a block and to order them as block_index > block_type > block_name.
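The sketch below (with hypothetical file and column names) illustrates these ordering rules with pandas: rows sorted chronologically, and columns ordered by category with scoping variables first.

```python
# A minimal sketch of the recommended row and column ordering; the file
# and column names (including response_correct) are made up.
import pandas as pd

responses = pd.read_csv("responses.csv")  # hypothetical tidy response table

column_order = [
    # scoping variables
    "subject_index", "session_index", "trial_index", "trial_datetime",
    # situation
    "stimulus_description", "option_id",
    # response
    "response", "response_time",
    # evaluation
    "response_correct",
]

ordered = (
    responses
    .sort_values(["subject_index", "session_index", "trial_index"])  # chronological rows
    .loc[:, column_order]                                            # categorical column order
)
```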


Footnotes

  1. For more details on tidy data, see “Tidy Data”.