Data
Common data terms
Data can be described using terms that indicate how it has been “collected”, “stored”, “organized”, or “processed”. As data is at the core of BDM, it is worth briefly defining these terms to make explicit what they mean in the context of BDM.
Common terms
Process: Raw vs. Derived
Data that has not been processed is called raw data. Data that is computed from other data, for example summary statistics, is called derived data.
For reproducible research, it is essential to share raw data along with the analysis code. This allows others to replicate all analyses within a research project from the raw data, rather than relying only on partial derived data.
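As a minimal illustration (using pandas, which BDM does not prescribe; the table and column names are hypothetical), derived data such as per-subject summary statistics can be computed from a raw trial-level table:

```python
import pandas as pd

# Hypothetical raw data: one row per trial (names illustrative only).
raw = pd.DataFrame({
    "subject_index": [1, 1, 2, 2],
    "trial_index": [1, 2, 1, 2],
    "response_time": [0.42, 0.57, 0.61, 0.49],  # seconds
})

# Derived data: per-subject summary statistics computed from the raw table.
derived = (
    raw.groupby("subject_index")["response_time"]
       .agg(["mean", "std"])
       .reset_index()
)
print(derived)
```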
Format: Native vs. Tidy
Different systems may record data in a variety of forms (e.g., structured, unstructured) and file formats (e.g., CSV, JSON, MAT); this is data in its native format. Tidy data is organized in a long-format table where each row is an observational record and each column is a variable or property of that record¹. BDM recommends storing data in a tidy format.
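As an illustrative sketch (pandas assumed; the data and column names are hypothetical), converting a wide, native-format table into a tidy long-format table could look like this:

```python
import pandas as pd

# Hypothetical data in its "native" wide format: one column per question.
native = pd.DataFrame({
    "subject_index": [1, 2],
    "q1": [3, 5],
    "q2": [4, 2],
})

# Tidy long format: one row per observation (subject x question),
# one column per variable.
tidy = native.melt(
    id_vars="subject_index",
    var_name="question_id",
    value_name="response",
).sort_values(["subject_index", "question_id"])
print(tidy)
```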
Storage: Data Lake vs. Warehouse vs. Lakehouse
A data lake refers to an unstructured collection of data, typically from many different sources. On the one hand, the data lake can be seen as maximally flexible (accepting all sorts of data, including tables in different formats, images, and texts) and holding the potential for limitless data analysis; on the other hand, it can be seen as a chaotic data dump or swamp.
A data warehouse organizes the incoming data in a consistent way (schema-on-write) before storing or querying it. In most cases, data from warehouses can be readily used for downstream analysis. Data from data lakes, however, requires significant background knowledge and effort to process.
There is, however, a new architecture that combines the best of both worlds: the data lakehouse. A data lakehouse benefits from the flexibility of data lakes and adds an intermediate semi-structured language (along with schema-on-read) that provides quick and flexible access to the data in real time.
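The contrast between schema-on-write and schema-on-read can be sketched as follows (a deliberately minimal Python illustration, not a real warehouse or lakehouse; file and field names are assumptions). A warehouse would validate records against a fixed schema before storing them, whereas a lake or lakehouse stores them as-is and applies the schema when the data is read:

```python
import json
import pandas as pd

# Hypothetical records arriving from different sources (names illustrative).
records = [
    {"subject_index": 1, "response_time": "0.42"},
    {"subject_index": 2, "response_time": 0.57, "extra_field": "ignored"},
]

# A warehouse would enforce the schema *before* writing (schema-on-write).
# A lake or lakehouse instead stores the records as-is...
with open("landing_zone.json", "w") as f:
    json.dump(records, f)

# ...and applies the schema only when the data is read (schema-on-read).
schema = {"subject_index": "int64", "response_time": "float64"}
with open("landing_zone.json") as f:
    table = pd.DataFrame(json.load(f))[list(schema)].astype(schema)
print(table.dtypes)
```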
Organize: Database vs. Dataset
A database usually refers to a collection of data that is managed by a database management system (DBMS). A database typically implies some form of interaction between the data and the end users who read/write data from or into the database.
A dataset also refers to a collection of data, but while a database is a “living” entity, a dataset is typically a static resource that can be downloaded and shared as a whole (e.g., the PISA 2012 dataset). A dataset can be defined by a data collection campaign (e.g., all the data collected within a research project) or a data analysis campaign (e.g., a dataset created from a larger medical database that focuses specifically on mental health issues during the COVID pandemic).
Confusing terms
It is sometimes easy to confuse the terms described above. For example, data in native format is often called raw data even though the data may in fact contain derived data (for example, the overall score a person achieved on a school test alongside all individual responses and their correctness). Data in its native format is often messy but could in some cases be tidy.
Note also that it is not always clear what is meant by “processing” and what still counts as “raw data”. In BDM, data is no longer considered “raw” when it has been processed in a way that reflects assumptions about the scientific values of the variables and the relationships between them. For example, filtering out trials with response times shorter than 150 milliseconds because one believes that such short durations imply the responses are not valid is a form of data processing which is “biased”, and thus the resulting, filtered data can no longer be considered “raw”. On the other hand, if for some reason a system logs the same events multiple times, removing duplicates is a pre-processing step that is not biased and hence the resulting, filtered data should still be considered “raw” data.
In general, the following operations may be performed on original native-format data without compromising the “raw” status of the resulting data (see the sketch after the list):
- selection of variables
- renaming variables and enum values
- changes of units (for example, converting milliseconds to seconds)
- reordering of columns and rows
- removing duplicates
- joining or splitting tables
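A minimal sketch of such raw-preserving operations (pandas assumed; the native column names and the logging glitch are hypothetical):

```python
import pandas as pd

# Hypothetical native-format export (column names are illustrative).
native = pd.DataFrame({
    "SubjID": [1, 1, 2],
    "RT_ms": [412, 412, 598],   # the first row was logged twice by mistake
    "Resp": ["A", "A", "B"],
})

raw = (
    native
    .rename(columns={"SubjID": "subject_index",   # rename variables
                     "RT_ms": "response_time",
                     "Resp": "response"})
    .drop_duplicates()                             # remove duplicated log entries
    .assign(response_time=lambda d: d["response_time"] / 1000)    # ms -> s
    .loc[:, ["subject_index", "response_time", "response"]]       # select/reorder columns
)

# By contrast, dropping rows with response_time < 0.150 on scientific grounds
# would make the result derived rather than raw.
print(raw)
```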
Versioning data
Datasets may change over time as new data is collected or errors are corrected; it is thus necessary to version datasets. We do not recommend appending version labels to file names, as this raises many issues (e.g., do we append the updated version label to all files? Only the files that changed? Only the root directory?). Instead, we recommend using the data version control system DVC and encoding the version information in the metadata of the dataset. Indeed, schema.org and calver.org make it possible to encode such information, and popular data sharing platforms, such as Zenodo, display it.
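For illustration only (the file name and field choices are assumptions, not part of the BDM spec), dataset-level metadata following schema.org’s Dataset type could carry a CalVer-style version string:

```python
import json

# Illustrative dataset-level metadata: schema.org Dataset fields with a
# CalVer-style version string (file name and values are hypothetical).
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "example-bdm-dataset",
    "version": "2024.03.1",
    "description": "Responses collected in the example study.",
}

with open("dataset_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```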
Tidy tables
“Tidy data allows one to start analyzing the data right away”
Tidy data paper, Hadley Wickham
There are many different ways of organizing data in tabular form. We recommend that data tables should be tidy as defined by the following criteria:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
For more details see Wickham (2014).
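To illustrate the third rule (a sketch with pandas and hypothetical column names), a table that mixes subject-level and response-level information can be split into one table per observational unit:

```python
import pandas as pd

# Hypothetical table mixing two observational units: subjects and responses.
mixed = pd.DataFrame({
    "subject_index": [1, 1, 2],
    "subject_age": [25, 25, 31],      # subject-level information repeated per row
    "trial_index": [1, 2, 1],
    "response": ["A", "B", "A"],
})

# One table per observational unit (third tidy rule).
subjects = mixed[["subject_index", "subject_age"]].drop_duplicates()
responses = mixed[["subject_index", "trial_index", "response"]]
print(subjects)
print(responses)
```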
Although these rules seem straightforward, there are use cases where it is not obvious what counts as tidy (questionnaire data, we believe, is such a use case; more on this later).
In addition to these tidy data rules, we recommend the following rules for sorting rows and columns within a table. When rows refer to events, rows should be ordered chronologically (e.g., the first row of a subject’s response table would indicate the first response of the first activity in the first session). When observations refer not to events but to entities, they should be ordered either alphabetically by the name of the main entity (e.g., subject_index, instrument_name) or by the key measurement of the dataset (e.g., gdp in descending order in a gdp dataset).
Regarding the ordering of the columns, there is an order for categories of columns within the table, and an order of columns within a category. In general, for behavioral data, we keep to the left of the table the variables that are used for scoping (e.g., to find or filter particular observations; e.g., subject_index, session_index, trial_index, trial_datetime); next come variables describing a particular situation (e.g., stimulus_description, option_id), variables describing subjects’ responses (e.g., response_time), and variables describing the evaluation of the responses (this order may correspond to the order of events within a trial; see spec). Within such categories, variables should be grouped when they are semantically related, ordered from more abstract to more concrete, or in alphabetical order. For example, it would make sense to group all the variables that describe a block and to order them as block_index > block_type > block_name.
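A small sketch of these ordering recommendations (pandas assumed; the column names and category boundaries are illustrative, not the full spec):

```python
import pandas as pd

# Hypothetical response table with columns in arbitrary order.
responses = pd.DataFrame({
    "response_time": [0.61, 0.42, 0.57],
    "stimulus_description": ["circle", "square", "triangle"],
    "trial_index": [1, 1, 2],
    "subject_index": [2, 1, 1],
    "response_correct": [True, True, False],
})

# Columns: scoping variables first, then situation, response, evaluation.
column_order = ["subject_index", "trial_index",
                "stimulus_description", "response_time", "response_correct"]

# Rows: chronological order within each subject.
ordered = (
    responses[column_order]
    .sort_values(["subject_index", "trial_index"])
    .reset_index(drop=True)
)
print(ordered)
```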
Footnotes
For more details on tidy data, see “Tidy Data”.