# Data
Data can be described using terms that indicate how it has been "collected", "stored", "organized", or "processed". As data is the core of BDM, it is worth briefly defining these terms to make explicit what we mean when we use them.
## Common terms
### Process: Raw vs. Derived
Data that has not been processed is termed raw data; data that is computed from other data, such as summary statistics, is called derived data. For reproducible research, it is essential to share raw data along with the analysis code. This allows others to replicate all analyses within a research project from the raw data, rather than relying only on derived data.
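As a minimal illustration (assuming pandas; the file and column names here are hypothetical), deriving summary statistics from raw trial-level data might look like this:

```python
import pandas as pd

# Raw data: one row per trial, exactly as collected.
raw = pd.read_csv("response.csv")  # columns: subject_index, response_time, correct

# Derived data: summary statistics computed from the raw data.
derived = raw.groupby("subject_index").agg(
    mean_response_time=("response_time", "mean"),
    accuracy=("correct", "mean"),
)
derived.to_csv("summary.csv")
```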
### Format: Native vs. Tidy
Different systems may record data in a variety of forms (e.g., structured, unstructured) and file formats (e.g., CSV, JSON, MAT); this is data in its native, and commonly wide, format. Tidy data is organized in a long-format table where each row is an observational record and each column is a variable or property of that record. See the Tidy Data paper (Wickham, 2014) for more information. BDM recommends that data be stored in a tidy format.
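For instance, here is a sketch (using pandas; the column names are hypothetical) of reshaping a wide, native-format table into a tidy, long-format table:

```python
import pandas as pd

# Native (wide) format: one row per subject, one column per trial.
wide = pd.DataFrame({
    "subject_index": [1, 2],
    "trial_1": [0.52, 0.61],
    "trial_2": [0.48, 0.66],
})

# Tidy (long) format: one row per observation (subject x trial),
# one column per variable.
tidy = wide.melt(
    id_vars="subject_index",
    var_name="trial_index",
    value_name="response_time",
)
```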
### Storage: Data Lake vs. Warehouse vs. Lakehouse
A data lake refers to an unstructured collection of data, typically from many different sources. On the one hand, the data lake can be seen as maximally flexible (accepting all sorts of data, including tables in different formats, images, and texts) and holding the potential for limitless data analysis; on the other hand, it can be seen as a chaotic data dump or swamp.
A data warehouse organizes the incoming data in a consistent way (schema-on-write) before storing or querying it. In most cases, data from warehouses can be readily used for downstream analysis. Data from data lakes, however, requires significant background knowledge and effort to be processed.
A newer architecture combines the best of both worlds: the data lakehouse. A data lakehouse retains the flexibility of data lakes while adding an intermediate semi-structured layer (with schema-on-read) that offers quick and flexible access to the data in real time.
### Organize: Database vs. Dataset
A database usually refers to a collection of data that is managed by a database management system (DBMS). A database typically implies some form of interaction between the data and the end users, who read data from or write data into the database.
A dataset also refers to a collection of data, but while a database is a “living” entity, a dataset is typically a static resource that can be downloaded and shared as a whole (e.g., the PISA 2012 dataset). A dataset can be defined by a data collection campaign (e.g., all the data collected within a research project) or a data analysis campaign (e.g., a dataset created from a larger medical database that focuses specifically on mental health issues during the COVID pandemic).
## Confusing terms
It is sometimes easy to confuse the terms described above. For example, data in native format is often called raw data even though the data may in fact contain derived data (for example, the overall score a person achieved on a school test alongside all individual responses and their correctness). Data in its native format is often messy but could in some cases be tidy.
Note also that it is not always clear what is meant by "processing" and what still counts as "raw data". In BDM, data is no longer considered "raw" when it has been processed in a way that reflects assumptions about the scientific values of the variables and the relationships between them. For example, filtering out trials with response times shorter than 150 milliseconds, because one believes that such short durations imply those responses are not valid, is a form of data processing that is "biased"; the resulting, filtered data can thus no longer be considered "raw". On the other hand, if, for some reason, a system logs the same events multiple times, removing duplicates is a pre-processing step that is not biased; the resulting, filtered data should still be considered "raw" data.
In general, we believe that the following operations may be performed on original native-format data without compromising the "raw" status of the resulting data (see the sketch after this list):

- selection of variables
- renaming variables and enum values
- changes of units (for example, converting milliseconds to seconds)
- reordering of columns and rows
- removing duplicates
- joining or splitting tables
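Here is a sketch of these raw-preserving operations in pandas (the file and column names are hypothetical):

```python
import pandas as pd

raw = pd.read_csv("events.csv")

# Selection of variables.
raw = raw[["subject_index", "trial_index", "rt_ms"]]

# Renaming a variable and changing its units (milliseconds to seconds).
raw["response_time"] = raw["rt_ms"] / 1000
raw = raw.drop(columns="rt_ms")

# Removing duplicates (e.g., events logged multiple times).
raw = raw.drop_duplicates()

# Reordering rows and columns.
raw = raw.sort_values(["subject_index", "trial_index"])
raw = raw[["subject_index", "trial_index", "response_time"]]
```

None of these steps encodes assumptions about the scientific meaning of the values, so the result can still be shared as raw data.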
## Tidy tables
> "Tidy data allows one to start analyzing the data right away"
>
> Wickham, Hadley (2014). "Tidy Data". Journal of Statistical Software, 59(10).
There are many different ways of organizing data in tabular form. We recommend that data tables should be tidy as defined by the following criteria:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
For more details, see Wickham (2014).
Although these rules seem straightforward, there are use cases where it is not obvious what counts as tidy (questionnaire data, we believe, is such a use case; more on this later).
In addition to these tidy data rules, we recommend the following rules for sorting rows and columns within a table. When rows refer to events, rows should be ordered chronologically (e.g., the first row of a response table would indicate the first response of the first activity in the first session made by that subject). When observations refer not to events but to entities, the observations should be ordered either alphabetically by the names of the main entity (e.g., `subject_index`, `instrument_name`) or by the key measurement of the dataset (e.g., `gdp` in descending order in a GDP dataset).
Regarding the ordering of the columns, there is an order for categories of columns within the table, and an order of columns within a category. In general, for behavioral data, we keep to the left of the table the variables that are used for scoping (i.e., to find or filter particular observations; e.g., `subject_index`, `session_index`, `trial_index`, `trial_datetime`); next come variables describing a particular situation (e.g., `stimulus_description`, `option_id`), variables describing subjects' responses (e.g., `response_time`), and variables describing the evaluation of the responses (this order may correspond to the order of events within a trial; see spec). Within these categories, variables should be grouped when they are semantically related, from more abstract to more concrete, or in alphabetical order. For example, it would make sense to group all the variables that describe a block and to order them as `block_index` > `block_type` > `block_name`.
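As an illustration (with hypothetical column names), sorting rows chronologically and ordering columns by category might look like this:

```python
import pandas as pd

responses = pd.read_csv("response.csv")

# Rows: chronological order (subject, then session, then trial).
responses = responses.sort_values(["subject_index", "session_index", "trial_index"])

# Columns: scoping variables first, then situation, response, and evaluation.
column_order = [
    "subject_index", "session_index", "trial_index", "trial_datetime",  # scoping
    "stimulus_description", "option_id",                                # situation
    "response", "response_time",                                        # response
    "correct",                                                          # evaluation
]
responses = responses[column_order]
```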
## Files and folders
### File formats
In general, open file formats (e.g., `.csv`, `.json`, `.md`) are preferred over proprietary file formats (e.g., `.mat`, `.sav`, `.docx`). Open formats ensure that the data is accessible to a wide audience. Specifically, use the following formats:
| Data Type | Recommended Format | When? |
|---|---|---|
| Tabular data | `.csv` | Always |
| Structured data | `.json` | For machine-focused applications |
| | `.yml` | When human readability and editability matter |
| Text | `.md` | Always in Markdown, except for official, static, or administrative documents, which may be in PDF |
Do NOT save the same data in multiple output formats (e.g., `.csv` and `.sav`), to avoid redundancy and version mismatches. If data must be converted to another format, however, use a lossless conversion (see the sketch below).
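For example, a lossless conversion from `.sav` to `.csv` could be sketched as follows (assuming the third-party pyreadstat package; the file names are hypothetical):

```python
import pyreadstat

# Read the proprietary SPSS file (returns the data and its metadata).
df, meta = pyreadstat.read_sav("survey.sav")

# Write the open-format copy and use it as the single output format.
df.to_csv("survey.csv", index=False)
```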
### Naming files and folders
We want file and folder names to be short and meaningful. Instead of encoding numerous pieces of information in the filename (e.g., the subject index, the session, the task), we prefer short names that describe only what type of data a given file contains, leaving it to the names of the parent directories that contain that file to encode its context. Conceptually, this is equivalent to calling a file `README.md` within a project folder rather than calling it `project_subproject_README.md`.
There are no widespread standards or style guides for naming data files and folders. Some use lowercase with hyphens (e.g., `mad-men` and `masculinity-survey` in FiveThirtyEight) or underscores (e.g., `arxiv_qa`) or a mixture (e.g., `sub-03_task-rest_space-MNI152_bold.nii.gz` in BIDS). Others use uppercase, either with hyphens, underscores, or camelCase/PascalCase (e.g., `CIFAR10` in PyTorch, `BIG5` in Open Psychometrics), or a mixture thereof (e.g., `CoT-Collection`). It is not uncommon to find organizations that mix different naming styles (e.g., Kaggle, HuggingFace, R). Furthermore, some people seem to distinguish the name of a dataset from the name of the corresponding file or folder (e.g., the "Fashion-MNIST" dataset is stored as "fashion-mnist").
Perhaps the reason for this state of affairs is that a dataset is subject to multiple constraints:
- When a dataset becomes a package, repository, or module, it must follow the corresponding conventions (e.g., `fashion-mnist`).
- Datasets may contain files and folders that are subject to their own conventions (e.g., a `README.md` file).
- Data files may be related to variable names and to code, in which case their names must form valid variable names (e.g., no spaces or hyphens).
Within BDM, we use the following styling rules (a validation sketch follows the list):
- Data file and folder names can only include lowercase letters, numbers, underscores, and hyphens (except for common files, e.g., `README.md`).
- Dataset names follow the convention of repository names (i.e., no underscores).
- Data files and code-related files have names that are valid variable names (e.g., under PEP8 in Python).
- Hyphens and underscores are used for human readability, not to encode structural relationships (e.g., key-value pairs or entity_feature).
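A minimal sketch of how these rules could be checked (the regular expression and the helper function are our own illustration, not part of BDM):

```python
import re

# Lowercase letters, digits, underscores, and hyphens only.
NAME_PATTERN = re.compile(r"^[a-z0-9_-]+$")
COMMON_FILES = {"README.md", "LICENSE", "CHANGELOG.md"}  # assumed exceptions

def is_valid_name(name: str) -> bool:
    """Check a file or folder name against the BDM styling rules."""
    if name in COMMON_FILES:
        return True
    stem = name.split(".", 1)[0]  # ignore the file extension
    return bool(NAME_PATTERN.match(stem))

assert is_valid_name("response.csv")
assert is_valid_name("subject_001")
assert is_valid_name("big-5")
assert not is_valid_name("Subject 001")
```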
We recommend using simple, short, but clear names for data files so they are easier to type and can be fully displayed on screen. The location of a file provides additional context. For example, `dataset/data/subject_001/session_1/response.csv` is preferred over `dataset/data/subject_001/subject-001_session-1_response.csv` as it is shorter and yet contains the same information.
We also do not recommend including data in filenames (e.g., `sub-03_task-rest_space-MNI152_bold.nii.gz`, `2019_`), as this information should either be accessible in the data or obvious from the context.
Here are a few examples of file and folder names that follow these rules:
- `masculinity-survey/`
- `big-5/`
- `response.csv`
- `subject_001/`
### Naming datasets
There are four main strategies to name datasets; they focus respectively on:

- the project name (e.g., the ABCD (Adolescent Brain Cognitive Development) dataset);
- the name of the data provider (e.g., `CIFAR-10` from the Canadian Institute For Advanced Research);
- the data content (e.g., `election-forecasts-2022`);
- a related publication (e.g., the `Harman74.cor` dataset refers to the publication "Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 7.4.").
For simple datasets that focus on a specific type of observation (e.g., air quality data), it makes sense to use content-based names (e.g., `air-quality`). In such cases, one should use meaningful and explicit names. However, for more complex datasets, such a naming scheme is typically not possible.
In general, we recommend publishing datasets (e.g., on Zenodo) and naming them after the last name of the first author and the date of the publication (e.g., `steyvers2020`).
For bigger projects and datasets, however, it makes sense to use the project name instead (e.g., `abcd2021`).
We do not recommend using meaningless strings (e.g., `g9zkf`) or names that are not specific (e.g., `titanic`, `sleep`, `dolphin`) beyond simple toy datasets.
We do not recommend including the version in the dataset name (see below).
## Versioning data
Datasets may change over time as new data is collected or errors are corrected; it is thus necessary to version datasets. We do not recommend appending version labels to file names, as this raises many issues (e.g., do we append the updated version label to all files? Only the files that changed? Only the root directory?). Instead, we recommend using the data version control system DVC and encoding the version information in the metadata of the dataset. Indeed, schema.org and calver.org provide conventions for encoding such information, and popular data sharing platforms, such as Zenodo, display it.
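As a sketch, the version could be recorded in a schema.org-style metadata file rather than in file names (the file name, version string, and field values below are hypothetical):

```python
import json

# Dataset-level metadata; "version" here follows CalVer (see calver.org)
# and the field names follow the schema.org Dataset vocabulary.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "steyvers2020",
    "version": "2020.06.1",
}

with open("dataset_description.json", "w") as f:
    json.dump(metadata, f, indent=2)
```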