Data

Data can be described using terms that indicate how it has been “collected”, “stored”, “organized”, or “processed”. As data is the core of BDM, it is worth describing such terms quickly to make explicit what we mean when we use them.

Common terms

Process: Raw vs. Derived

Data that has not been processed is termed raw data, whereas data that is computed from other data (for example, summary statistics) is called derived data. For reproducible research, it is essential to share raw data along with the analysis code. This allows others to replicate all analyses within a research project from the raw data, rather than relying only on derived data.
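As a minimal sketch of the distinction, the snippet below computes derived summary statistics from a small set of raw response times. The values and variable names are hypothetical and only serve as illustration.

```python
from statistics import mean, median

# Hypothetical raw data: one response time (in ms) per trial, as recorded.
raw_response_times_ms = [512, 430, 618, 475, 390]

# Derived data: values computed from the raw data.
derived_summary = {
    "n_trials": len(raw_response_times_ms),
    "mean_ms": mean(raw_response_times_ms),
    "median_ms": median(raw_response_times_ms),
}
print(derived_summary)
```

Anyone holding the raw list can recompute the summary, but the summary alone cannot be used to recover the raw list, which is why sharing raw data matters for reproducibility.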

Format: Native vs. Tidy

Different systems may record data in a variety of forms (e.g., structured, unstructured) and file formats (e.g., CSV, JSON, MAT); this is data in its native, and commonly wide, format. Tidy data is organized in a long-format table where each row is an observational record and each column is a variable or property of that record. See the Tidy Data paper (Wickham, cited under “Tidy tables”) for more information. BDM recommends that data be stored in a tidy format.
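The following sketch illustrates the wide versus long distinction with plain Python. The column naming pattern (trial_<n>_rt) and the values are assumptions made for the example, not part of BDM.

```python
# Hypothetical wide-format records: one row per subject, one column per trial.
wide_rows = [
    {"subject_index": 1, "trial_1_rt": 0.51, "trial_2_rt": 0.43},
    {"subject_index": 2, "trial_1_rt": 0.62, "trial_2_rt": 0.48},
]

def wide_to_long(rows):
    """Turn each (subject, trial) cell into its own observation row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == "subject_index":
                continue
            # Column names look like "trial_<n>_rt"; recover the trial index.
            trial_index = int(column.split("_")[1])
            long_rows.append({
                "subject_index": row["subject_index"],
                "trial_index": trial_index,
                "response_time": value,
            })
    return long_rows

tidy_rows = wide_to_long(wide_rows)
for r in tidy_rows:
    print(r)
```

In the long table, the trial index becomes a variable in its own column instead of being hidden in column names.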

Storage: Datalake vs. Warehouse vs. Lakehouse

A data lake refers to an unstructured collection of data, typically from many different sources. On the one hand, the data lake can be seen as maximally flexible (accepting all sorts of data, including tables in different formats, images, and texts) and holding the potential for limitless data analysis; on the other hand, it can be seen as a chaotic data dump or swamp.

A data warehouse organizes the incoming data in a consistent way (schema-on-write) before storing or querying it. In most cases, data from warehouses can be readily used for downstream analysis. Data from data lakes, however, requires significant background knowledge and effort to process.

There is, however, a newer architecture that aims to combine the best of both worlds: the data lakehouse. A data lakehouse retains the flexibility of data lakes and adds an intermediate semi-structured language (along with schema-on-read) that gives quick and flexible access to the data in real time.
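The difference between schema-on-write and schema-on-read can be sketched in a few lines. This is a toy illustration under simplifying assumptions (in-memory lists stand in for storage, and SCHEMA, warehouse_write, lake_write and lake_read are hypothetical names), not a description of any real warehouse or lakehouse product.

```python
import json

# Hypothetical schema: field name -> type used to coerce values.
SCHEMA = {"subject_index": int, "response_time": float}

def apply_schema(record, schema=SCHEMA):
    """Coerce a record to the schema; raises if a field is missing or invalid."""
    return {name: cast(record[name]) for name, cast in schema.items()}

# Schema-on-write: validate before storing, so stored rows are always consistent.
warehouse = []
def warehouse_write(record):
    warehouse.append(apply_schema(record))

# Schema-on-read: store anything as-is, interpret it only when reading.
lake = []
def lake_write(payload):
    lake.append(payload)

def lake_read(schema=SCHEMA):
    rows = []
    for payload in lake:
        try:
            rows.append(apply_schema(json.loads(payload), schema))
        except (ValueError, KeyError, TypeError):
            continue  # payloads that do not fit the schema are skipped on read
    return rows

warehouse_write({"subject_index": "1", "response_time": "0.51"})
lake_write('{"subject_index": "2", "response_time": "0.43"}')
lake_write("free text note, not a record")
print(warehouse)
print(lake_read())
```

The lake accepts the free-text note without complaint, and the cost of interpreting it is deferred to whoever reads the data.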

The term data mart is sometimes used to refer to a subset or a view of the data warehouse that is optimized for specific data use and users.

Organize: Database vs. Dataset

A database usually refers to a collection of data that is managed by a database management system (DBMS). A database typically implies some form of interaction between the data and the end users who read/write data from or into the database.

A dataset also refers to a collection of data, but while a database is a “living” entity, a dataset is typically a static resource that can be downloaded and shared as a whole (e.g., the PISA 2012 dataset). A dataset can be defined by a data collection campaign (e.g., all the data collected within a research project) or a data analysis campaign (e.g., a dataset created from a larger medical database that focuses specifically on mental health issues during the COVID pandemic).

Confusing terms

It is sometimes easy to confuse the terms described above. For example, data in native format is often called raw data even though the data may in fact contain derived data (for example, the overall score a person achieved on a school test alongside all individual responses and their correctness). Data in its native format is often messy but could in some cases be tidy.

Note also that it is not always clear what is meant by “processing” and what still counts as “raw data”. In BDM, data is no longer considered “raw” once it has been processed in a way that reflects assumptions about the scientific values of the variables and the relationships between them. For example, filtering out trials with response times shorter than 150 milliseconds, because one believes that such short durations imply invalid responses, is a form of data processing that is “biased”; the resulting, filtered data can therefore no longer be considered “raw”. On the other hand, if for some reason a system logs the same events multiple times, removing duplicates is a pre-processing step that is not biased, and hence the resulting data should still be considered “raw” data.

In general, we believe that the following operations may be performed on original native-format data without compromising the “raw” status of the resulting data:

  • selection of variables
  • renaming variables and enum values
  • changes of units (for example, converting milliseconds to seconds)
  • reordering of columns and rows
  • removing duplicates
  • joining or splitting tables
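The operations listed above can be sketched as follows. The native column names (id, subj, rt_ms) and the renamed variables are hypothetical; the 150 ms filter is included only to contrast a raw-preserving step with one that produces derived data.

```python
# Hypothetical native-format log rows (the event with id 2 was logged twice).
native_rows = [
    {"id": 1, "subj": "s01", "rt_ms": 512},
    {"id": 2, "subj": "s01", "rt_ms": 120},
    {"id": 2, "subj": "s01", "rt_ms": 120},
    {"id": 3, "subj": "s02", "rt_ms": 430},
]

def to_raw(rows):
    """Apply only raw-preserving steps: dedupe, rename, convert units."""
    seen, raw = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        raw.append({
            "event_index": row["id"],              # renamed variable
            "subject_id": row["subj"],             # renamed variable
            "response_time": row["rt_ms"] / 1000,  # milliseconds -> seconds
        })
    return raw

raw_rows = to_raw(native_rows)

# By contrast, this filter encodes a belief about which responses are valid,
# so its output is derived data, not raw data.
derived_rows = [r for r in raw_rows if r["response_time"] >= 0.150]
print(len(raw_rows), len(derived_rows))
```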

Which data to store and which to share?

It is typically the case that more data is stored than shared. For example, researchers may store participants’ contact information or specific parameters of the hardware being used. Three questions then arise: what information should be collected initially, what subset of that data should be shared, and should the native-format data be sent along with the tidy data extracted from it?

Regarding what data to store, the answer depends on the purpose of the study. In any case, participants need to be informed about what data is being collected and the data collection must follow relevant regulations. It is generally well advised to be parsimonious regarding which personal data to collect (https://martinfowler.com/bliki/Datensparsamkeit.html).

In BDM, native data should not be shared if one has the ability to extract and share better formatted data. There are three main reasons not to share native data:

  • native data may not be usable: it can be messy, come in diverse, sometimes proprietary formats, and lack documentation.
  • native data may contain personal data the data sharer is not aware of (e.g., participants’ full names, IP addresses); by defining what data is to be shared and extracting only that data, it seems that such accidental privacy breaches could be avoided.
  • the extraction of data from the native data into usable data is outside the scope of responsibility of the data analyst.

This last point deserves further explanation. For research to be reproducible, one may want to test all the steps from data collection up to the final results. While all the steps might be tested, they are not necessarily tested by the same person. For instance, a software company may run tests to determine that the recorded timestamps are accurate, and a lab technician may run tests to calibrate the monitor and other hardware equipment. The question of what data to share is related to which quality assurance requirements the data analyst is expected to fulfill. If both the data in its native format and the data extracted from it are handed over to the data analyst, it becomes the data analyst’s responsibility to verify that the two sets of data agree. If there is an error in the pre-processing code, the data analyst becomes responsible because they had access to it and could thus have spotted and corrected the error. For that reason, in BDM, the preparation of usable data from the original data in native format is not the responsibility of the data analyst; it is the responsibility of the data engineer and the entity that is sharing the data. In any case, tests must still be conducted to verify the validity and accuracy of the process; put simply, this is not the role of the person receiving the data.
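One way to picture the “define what is shared, then extract only that” idea from the privacy point above is an explicit allowlist of variables. The sketch below is an assumption about how this could be done; SHARED_VARIABLES, extract_shared and the example values are hypothetical.

```python
# Hypothetical allowlist: the only variables that are meant to be shared.
SHARED_VARIABLES = ("subject_index", "trial_index", "response_time")

def extract_shared(native_rows, shared=SHARED_VARIABLES):
    """Keep only allowlisted variables; everything else is left behind."""
    return [{name: row[name] for name in shared} for row in native_rows]

native_rows = [
    {"subject_index": 1, "trial_index": 1, "response_time": 0.51,
     "full_name": "Jane Doe", "ip_address": "192.0.2.10"},
]
shared_rows = extract_shared(native_rows)
print(shared_rows)
```

Because the extraction names what to keep rather than what to drop, a personal field that nobody knew about in the native data never reaches the shared table.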

In short, “don’t share original unformatted data in its native format”; instead, share well-documented, well-formatted raw data.

Tidy tables

“Tidy data allows one to start analyzing the data right away”

Wickham, Hadley (20 February 2013). “Tidy Data”. Journal of Statistical Software.

There are many different ways of organizing data in tabular form. We recommend that data tables should be tidy as defined by the following criteria:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

For more details see Wickham (2014).

Although these rules seem straightforward, there are use cases where it is not obvious what counts as tidy (questionnaire data, we believe, is such a use case; more on this later).

In addition to these tidy data rules, we recommend the following rules for sorting rows and columns within a table. When rows refer to events, rows should be ordered in chronological order (e.g., the first row of a response table would indicate the first response of the first activity in the first session made by that subject). When observations do not refer to events but instead entities, the observations should be ordered either alphabetically by the names of the main entity (e.g., subject_index, instrument_name) or by the key measurement of the dataset (e.g., gdp in descending order in a gdp dataset).

Regarding the ordering of the columns, there is an order for categories of columns within the table, and an order of columns within a category. In general, for behavioral data, we keep to the left of the table the variables that are used for scoping, that is, to find or filter particular observations (e.g., subject_index, session_index, trial_index, trial_datetime). Next come variables describing a particular situation (e.g., stimulus_description, option_id), then variables describing subjects’ responses (e.g., response_time), and finally variables describing the evaluation of the responses (this order may correspond to the order of events within a trial; see spec). Within such categories, variables should be grouped when they are semantically related, from more abstract to more concrete, or in alphabetic order. For example, it would make sense to group all the variables that describe a block and to order them as block_index > block_type > block_name.
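The row and column ordering rules can be sketched as below. The column list is an assumption for the example (in particular, correct is a hypothetical evaluation variable), and sorting by the index variables stands in for chronological order on the assumption that indices increase over time.

```python
# Hypothetical column order: scoping -> situation -> response -> evaluation.
COLUMN_ORDER = [
    "subject_index", "session_index", "trial_index",  # scoping
    "stimulus_description",                           # situation
    "response_time",                                  # response
    "correct",                                        # evaluation
]

rows = [
    {"correct": True, "response_time": 0.43, "trial_index": 2, "subject_index": 1,
     "session_index": 1, "stimulus_description": "left arrow"},
    {"correct": False, "response_time": 0.51, "trial_index": 1, "subject_index": 1,
     "session_index": 1, "stimulus_description": "right arrow"},
]

def tidy_order(rows, column_order=COLUMN_ORDER):
    """Sort rows by the scoping variables and reorder the columns of each row."""
    ordered = sorted(
        rows,
        key=lambda r: (r["subject_index"], r["session_index"], r["trial_index"]),
    )
    return [{name: row[name] for name in column_order} for row in ordered]

ordered_rows = tidy_order(rows)
for r in ordered_rows:
    print(r)
```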

Files and folders

File formats

In general, open file formats (e.g., .csv, .json, .md) are preferred over proprietary file formats (e.g., .mat, .sav, .docx). Open formats ensure that the data is accessible to a wide audience. Specifically, use the following formats:

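| Data type       | Recommended format | When?                                                                                              |
|-----------------|--------------------|----------------------------------------------------------------------------------------------------|
| Tabular data    | .csv               | Always                                                                                             |
| Structured data | .json              | For machine-focused applications                                                                   |
| Structured data | .yml               | When human readability and editability matter                                                      |
| Text            | .md                | Always in Markdown, except for official, static, or administrative documents, which may be in PDF |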

To avoid redundancy and version mismatches, do NOT save the same data in multiple output formats (e.g., .csv and .sav). If data does have to be converted to another format, the conversion must be lossless.
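A simple way to check that a conversion is lossless is a round trip: write the data, read it back, and compare. The sketch below does this for CSV using only the standard library; the function names are hypothetical, and since CSV stores every value as text, the comparison is made on the string form of each value.

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def csv_to_rows(text):
    """Parse CSV text back into rows; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))

def is_lossless(rows, fieldnames):
    """Check that a CSV round trip preserves every value (compared as strings)."""
    expected = [{k: str(row[k]) for k in fieldnames} for row in rows]
    return csv_to_rows(rows_to_csv(rows, fieldnames)) == expected

rows = [{"subject_index": 1, "response_time": 0.512}]
print(is_lossless(rows, ["subject_index", "response_time"]))
```

Note that the check says nothing about data types: a round trip through CSV turns numbers into strings, so type information has to be documented elsewhere (for example in a codebook).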

Naming files and folders

We want file and folder names to be short and meaningful. Instead of encoding numerous pieces of information in the filename (e.g., the subject index, the session, the task) we prefer short names that describe only what type of data a given file contains, leaving it to the names of the parent directories that contain that file to encode its context. Conceptually, this is equivalent to calling a file README.md within a project folder rather than calling it project_subproject_README.md.

There are no widespread standards or style guides regarding how to name data files and folders. Some use lowercase with hyphens (e.g., mad-men and masculinity-survey in FiveThirtyEight), underscores (e.g., arxiv_qa), or a mixture (e.g., sub-03_task-rest_space-MNI152_bold.nii.gz in BIDS). Others use uppercase, either with hyphens, underscores, or camelCase/PascalCase (e.g., CIFAR10 in PyTorch, BIG5 in Open Psychometrics), or a mixture thereof (e.g., CoT-Collection). It is not uncommon to find organizations that mix different naming styles (e.g., Kaggle, HuggingFace, R). Furthermore, some people seem to distinguish the name of a dataset from the name of the corresponding file or folder (e.g., the “Fashion-MNIST” dataset is stored as “fashion-mnist”).

Perhaps the reason for this state of affairs is that a dataset is subject to multiple constraints:

  • When a dataset becomes a package or repository or module, the dataset must follow the corresponding conventions (e.g., fashion-mnist).
  • Datasets may contain files and folders that are subject to their own conventions, e.g., README.md file.
  • Data files may be related to variable names and to code, in which case they must form valid variable names (e.g., no spaces or hyphens).

Within BDM we use the following styling rules:

  • Data and filenames can only include lowercase letters, numbers, underscores, and hyphens (except for common files, e.g., README.md).
  • Dataset names follow the convention of repository names (i.e., no underscores).
  • Data files and code-related files have names that are valid variable names (e.g., under PEP8 in Python).
  • Hyphens and underscores are used for human readability, not to encode structural relationships (e.g., key-value pairs or entity_feature).
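The character-level rules above can be checked mechanically. The sketch below is one possible reading of them, not an official BDM validator: the regular expressions, the treatment of file extensions, and the function names are assumptions, and only README.md is taken from the text as a common file.

```python
import re

# Lowercase letters, digits, underscores and hyphens, plus optional extensions.
NAME_PATTERN = re.compile(r"[a-z0-9_-]+(\.[a-z0-9]+)*")

# Repository-style dataset names: same characters, but no underscores.
DATASET_PATTERN = re.compile(r"[a-z0-9-]+")

# Common files that keep their conventional names.
COMMON_FILES = {"README.md"}

def is_valid_name(name):
    """Return True if a file or folder name follows the character rule."""
    return name in COMMON_FILES or NAME_PATTERN.fullmatch(name) is not None

def is_valid_dataset_name(name):
    """Return True if a dataset name is lowercase and free of underscores."""
    return DATASET_PATTERN.fullmatch(name) is not None

for candidate in ["response.csv", "subject_001", "README.md", "CIFAR10"]:
    print(candidate, is_valid_name(candidate))
```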

We recommend using simple, short but clear names for data files so they are easier to type and can be fully displayed on screen. The location of a file provides additional context. For example, dataset/data/subject_001/session_1/response.csv is preferred over dataset/data/subject_001/subject-001_session-1_response.csv, as it is shorter and yet contains the same information.
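To illustrate that the folder path carries the context a long filename would otherwise encode, the sketch below recovers the subject and session from the location of a file. It assumes folders are named as in the example above (an entity name followed by a number); context_from_path is a hypothetical helper.

```python
from pathlib import PurePosixPath

def context_from_path(path):
    """Recover scoping variables from folder names such as subject_001/session_1."""
    context = {}
    for part in PurePosixPath(path).parent.parts:
        name, sep, value = part.rpartition("_")
        # Only folders of the form <name>_<digits> contribute to the context.
        if sep and value.isdigit():
            context[f"{name}_index"] = int(value)
    return context

print(context_from_path("dataset/data/subject_001/session_1/response.csv"))
```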

We also do not recommend including data in the filenames (e.g., sub-03_task-rest_space-MNI152_bold.nii.gz, 2019_) as this data should be either accessible in the data or obvious from the context.

Here are a few examples of file and folder names that follow these rules:

  • masculinity-survey/
  • big-5/
  • response.csv
  • subject_001/

Naming datasets

There are four main strategies to name datasets; they focus respectively on:

  • the project name (e.g., the ABCD (Adolescent Brain Cognitive Development) dataset);
  • the name of the data provider (e.g., CIFAR-10 from the Canadian Institute For Advanced Research);
  • the data content (e.g., election-forecasts-2022);
  • a related publication (e.g., the Harman74.cor dataset refers to the publication “Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 7.4.”).

For simple datasets that focus on a specific type of observation (e.g., air quality data), it makes sense to use content-based names (e.g., air-quality). In such cases, one should use meaningful and explicit names. However, for more complex datasets such a naming scheme is typically not possible.

In general, we recommend that datasets be published (e.g., on Zenodo) and named after the last name of the first author and the year of publication (e.g., steyvers2020).

For bigger projects and datasets, it makes sense however to use the project name instead (e.g., abcd2021).
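Both naming schemes (author plus year, project plus year) amount to the same small transformation, sketched below. The helper dataset_name and its accent-stripping behavior are assumptions made for the example.

```python
import re
import unicodedata

def dataset_name(label, year):
    """Build a name such as 'steyvers2020' from an author or project label and a year."""
    # Strip accents, then keep only lowercase letters and digits.
    ascii_label = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
    slug = re.sub(r"[^a-z0-9]", "", ascii_label.lower())
    if not slug:
        raise ValueError("label must contain at least one letter or digit")
    return f"{slug}{int(year)}"

print(dataset_name("Steyvers", 2020))
print(dataset_name("ABCD", 2021))
```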

We do not recommend using meaningless strings (e.g., g9zkf) or names that are not specific (e.g., titanic, sleep, dolphin) beyond simple toy datasets.

We do not recommend including the version in the dataset name (see below).

Versioning data

Datasets may change over time as new data is collected or errors are corrected; it is thus necessary to version datasets. We do not recommend appending version labels to file names as this raises many issues (e.g., do we append the updated version label to all files? Only the files that changed? Only the root directory?). Instead, we recommend using the data version control system DVC and encoding the version information in the metadata of the dataset. Indeed, schema.org and calver.org make it possible to encode such information, and popular data sharing platforms, such as Zenodo, display that information.
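As a minimal sketch of keeping the version in the metadata rather than in file names, the snippet below builds a small schema.org Dataset description with a calendar-based version. The YYYY.0M.0D scheme is only one of the formats described by CalVer, and the dataset name, release date and helper functions are hypothetical.

```python
import json
from datetime import date

def calver(release_date):
    """Format a date as a YYYY.0M.0D calendar version (one possible CalVer scheme)."""
    return f"{release_date.year}.{release_date.month:02d}.{release_date.day:02d}"

def dataset_metadata(name, release_date):
    """Build a minimal schema.org Dataset description carrying the version."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "version": calver(release_date),
        "dateModified": release_date.isoformat(),
    }

# Hypothetical dataset name and release date, for illustration only.
metadata = dataset_metadata("steyvers2020", date(2020, 3, 15))
print(json.dumps(metadata, indent=2))
```

With this approach the file and folder names stay stable across releases, and only the metadata record changes.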