Files and folders

Formatting, naming, and storing data

File formats

In general, open file formats (e.g., .csv, .json, .md) are preferred over proprietary file formats (e.g., .mat, .sav, .docx). Open formats ensure that the data is accessible to a wide audience. Specifically, use the following formats:

Data Type Recommended Format When?
Tabular data .csv Always
Structured data .json For machine-focused applications
.yml When human readability and editability matter
Text .md Always in Markdown, except for official, static, or administrative documents which may be in PDF

Do NOT save the same data in multiple output formats (e.g., .csv and .sav) to avoid redundancy and version mismatches. However, lossless data conversion must be used if converting data to other formats.

Naming files and folders

We want file and folder names to be short and meaningful. Instead of encoding numerous pieces of information in the filename (e.g., the subject index, the session, the task) we prefer short names that describe only what type of data a given file contains, leaving it to the names of the parent directories that contain that file to encode its context. Conceptually, this is equivalent to calling a file README.md within a project folder rather than calling it project_subproject_README.md.

There are no widespread standards or styleguides regarding how to name data files and folders. Some use lower case with hypens (e.g., the uses mad-men and masculinity-survey in FiveThirtyEight) or underscores (e.g., arxiv_qa) or a mixture (e.g., sub-03_task-rest_space-MNI152_bold.nii.gz in BIDS). Others use uppercase, either with hyphens, underscores or camelCase/PascalCase (e.g, CIFAR10 in PyTorch, BIG5 in Open Psychometrics)or a mixture thereof (e.g., CoT-Collection). It is not uncommon to find organizations that mix different naming styles (e.g., Kaggle, HuggingFace, R). Furthermore, some people seem to distinguish the name of a dataset from the name of the corresponding file or folder (e.g., the “Fashion-MNIST” dataset is stored as “fashion-mnist”).

Perhaps the reason for this state of affairs is that a dataset is subject to multiple constraints:

  • When a dataset becomes a package or repository or module, the dataset must follow the corresponding conventions (e.g., fashion-mnist).
  • Datasets may contain files and folders that are subject to their own conventions, e.g., README.md file.
  • Data files may be related to variable names and to code; in which case they must form valid variable names, e.g., no spaces or hypens.

Within BDM we use following styling rules:

  • Data and filenames can only include lowercase letters, numbers, underscores, and hyphens (except for common files, e.g., README.md).
  • Dataset names follow the convention of repository names (i.e., no underscores).
  • Data files and code-related files have names that are valid variable names (e.g., under PEP8 in Python).
  • Hyphens and underscores are used for human readability, not to encode structural relationships (e.g., key-value pairs or entity_feature).

We recommend using simple, short but clear names for data files so they are easier to type and can be fully displayed on screen. The location of a file provides additional context. For example dataset/data/subject_001/session_1/response.csv is preferred over dataset/data/subject_001/subject-001_session-1_response.csv as it is shorter and yet contains the same information.

We also do not recommend including data in the filenames (e.g., sub-03_task-rest_space-MNI152_bold.nii.gz, 2019_) as this data should be either accessible in the data or obvious from the context.

Here are a few examples of file and folder names that follow these rules:

  • masculinity-survey/
  • big-5/
  • response.csv
  • subject_001/

Naming datasets

There are four main strategies to name datasets; they focus respectively on:

  • The project name (e.g., the ABCD (Adolescent Brain Cognitive Development) dataset);
  • the name of the data provider (e.g., CIFAR-10 from the Canadian Institute For Advanced Research);
  • the data content (e.g., election-forecasts-2022);
  • a related publication (e.g., the Harmann74.cor dataset refers to the publication “Harman, H. H. (1976) Modern Factor Analysis, Third Edition Revised, University of Chicago Press, Table 7.4.”).

For simple datasets that focus on a specific type of observation (e.g., air quality data) it makes sense to use content-based filenames (e.g., air-quality). In such cases, one should use meanifngful and explicit names. However, for more complex datasets such a naming scheme is typically not possible.

In general we recommend that datasets be published (e.g., on Zenodo) and to name the corresponding datasets by referring to the last name of the first author and the date of the publication (e.g., steyvers2020).

For bigger projects and datasets, it makes sense however to use the project name instead (e.g., abcd2021).

We do not recommend using meaningless strings (e.g., g9zkf) or names that are not specific (e.g., titanic, sleep, dolphin) beyond simple toy datasets.

We do not recommend including the version in the dataset name (see below.)

Back to top