Metadata

Metadata provides information about a data record, in our case, a dataset. It can serve many functions, most important of which is dataset discovery beyond the scope of the repository where they have been published. There are numerous standards for such metadata which in practice may not be aimed for the same audience: some standards are destined to be machine-readable (e.g., for search engines) while other ones are designed for human readability.

BDM relies on schema.org and its Croissant extension to describe the data, as they are both extensive and used by major search interfaces like Google Dataset Search and HuggingFace datasets.

Where is the metadata?

Metadata may be embedded in a file (e.g., in an markdown front matter) or it may be stored as a separate file within the dataset (e.g., dataset_description.json as in BIDS). As these strategies are not always consistent in practice, it is important here to consider who (or what) the metadata is destined for, humans or machines.

Long, nested JSON files are notoriously hard to read and edit by humans, so exposing them directly to consumers is not recommended. Instead, all the relevant metadata can be presented within the datasets’ README file in a human-readable way (see Dataset Card). The structured YAML header of the README file can be displayed in a human-readable way, while still allowing its processing by machines.

It is also recommended to adopt a minimal approach to the content of the metadata and to only fill out key information that may be useful for locating a particular dataset and determining if the data is of relevance to the dataset consumer. More detailed information about the structure of data can be included in the dataset itself (see Codebook).

Zenodo

For sharing metadata along scientific datasets, you can use Zenodo platform. When uploading, Zenodo offers a form for describing the datasets and. This form populates a schema.org JSON file that can be downloaded by the user, but most importantly, can be processed by search engines that supports schema.org and help people find the dataset.

Also note that, when we use Zenodo, the metadata is stored in a file that is separate from the dataset. This is a good practice as it allows search engines to index the metadata separately from the dataset (the metadata is external to the dataset), and it also allows the metadata to be updated without changing the dataset itself.