Metadata
Metadata provides information about a data record, in our case, a dataset. Metadata can serve many functions, most important of which are to facilitate data discovery. There are numerous standards for such metadata which in practice may not be used in consistent ways. It is for instance important to note that some metadata are destined to be processed by machines (e.g., search engines) while other metadata (or views thereof) are destined to be read by the dataset consumers.
We use the standard defined by schema.org, as this is a solid, industry standard that is both extensive and used by major dataset search engines (e.g., https://datasetsearch.research.google.com/).
We also recommend sharing data via Zenodo which a major platform for sharing scientific datasets. When uploading datasets, Zenodo offers an interface for describing the dataset which uses the fields defined by schema.org and generates metadata following that standard that will be attached to the dataset.
We also recommend adopting a minimal approach to the content of this metadata and to only fill out key information that may be useful for locating a particular dataset and determining if the dataset is of relevance to the data consumer. More detailed information about the data (e.g., the codebook) should be included in the dataset itself. Below we present the key properties to describe datasets within schema.org. For the official full specification, see https://schema.org/Dataset.
Metadata file and location
Metadata may be embedded in a file (e.g., in an html page) and/or it may be stored as a separate dedicated file within the dataset folder (e.g., dataset_description.json
) and/or outside the folder. As these strategies are used somewhat inconsistently in practiced, it is important here to consider who (or what) the metadata is destined for.
As JSON files are notoriously hard to read and edit by humans (in particular when they are long, with nested structure) we do not recommend exposing them directly to data consumers. Instead we prefer the option of presenting to humans all relevant metadata within the datasets’ README file. Note that the JSON metadata may be embedded in the README file and displayed in a more human readable way, while still allowing its processing by machines (for an example of this approach see HuggingFace dataset cards.)
Furthermore, when sharing data on Zenodo it is necessary to fill out a form describing the shared dataset. This form is used to populate a JSON file that can be downloaded by the user, but most importantly, can be processed by search engines and help people find your dataset (in this case the metadata is external to the dataset).