Dataset cards

BDM has two main sources of metadata: the dataset card and the codebook.

Dataset card
A structured representation of metadata about the dataset. It is generally defined in the front matter of the dataset's README.md (see folder structure) and describes the dataset's creator, license, and other relevant properties.
Codebook
Part of the dataset card that describes the variables in the data (such as their names, types, and descriptions). It also lists the values that each variable can take and how missing data is encoded. The codebook is an essential component of the dataset, as it provides the context and semantics that help researchers interpret the data.
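
As a rough illustration of how a codebook entry can be used, the sketch below checks a raw data value against one variable definition. The dictionary shape mirrors the front-matter example on this page, but the `classify` and `allowed_values` helpers are hypothetical and not part of BDM.

```python
# Hypothetical codebook entry, written as the Python equivalent of one
# variable definition from the README.md front matter (shape is an assumption).
AGE = {
    "name": "age",
    "type": "enumeration",
    "missing": "NA",
    "values": [
        {"18-25": "between 18 and 25 years old (inclusive)"},
        {"26-35": "between 26 and 35 years old (inclusive)"},
    ],
}


def allowed_values(variable: dict) -> set[str]:
    """Collect the value labels listed for an enumeration variable."""
    return {label for entry in variable.get("values", []) for label in entry}


def classify(variable: dict, raw: str) -> str:
    """Return 'missing', 'valid', or 'invalid' for one raw data value."""
    if raw == variable.get("missing"):
        return "missing"
    if raw in allowed_values(variable):
        return "valid"
    return "invalid"


print(classify(AGE, "18-25"))  # valid
print(classify(AGE, "NA"))     # missing
print(classify(AGE, "17"))     # invalid
```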

Here is an example README.md file with a dataset card (in the front matter), a codebook (in the front matter), and a human-readable description (in markdown):

---
title: "Example dataset"
description: "This is an example dataset that contains some example data."
version: "1.0"
doi: "10.1234/5678"
url: "https://example.com/dataset"
creator:
  - name: "John One Doe"
    orcid: "0000-0000-0000-0000"
  - name: "Jane Two Doe"
    orcid: "0000-0000-0000-0001"
license: "CC BY-SA 4.0"
codebook:
  variables:
    - name: "age"
      description: "The age group of the subject"
      type: "enumeration"
      missing: "NA"
      values:
        - "18-25": "between 18 and 25 years old (inclusive)"
        - "26-35": "between 26 and 35 years old (inclusive)"
        - "36-45": "between 36 and 45 years old (inclusive)"
        - "46-55": "between 46 and 55 years old (inclusive)"
        - "56-": "56 years old (inclusive) or older"
---

# Example Dataset

This is an example dataset that contains some example data.

## Citation

If you use this dataset, please cite it as follows:

...

## License

This dataset is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/).
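
To read the dataset card programmatically, the front matter first has to be separated from the markdown body. The following is a minimal sketch using only the Python standard library; `split_front_matter` is a hypothetical helper, and it assumes the front matter starts on the first line and is delimited by lines containing only `---`. The returned front-matter string could then be handed to a YAML parser of your choice.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Split README text into (front_matter, body).

    Returns an empty front matter if the text does not start with a
    '---' line or if no closing '---' line is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    return "", text  # no closing delimiter: treat everything as body


readme = '---\ntitle: "Example dataset"\nversion: "1.0"\n---\n\n# Example Dataset\n'
front, body = split_front_matter(readme)
print(front)         # the two YAML lines between the delimiters
print(body.strip())  # the markdown heading
```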

Required fields

| Property | Type | Description |
|---|---|---|
| name | Text | The name of the dataset. |
| description | Text | Description of the dataset. |
| version | Number or Text | The version of the dataset. |
| license | URL | The license that applies to the dataset. |
| creator | Organization or Person | The creator(s)/author(s) of the published dataset. |
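
A check for the required properties can be sketched as follows. It assumes the dataset card has already been parsed into a Python dictionary; `missing_required_fields` is a hypothetical helper, not a BDM API, and it only tests for presence, not for the property types listed above.

```python
REQUIRED_FIELDS = ("name", "description", "version", "license", "creator")


def missing_required_fields(card: dict) -> list[str]:
    """Return the required properties that are absent or empty in the card."""
    return [
        field for field in REQUIRED_FIELDS
        if card.get(field) in (None, "", [], {})
    ]


# Hypothetical parsed card with an empty creator list and no license.
card = {
    "name": "Example dataset",
    "description": "This is an example dataset that contains some example data.",
    "version": "1.0",
    "creator": [],
}
print(missing_required_fields(card))  # ['license', 'creator']
```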

Derived datasets

Sometimes a dataset is curated/aggregated and shared by people who were not the original collectors of the data. In those cases it might be unclear how exactly to attribute proper credit to each type of contribution.

In BDM, the creator(s) of a dataset are the people or organizations that published that specific version of the dataset, not the people or organizations that collected the data. For instance, if the shared dataset has been processed inadequately, it would not make sense to blame the original data collectors for such mistakes.

That said, the original data collectors must be appropriately credited. We recommend five mechanisms:

  • name of the dataset refers to the original data collectors;
  • description of the dataset states clearly that data collection was performed by other people;
  • citation provides a link to cite the original study;
  • creditText provides credit to both the curated and the original versions;
  • isBasedOn provides information and links to the original dataset.

| Property | Type | Description |
|---|---|---|
| isBasedOn | URL or CreativeWork | A resource from which this work is derived or from which it is a modification or adaptation. |
| citation | CreativeWork or Text | A citation or reference to another creative work, such as another publication, web page, scholarly article, etc. |
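
The attribution properties of a derived dataset can be checked in the same spirit. The sketch below assumes a card parsed into a Python dictionary and reports which of `isBasedOn`, `citation`, and `creditText` are not filled in. The card contents and the `missing_attribution` helper are illustrative assumptions, and the elided citation is left as a placeholder.

```python
ATTRIBUTION_FIELDS = ("isBasedOn", "citation", "creditText")

# Hypothetical card for a curated version of someone else's dataset.
derived_card = {
    "name": "Example dataset (curated version of the original study data)",
    "description": "Curated version; data collection was performed by the original authors.",
    "creator": [{"name": "Jane Two Doe"}],  # publisher of this version
    "isBasedOn": "https://example.com/dataset",
    "citation": "...",  # placeholder for the original study's citation
}


def missing_attribution(card: dict) -> list[str]:
    """Return the attribution properties that are absent or empty."""
    return [field for field in ATTRIBUTION_FIELDS if not card.get(field)]


print(missing_attribution(derived_card))  # ['creditText']
```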