Getting started

Understand what BDM is and how to use it

What is BDM?

Behaverse Data Model (BDM) is an interoperable dataset architecture for a wide range of cognitive tests and questionnaires. It provides a set of conventions for organizing behavioral data so that both domain experts and technical experts can easily understand it. It is opinionated but also flexible, so you can adapt it to your specific needs.

In the following sections, you will learn about the structure of a BDM dataset and how to create one. You will also learn about the different types of data in BDM and how to describe them with metadata.

What does a BDM dataset look like?

A BDM dataset is a collection of files and directories that follow a specific structure. Here is an example of a simple BDM dataset:

dataset/
├── README.md       (1)
├── agents.csv      (2)
├── instruments/    (3)
└── data/           (4)
    ├── agent_1/
    ├── agent_2/
    └── agent_3/

1: The README.md file contains both a human-readable description of the dataset (as markdown) and machine-readable metadata (as YAML front matter).
2: The agents.csv file contains information about the human subjects and artificial agents in the dataset. It is a CSV file with one row per agent and columns for attributes.
3: The instruments/ folder contains parameters for the tasks, instructions, and questionnaires. Each instrument is represented as a YAML file that describes the instrument and its parameters. The instrument files are named after the instrument name, e.g., instruments/DigitSpan.yaml, and contain the various parameters for the instrument.
4: The data/ folder contains the data files. Within this folder, there is one folder per agent (the human or machine participants of the study). This directory-based partitioning makes it easier to access and manage the data.
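As a rough illustration of the layout described above, the following sketch creates an empty skeleton in a temporary directory using only the Python standard library. The helper name create_skeleton, the `title` front matter key, and the `agent_id` column are assumptions made for this example; they are not defined by BDM.

```python
import tempfile
from pathlib import Path


def create_skeleton(root: Path, agent_ids: list[str]) -> None:
    """Create an empty BDM-style layout under `root` (illustrative helper)."""
    root.mkdir(parents=True, exist_ok=True)
    # README.md: YAML front matter (machine-readable) followed by markdown.
    # The `title` key is a placeholder, not a field defined by BDM.
    (root / "README.md").write_text(
        "---\ntitle: Example dataset\n---\n\n# Example dataset\n",
        encoding="utf-8",
    )
    # agents.csv: one row per agent; `agent_id` is an assumed column name.
    rows = ["agent_id"] + agent_ids
    (root / "agents.csv").write_text("\n".join(rows) + "\n", encoding="utf-8")
    (root / "instruments").mkdir(exist_ok=True)
    for agent_id in agent_ids:
        (root / "data" / f"agent_{agent_id}").mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    create_skeleton(dataset, ["1", "2", "3"])
    # Collect every created path relative to the dataset root.
    created = sorted(p.relative_to(dataset).as_posix() for p in dataset.rglob("*"))
    print(created)
```

The session and activity folders are left out here because they are only created once data is collected.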

Data files

The data/ folder contains the data files. Within this folder, data files are organized in a hierarchical structure: <AGENT>/<SESSION>/<ACTIVITY>/<DATA_FILE>. This is commonly called directory-based partitioning and allows for easy access to the data files.

The data/<AGENT>/ folder contains the data for a single human subject or computer agent. It is organized into one folder per session of the study. It may also contain other agent-specific data, such as studyflow.csv, which describes the order of the activities for the agent.
The data/<AGENT>/<SESSION>/ folder contains the data for a single session. Each session folder can contain multiple activities.
The data/<AGENT>/<SESSION>/<ACTIVITY>/ folders are activity folders. Activities are the cognitive tests, questionnaires, or other data collection instruments used in the study. Each activity folder contains data files for one or more attempts of the same activity.
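Because the partitioning is encoded in the path itself, the agent, session, and activity of a data file can be recovered without opening it. The sketch below shows one way to do that; DataFileLocation and parse_data_path are hypothetical names for this example, not part of any BDM tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DataFileLocation(NamedTuple):
    agent: str
    session: str
    activity: str
    data_file: str


def parse_data_path(path: str) -> DataFileLocation:
    """Split 'data/<AGENT>/<SESSION>/<ACTIVITY>/<DATA_FILE>' into its parts."""
    parts = PurePosixPath(path).parts
    # Agent-level files such as data/<AGENT>/studyflow.csv are rejected here.
    if len(parts) != 5 or parts[0] != "data":
        raise ValueError(f"not an activity-level data file path: {path!r}")
    return DataFileLocation(*parts[1:])


loc = parse_data_path("data/agent_1/session_2/DigitSpan/response_1.csv")
print(loc.agent, loc.session, loc.activity, loc.data_file)
```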

Here is an example data/ folder for a dataset with three agents, each with one or two sessions, where each session has two activities (UFOV and DigitSpan):

...
data/
├── agent_1/
│   ├── studyflow.csv
│   ├── session_1/
│   │   ├── UFOV/
│   │   │   └── ...
│   │   └── DigitSpan/
│   │       └── ...
│   └── session_2/
│       ├── UFOV/
│       │   └── ...
│       └── DigitSpan/
│           └── ...
├── agent_2/
│   ├── studyflow.csv
│   └── session_1/
│       ├── UFOV/
│       │   └── ...
│       └── DigitSpan/
│           └── ...
└── agent_3/
    ├── studyflow.csv
    ├── session_1/
    │   ├── UFOV/
    │   │   └── ...
    │   └── DigitSpan/
    │       └── ...
    └── session_2/
        ├── UFOV/
        │   └── ...
        └── DigitSpan/
            └── ...

Note that:

Each agent has a folder named agent_<ID>, where <ID> is a unique identifier for the human subject or computer agent. <ID> can be a number or a string, but must be unique within the dataset. For convenience, if your dataset contains only human subjects, you may also use the data/subject_<ID>/ format; the root data folder remains data/. However, the agent_<ID> naming is recommended for all datasets because it is more flexible and extensible.
The studyflow.csv file contains the order and metadata of all the activities for the agent.
Each session has a folder named session_<ID>, where <ID> is a unique identifier for the session. Session identifiers are unique within the dataset and shared across agents, so session_1 for agent_1 has the same session structure as session_1 for agent_2.
Each activity has a folder at <AGENT>/<SESSION>/<ACTIVITY>/, where <ACTIVITY> is the name of the instrument used in the activity. In the example, the activities are UFOV and DigitSpan. For each activity, there is an instrument file with the same name in the instruments/ folder.
The general idea of using file paths to partition data is similar to dataset partitioning in Apache Arrow or DuckDB.
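To make the naming rules above concrete, here is a minimal sketch that walks a data/ folder and lists every (agent, session, activity) combination it finds. The function name list_activities is hypothetical, and the throwaway tree built in a temporary directory only mimics the example layout.

```python
import tempfile
from pathlib import Path


def list_activities(data_root: Path) -> list[tuple[str, str, str]]:
    """Return sorted (agent, session, activity) triples found under data/."""
    triples = []
    # Activity folders sit exactly three levels below the data root.
    for activity_dir in data_root.glob("*/*/*"):
        if activity_dir.is_dir():
            agent, session, activity = activity_dir.relative_to(data_root).parts
            triples.append((agent, session, activity))
    return sorted(triples)


# Build a tiny throwaway tree to demonstrate the traversal.
layout = {"agent_1": ["session_1", "session_2"], "agent_2": ["session_1"]}
with tempfile.TemporaryDirectory() as tmp:
    data_root = Path(tmp) / "data"
    for agent, sessions in layout.items():
        for session in sessions:
            for activity in ["UFOV", "DigitSpan"]:
                (data_root / agent / session / activity).mkdir(parents=True)
        # Agent-level file; it is two levels deep, so the glob skips it.
        (data_root / agent / "studyflow.csv").write_text("", encoding="utf-8")
    found = list_activities(data_root)
    print(found)
```

Tools such as Apache Arrow or DuckDB can read partitioned directory trees directly; the plain traversal above is only meant to show what information the paths carry.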

Levels of data

BDM organizes data in three levels: events, trials, and statistics & models. Each level of data is represented by a different type of file.

Events: Events are the lowest level of data in BDM. They represent the raw data collected during an activity. For example, in a DigitSpan activity, an event might be a single digit that the subject has to remember and the timestamp when the digit was presented. Events are stored in the events.csv file within the activity folder.
Trials: Trials are a higher level of data that represent a single attempt at an activity accompanied by the subject’s response and experimenter’s interpretation of the response. For example, in a DigitSpan activity, a trial might be a sequence of digits that the subject has to remember. Trials are stored in the response_<ATTEMPT>.csv file within the activity folder along with the supporting tables for stimuli and options.
Statistics & Models: Models are the highest level of data that represent the data analysis and interpretation of the trials. For example, in a DigitSpan activity, a model might be the subject’s working memory capacity or a more complex one, like a deep learning model that predicts the subject’s performance.

Trials are the main data in behavioral data analysis. Events are the raw data collected during an activity, and statistics and models are interpretations of the trials. The following sections assume trials as the main data.

Activities data

Trials are stored in the response_<ATTEMPT>.csv file within the activity folder and can be accompanied by optional supporting tables for stimuli, options, etc. See the Response table for more information.

Within the activity folders, you will find the data files collected during the activity. Here is an example of a DigitSpan activity folder:

...
DigitSpan/
├── response_1.csv
├── response_2.csv
├── stimulus_1.csv
├── stimulus_2.csv
├── option_1.csv
└── option_2.csv

The response_<ATTEMPT>.csv file contains the data collected during the DigitSpan activity, where <ATTEMPT> is a 1-indexed suffix that partitions the data by how many times the agent has initiated the activity, e.g., first attempt, second attempt, etc.
If there was only one attempt to complete the activity, the file is named response_1.csv. If there were multiple attempts, the files are named response_1.csv, response_2.csv, etc.
In the example above, there are two attempts for the DigitSpan activity, so there are two tables: response_1.csv and response_2.csv.
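The attempt number can be read back from the file name. The following sketch, with the hypothetical helper response_files_by_attempt, picks the response files out of an activity folder listing and orders them numerically, so that a tenth attempt sorts after the second rather than after the first.

```python
import re

# Matches response_<ATTEMPT>.csv where <ATTEMPT> is a positive integer suffix.
RESPONSE_PATTERN = re.compile(r"^response_(\d+)\.csv$")


def response_files_by_attempt(filenames: list[str]) -> dict[int, str]:
    """Map attempt number to response file name, ordered by attempt."""
    attempts = {}
    for name in filenames:
        match = RESPONSE_PATTERN.match(name)
        if match:
            attempts[int(match.group(1))] = name
    return dict(sorted(attempts.items()))


names = [
    "response_2.csv",
    "stimulus_1.csv",
    "response_10.csv",
    "response_1.csv",
    "option_1.csv",
]
attempts = response_files_by_attempt(names)
print(attempts)
```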

Instruments

The instruments/ folder contains the instrument files. Each instrument file is a YAML file that describes the instrument and its parameters. Here is an example of a UFOV instrument file:

name: UFOV
description: |
  The Useful Field of View (UFOV) test is a measure of visual attention. The test consists of three subtests: processing speed, divided attention, and selective attention.

parameters:
  - name: subtest
    description: The subtest of the UFOV test
    type: enum
    values:
      - processing_speed
      - divided_attention
      - selective_attention
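One possible use of an instrument file is checking recorded parameter values against the declared ones. The sketch below assumes the YAML above has already been parsed into a Python dictionary (for example with a YAML library such as PyYAML); the dictionary is written out by hand here to keep the example free of dependencies, and is_valid_parameter is a hypothetical helper.

```python
# Hand-written Python equivalent of the UFOV instrument file above
# (description omitted for brevity).
ufov = {
    "name": "UFOV",
    "parameters": [
        {
            "name": "subtest",
            "description": "The subtest of the UFOV test",
            "type": "enum",
            "values": [
                "processing_speed",
                "divided_attention",
                "selective_attention",
            ],
        }
    ],
}


def is_valid_parameter(instrument: dict, name: str, value: str) -> bool:
    """Check `value` against the enum values declared for parameter `name`."""
    for param in instrument["parameters"]:
        if param["name"] == name:
            if param.get("type") == "enum":
                return value in param["values"]
            return True  # non-enum types are not constrained in this sketch
    return False  # unknown parameter name


print(is_valid_parameter(ufov, "subtest", "divided_attention"))
print(is_valid_parameter(ufov, "subtest", "peripheral_vision"))
```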