Glossary

Controlled vocabulary for behavioral data terms and BDM concepts

General

accuracy float

Refers to a measure of performance. In many behavioral tasks, it reflects the percentage (0-100%) or fraction (0-1) of correct responses. Always use accuracy to refer to a performance measure that is a real number (float) and bounded to the [0-1] range.

Range

0 to 1 (inclusive)

correct boolean

A boolean which indicates whether a response in a given trial was correct or not. When no response was given when it should (i.e., timeout), correct evaluates to FALSE rather than N/A. This is to avoid the case where subjects would be given a high performance score when in fact they avoided all difficult trials and responded correctly only to easy trials.

response_time float

The meaning of response time or reaction time (and its unit) is not consistent across studies. In BDM, response_time is the duration in seconds between a) the moment the subjects fully completed their response on a given trial, and b) the moment that the earliest possible correct response could have been completed by a hypothetical agent with perfect knowledge of the task and ability to instantaneously execute the response.

Range

In seconds

Other measures of durations exist and may be useful to describe subjects’ responses. If such additional measures are needed, they should be specified explicitly; for example: response_onset, response_offset, or response_duration.

Units for response times are not consistent across papers and publicly available datasets. One can find them expressed in either seconds or milliseconds. BDM uses seconds as the default unit for response times to: - avoid “exception” by always using seconds as the temporal unit; - avoid additional computation by keeping the units as they currently are in our raw data and task speicifications; - avoid the temptation to round times to integers when expressed in milliseconds; - take advantage of the fact that many popular packages to analyse response time seem to be using seconds as the default unit; - be consistent with what seems to be the default unit in fMRI data standards (e.g., BIDS or DICOMs).

It is tempting to abbreviate response_time as rt. However, there are several other variables prefixed response_ which do not have abbreviations. Spelling the names out, while making the name longer, makes the overall data structure more consistent and explicit.

timed_out boolean

Indicates whether the subject failed to respond within the allocated time period.

Demographics

age float

Age is typically expressed in years. However, we don’t recommend rounding “age” to get integer values, as rounding implies losing data. It is better to leave variables as real numbers (floats when they are floats) and let the data analysts decide whether or not rounding this variable is necessary for their specific use case.

gender/sex enum

Gender and sex are not exactly the same. Sex refers to a biological sex while gender is a more complex construct. A person may have a male biological sex but identify as a women for example. Depending on the question asked, the variable should therefore be either sex or gender.

For example, “What sex were you assigned at birth, such as on an original birth certificate?” is a question about biological sex and should be coded as sex. The possible values for sex are:

Range

female: female (girl, woman)

male: male (boy, man)

other: other non-binary

skip: prefer not to say

Generic Suffixes

length float

Refers to the length in centimeters of a physical object. When possible use a more specific word (e.g., height, width, distance).

Don’t use length to mean count or size. This is contrary to the terms used in arrays/lists in programming languages.

height float

Refers to the height of a physical object in centimeters.

width float

Refers to the width of a physical object in centimeters.

weight float

Refers to the weight of a physical object in kilogram.

*_count integer

Refers to the cardinality of that entity. A variable named page_count indicates the number of pages. Or, if an observation/row has car_count = 5 this means that this particular observation involves a total count of 5 cars; this 5 is unrelated to other rows in the table.

Avoid the use of size as this term is ambiguous; it could refer to the height of a person, the screen width \(\times\) times height dimensions, or a level within a likert scale (e.g., “Medium”).

“Note that”count” is different from “sum” (e.g., one can sum negative float values while count involves positive integers only) and from “index” (e.g., “this is the second” versus “there are two”).

Avoid the use of n to refer to counts. While using n to refer to counts is much shorter and might be standard in some circles, count is more explicit and less error-prone than n which may mean different things in other contexts (e.g., the length of the variable, an iterator).”

type enum

Type is always an enum with known values. The meaning of the particular enum value needs to be explained in a codebook.

It can be tempting to use synonyms of “type”, in particular when “type” is already used for something else. Such synonyms include things like “category”, “kind” or “set”. When those terms are not required, they should be avoided and replaced by “type”.

description string

Description is always a text (string) for human consumption. While it is not strictly necessary, a textual description can greatly facilitate the understanding and processing of the data by humans.

Aggregation Suffixes

mean float

The average of a numeric variable.

Don’t use avg or average to refer to the mean value.

median float

The median of the variable.

Don’t use med to refer to the median.

mode

The mode of a variable.

min

The minimum value of a variable.

max

The maximum value of a variable.

The standard deviation of a variable.

Don’t use std or SD to refer to the standard deviation.

var

The variance of a variable.

iqr

The interquartile range of a variable.

Don’t use IQR to refer to the interquartile range.

sum float

The sum of all values of a variable (e.g., item_price_sum = sum(item_price)).

Don’t use total to designate the result of a sum operation.

quantile* float

Quantile is similar to percentile, as both refer to the value of a parameter Q that splits the data such that a given fraction of the data is smaller than Q. Quantile expresses that fraction as a number between 0 and 1 while percentiles express it as a percentage (between 0 and 100).

Use quantiles rather than percentiles because they allow naming the resulting variables in a simpler way. BDM uses the following convention to name the parameter X: - quantile(x, q = 0.23) -> quantile23 - quantile(x, q = 0.145) -> quantile145

Note that quantile(x, q = 1) can not be expressed using this convention. However, quantile(x, q = 1) is in fact equivalent to max(x) which is the preferred expression.

rank integer

Rank of a value in a set (ascending or first to last).

Variables can be sorted (for example from the smallest to the largest values) and some values can be tied (in which case the rank may no longer be represented by integers). Also, it might not be clear if the ranks are descending or ascending (e.g., age_rank). If such confusion arises, it is prefered to use a more explicit name (e.g., youngest_to_oldest or youngest_first_rank).

Transformation Suffixes

log float

Natural log.

log2 float

Log of base 2.

log10 float

Log of base 10.

Always specify the base when using the log except for the natural log.

sqrt

Square root.

pow2

Power of 2.

floor integer

Flooring of a number (e.g., 3.6 becomes 3).

ceil integer

Ceiling of a number (e.g., 3.6 becomes 4).

round integer

Rounding of a number to the closest integer (e.g., 3.6 becomes 4).

Referencing

*_id stringinteger

If a column or variable name is suffixed with _id (e.g., participant_id, task_id), it is expected that there exists a supplementary table which has the same name (“participant”, “task”), with a primary key named id such that a value of in the first (particiapant_id = 215) refers to an entry in the second (a row in the participant table where id = 215). It is expected that the values in a variable postfixed _id are unique within a “local scope” of the source table; however, it is not expected that they are unique globally—for such purposes one should use the _uuid.

Range

Unique within a table or within an explicit context

Note that “id” typically implies a context, within which the “id’ is unique. That context must be made explicit. For example, trial_id may identify trials within a trial table for one activity completed by one subject.

If there is a column named id (i.e., without prefix), it is expected to be a primary key and there exists other tables or files that refer to this column; if such a link between tables does not exist, use index or name instead.

The postfix _id does not imply a particular data type: both integers and strings are valid.

*_name string

Sometimes “name” is used in a way that is similar to a unique id (e.g., study_name or task_name). The difference between “id” and “name” is that “name” is expected to be a readable text (e.g., n-back versus f346-r23v). As with “id”, it is expected that it refers to other tables and that it is unique within a certain context (contrary to, for example, “label”).

*_uuid string

Universally Unique Identifier (UUID) is a random 32-digit label that can be generated on the fly and will most likely be unique in computer systems. UUID can be used to assign a record a unique identifier without having to ensure that that number is not yet used by some other records or tables.

Range

UUIDv7 or later

*_uuids are expected to be globally unique.

*_uuids are not expected to be human interpretable.

Avoid using _uid suffix to refer to a UUID variable.

Within BDM, string-formatted Version 7 UUIDs are preferred over older versions or corresponding 128-bit integers. For example: 01934efd-35d5-79db-9aca-fc29b0451cd1.

*_hash string

It is sometimes useful to create a reproducible keys based on some data. A hash is not strictly necessary as it can be recreated using different data but it can be convenient for data processing.

There is no single widespread standard for hashing; rather there are multiple algorithms that can be used depending on the use case. You can use either CRC32 (32 hexadecimal characters; e.g., “098f6bcd4621d373cade4e832627b4f6”) or SHA256 (base64 characters, e.g., “d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f”) depending on the probability of collision (i.e., two hashes for different data being identical). When that collision probability is deemed high, use SHA256.

*_index integer

Indices should be favored over labels and ids when a variable is used for referencing and when order is important (often, but not always, the chronological order). For example, a variable named stimulus_position_index implies its value points to an entry in a list of possible stimulus positions.

Range

1-based indices

Note that “index” typically implies a context, within which the indexing occurs and that context must be made explicit. For example, trial_index may index trials within a block.

BDM follows the convention of 1-based indexing: always starting counting/indexing from 1 rather than 0.

Avoid the use of *_number because it is ambiguous.

*_repetition integer

Repetition counts the number of times the same “thing” occurred, e.g., a participant completes the same test twice, the same stimulus appears multiple times.

Range

0-based

As with index and id, repetition assumes a context which must be clarified when ambiguous.

Repetition is 0-based: it starts “counting” at 0 rather than 1; *_iteration instead of *_repetition would make it 1-based like indices, but it is less explicit and thus less preferred.

*_label string

A text attached to a variable and identifies it. It is expected to be human readable, but not always unique.

Categories

General

Demographics

Generic Suffixes

Aggregation Suffixes

Transformation Suffixes

Referencing