Glossary
Controlled vocabulary for behavioral data terms and BDM concepts
- 
accuracy float
- 
Refers to a measure of performance. In many behavioral tasks, it reflects the percentage (0-100%) or fraction (0-1) of correct responses. Always use accuracy to refer to a performance measure that is a real number (float) bounded to the [0, 1] range.
Range: 0 to 1 (inclusive)
- 
correct boolean
- 
A boolean that indicates whether the response in a given trial was correct. When no response was given although one was required (i.e., a timeout), correct evaluates to FALSE rather than N/A. This avoids the case where subjects would be given a high performance score when in fact they avoided all difficult trials and responded correctly only to easy trials.
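The effect of this rule on accuracy can be sketched as follows. This is a minimal illustration, not a BDM implementation: the trial records and the `response_matches_target` field are hypothetical and exist only for the example.

```python
# Hypothetical trial records; `response_matches_target` is an assumed helper field.
trials = [
    {"trial_index": 1, "timed_out": False, "response_matches_target": True},
    {"trial_index": 2, "timed_out": True, "response_matches_target": None},
    {"trial_index": 3, "timed_out": False, "response_matches_target": True},
    {"trial_index": 4, "timed_out": True, "response_matches_target": None},
]


def score_trial(trial):
    """Return `correct` as a boolean; a timeout counts as incorrect, never as missing."""
    if trial["timed_out"]:
        return False
    return bool(trial["response_matches_target"])


def accuracy(trials):
    """Fraction of correct trials, bounded to the [0, 1] range."""
    correct = [score_trial(trial) for trial in trials]
    return sum(correct) / len(correct)


print(accuracy(trials))
```

Had the two timed-out trials been coded as missing and dropped, the same subject would have obtained an accuracy of 1.0 instead of 0.5.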
- 
response_time float
- 
The meaning of response time or reaction time (and its unit) is not consistent across studies. In BDM, response_time is the duration in seconds between a) the moment the subject fully completed their response on a given trial, and b) the moment that the earliest possible correct response could have been completed by a hypothetical agent with perfect knowledge of the task and the ability to execute the response instantaneously.
Range: in seconds
- 
timed_out boolean
- 
Indicates whether the subject failed to respond within the allocated time period. 
- 
age float
- 
Age is typically expressed in years. However, we don’t recommend rounding age to get integer values, as rounding implies losing data. It is better to leave variables as real numbers (floats) when they are real-valued and let the data analysts decide whether rounding is necessary for their specific use case.
- 
gender/sex enum
- 
Gender and sex are not exactly the same. Sex refers to biological sex while gender is a more complex construct. A person may, for example, have a male biological sex but identify as a woman. Depending on the question asked, the variable should therefore be either sex or gender. For example, “What sex were you assigned at birth, such as on an original birth certificate?” is a question about biological sex and should be coded as sex. The possible values for sex are:
Range:
- female: female (girl, woman)
- male: male (boy, man)
- other: other non-binary
- skip: prefer not to say
 
- 
length float
- 
Refers to the length in centimeters of a physical object. When possible use a more specific word (e.g., height, width, distance). 
- 
height float
- 
Refers to the height of a physical object in centimeters. 
- 
width float
- 
Refers to the width of a physical object in centimeters. 
- 
weight float
- 
Refers to the weight of a physical object in kilograms. 
- 
*_count integer
- 
Refers to the cardinality of that entity. A variable named page_count indicates the number of pages. Or, if an observation/row has car_count = 5, this means that this particular observation involves a total count of 5 cars; this 5 is unrelated to other rows in the table.
- 
type enum
- 
Type is always an enum with known values. The meaning of the particular enum value needs to be explained in a codebook. 
- 
description string
- 
Description is always a text (string) for human consumption. While it is not strictly necessary, a textual description can greatly facilitate the understanding and processing of the data by humans. 
- 
mean float
- 
The average of a numeric variable. 
- 
median float
- 
The median of the variable. 
- mode
- 
The mode of a variable. 
- min
- 
The minimum value of a variable. 
- max
- 
The maximum value of a variable. 
- sd
- 
The standard deviation of a variable. 
- var
- 
The variance of a variable. 
- iqr
- 
The interquartile range of a variable. 
- 
sum float
- 
The sum of all values of a variable (e.g., item_price_sum = sum(item_price)).
- 
quantile* float
- 
Quantile is similar to percentile, as both refer to the value of a parameter Q that splits the data such that a given fraction of the data is smaller than Q. Quantile expresses that fraction as a number between 0 and 1 while percentiles express it as a percentage (between 0 and 100). 
- 
rank integer
- 
Rank of a value in a set (ascending or first to last). 
- 
log float
- 
Natural log. 
- 
log2 float
- 
Log of base 2. 
- 
log10 float
- 
Log of base 10. 
- sqrt
- 
Square root. 
- pow2
- 
Power of 2. 
- 
floor integer
- 
Flooring of a number (e.g., 3.6 becomes 3). 
- 
ceil integer
- 
Ceiling of a number (e.g., 3.6 becomes 4). 
- 
round integer
- 
Rounding of a number to the closest integer (e.g., 3.6 becomes 4). 
- 
*_id string or integer
- 
If a column or variable name is suffixed with _id (e.g., participant_id, task_id), it is expected that there exists a supplementary table with the same name (“participant”, “task”) and a primary key named id, such that a value in the first (participant_id = 215) refers to an entry in the second (a row in the participant table where id = 215). The values in a variable suffixed with _id are expected to be unique within a “local scope” of the source table; they are not expected to be unique globally. For that purpose, use _uuid.
Range: unique within a table or within an explicit context
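The relationship between a `participant_id` column and the `id` key of a participant table can be sketched with plain Python structures. The tables and their values below are made up for illustration; only the naming convention comes from the glossary.

```python
# Hypothetical "participant" table with a primary key named `id`.
participant = [
    {"id": 215, "age": 23.4},
    {"id": 216, "age": 31.0},
]

# Hypothetical "trial" table whose `participant_id` column refers to participant.id.
trial = [
    {"participant_id": 215, "trial_index": 1, "correct": True},
    {"participant_id": 216, "trial_index": 1, "correct": False},
    {"participant_id": 215, "trial_index": 2, "correct": True},
]

# Index the referenced table by its primary key for direct lookup.
participant_by_id = {row["id"]: row for row in participant}


def resolve_participant(trial_row):
    """Follow the `participant_id` reference to the matching participant row."""
    return participant_by_id[trial_row["participant_id"]]


for row in trial:
    print(row["trial_index"], resolve_participant(row)["age"])
```

A `participant_id` value with no matching `id` raises a `KeyError` here, which is one simple way to detect a broken reference.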
- 
*_name string
- 
Sometimes “name” is used in a way that is similar to a unique id (e.g., study_name or task_name). The difference between “id” and “name” is that “name” is expected to be a readable text (e.g., n-back versus f346-r23v). As with “id”, it is expected that it refers to other tables and that it is unique within a certain context (contrary to, for example, “label”).
- 
*_uuid string
- 
A Universally Unique Identifier (UUID) is a random label of 32 hexadecimal digits (128 bits) that can be generated on the fly and will most likely be unique across computer systems. A UUID can be used to assign a record a unique identifier without having to ensure that the value is not already used by other records or tables.
Range: UUIDv7 or later
- 
*_hash string
- 
It is sometimes useful to create reproducible keys based on some data. A hash is not strictly necessary, as it can be recomputed from the underlying data, but it can be convenient for data processing. 
- 
*_index integer
- 
Indices should be favored over labels and ids when a variable is used for referencing and when order is important (often, but not always, the chronological order). For example, a variable named stimulus_position_index implies that its value points to an entry in a list of possible stimulus positions.
Range: 1-based indices
- 
*_repetition integer
- 
Repetition counts the number of times the same “thing” occurred, e.g., a participant completes the same test twice, or the same stimulus appears multiple times.
Range: 0-based
- 
*_label string
- 
A text that is attached to a variable and identifies it. It is expected to be human readable, but not always unique. 
General
Other measures of durations exist and may be useful to describe subjects’ responses. If such additional measures are needed, they should be specified explicitly; for example: response_onset, response_offset, or response_duration.
Units for response times are not consistent across papers and publicly available datasets. One can find them expressed in either seconds or milliseconds. BDM uses seconds as the default unit for response times to:
- avoid exceptions by always using seconds as the temporal unit;
- avoid additional computation by keeping the units as they currently are in our raw data and task specifications;
- avoid the temptation to round times to integers when expressed in milliseconds;
- take advantage of the fact that many popular packages for analysing response times seem to use seconds as the default unit;
- be consistent with what seems to be the default unit in fMRI data standards (e.g., BIDS or DICOM).
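When a source dataset stores times in milliseconds, the conversion to the BDM unit is a single division with no rounding. The sketch below assumes a hypothetical list of raw millisecond values; the variable and function names are illustrative only.

```python
def milliseconds_to_seconds(value_ms):
    """Convert a duration from milliseconds to seconds, keeping it as a float."""
    return value_ms / 1000


# Hypothetical raw response times in milliseconds from an external dataset.
raw_response_time_ms = [512, 1048.5, 233]

# Stored as `response_time` in seconds; values are deliberately not rounded.
response_time = [milliseconds_to_seconds(v) for v in raw_response_time_ms]
print(response_time)
```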
It is tempting to abbreviate response_time as rt. However, there are several other variables prefixed response_ which do not have abbreviations. Spelling the names out, while making the name longer, makes the overall data structure more consistent and explicit.
Demographics
Generic Suffixes
Don’t use length to mean count or size. This is contrary to the terms used in arrays/lists in programming languages.
Avoid the use of size as this term is ambiguous; it could refer to the height of a person, the screen width \(\times\) height dimensions, or a level within a Likert scale (e.g., “Medium”).
Note that “count” is different from “sum” (e.g., one can sum negative float values while count involves positive integers only) and from “index” (e.g., “this is the second” versus “there are two”).
Avoid the use of n to refer to counts. While using n to refer to counts is much shorter and might be standard in some circles, count is more explicit and less error-prone than n, which may mean different things in other contexts (e.g., the length of the variable, an iterator).
It can be tempting to use synonyms of “type”, in particular when “type” is already used for something else. Such synonyms include things like “category”, “kind” or “set”. When those terms are not required, they should be avoided and replaced by “type”.
Aggregation Suffixes
Don’t use avg or average to refer to the mean value.
Don’t use med to refer to the median.
Don’t use std or SD to refer to the standard deviation.
Don’t use IQR to refer to the interquartile range.
Don’t use total to designate the result of a sum operation.
Use quantiles rather than percentiles because they allow naming the resulting variables in a simpler way. BDM uses the following convention to name the resulting variable:
- quantile(x, q = 0.23) -> quantile23
- quantile(x, q = 0.145) -> quantile145
Note that quantile(x, q = 1) cannot be expressed using this convention. However, quantile(x, q = 1) is in fact equivalent to max(x), which is the preferred expression.
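The naming convention can be sketched as a small helper that keeps the digits after the decimal point of q. The function name `quantile_name` is hypothetical, and the handling of cases the glossary does not show (for example, q = 0.5 giving quantile5) is an assumption extrapolated from the two examples.

```python
from decimal import Decimal


def quantile_name(q):
    """Build a variable name from a quantile fraction, e.g. 0.23 -> 'quantile23'.

    Assumption: the name is 'quantile' followed by the decimal digits of q.
    q must lie strictly between 0 and 1; q = 1 is expressed as max instead.
    """
    fraction = Decimal(str(q))
    if not (0 < fraction < 1):
        raise ValueError("q must be strictly between 0 and 1; use min or max otherwise")
    # Fixed-point formatting avoids scientific notation for very small q.
    digits = format(fraction.normalize(), "f").split(".")[1]
    return f"quantile{digits}"


print(quantile_name(0.23))
print(quantile_name(0.145))
```

Going through `Decimal(str(q))` rather than multiplying the float avoids binary floating-point artifacts leaking into the name.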
Variables can be sorted (for example from the smallest to the largest values) and some values can be tied (in which case the rank may no longer be represented by integers). Also, it might not be clear whether the ranks are descending or ascending (e.g., age_rank). If such confusion arises, it is preferred to use a more explicit name (e.g., youngest_to_oldest or youngest_first_rank).
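To illustrate why tied values can produce non-integer ranks, here is a minimal sketch of ascending ranking where tied values share the average of the positions they occupy. Averaging is only one of several tie-handling methods and is an assumption here; the glossary does not prescribe one. The function name and the ages are made up.

```python
def average_ranks(values):
    """Return 1-based ascending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values equal to the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = ((start + 1) + (end + 1)) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


ages = [23.5, 31.0, 23.5, 40.2]
youngest_first_rank = average_ranks(ages)
print(youngest_first_rank)
```

The two participants aged 23.5 occupy positions 1 and 2 and therefore both receive the rank 1.5.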
Transformation Suffixes
Always specify the base when using the log except for the natural log.
Referencing
Note that “id” typically implies a context within which the “id” is unique. That context must be made explicit. For example, trial_id may identify trials within a trial table for one activity completed by one subject.
If there is a column named id (i.e., without prefix), it is expected to be a primary key, and other tables or files are expected to refer to this column; if such a link between tables does not exist, use index or name instead.
The postfix _id does not imply a particular data type: both integers and strings are valid.
*_uuids are expected to be globally unique.
*_uuids are not expected to be human interpretable.
Avoid using _uid suffix to refer to a UUID variable.
Within BDM, string-formatted Version 7 UUIDs are preferred over older versions or corresponding 128-bit integers. For example: 01934efd-35d5-79db-9aca-fc29b0451cd1.
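As a rough check that a stored value is a string-formatted Version 7 UUID, the standard `uuid` module can parse it and report its version. The sketch below uses the example value from the text; reading the first 48 bits as a Unix timestamp in milliseconds follows the UUIDv7 layout, and the variable names are illustrative.

```python
import uuid
from datetime import datetime, timezone

# Example value taken from the text above.
session_uuid = "01934efd-35d5-79db-9aca-fc29b0451cd1"

parsed = uuid.UUID(session_uuid)  # raises ValueError if the string is malformed
print(parsed.version)

# In a Version 7 UUID the 48 most significant bits hold a Unix timestamp in ms.
timestamp_ms = parsed.int >> 80
created_at = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
print(created_at.isoformat())
```

Because the timestamp occupies the leading bits, Version 7 UUIDs sort roughly in creation order when compared as strings.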
There is no single widespread standard for hashing; rather there are multiple algorithms that can be used depending on the use case. You can use either CRC32 (8 hexadecimal characters; e.g., “d87f7e0c” for the input “test”) or SHA256 (64 hexadecimal characters; e.g., “9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08” for the same input) depending on the probability of collision (i.e., two hashes for different data being identical). When that collision probability is deemed high, use SHA256.
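Both algorithms are available in the Python standard library. The sketch below is an illustration only: the helper names and the record content are hypothetical, and encoding the data as UTF-8 before hashing is an assumption.

```python
import hashlib
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC32 checksum as 8 lowercase hexadecimal characters."""
    return format(zlib.crc32(data), "08x")


def sha256_hex(data: bytes) -> str:
    """SHA256 digest as 64 lowercase hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical record content used to derive a reproducible key.
record = "participant_id=215;task_name=n-back".encode("utf-8")

record_hash_short = crc32_hex(record)
record_hash_long = sha256_hex(record)
print(len(record_hash_short), len(record_hash_long))
```

The same input always yields the same hash, which is what makes a `*_hash` column reproducible; the much larger output space of SHA256 is what lowers its collision probability.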
Note that “index” typically implies a context, within which the indexing occurs and that context must be made explicit. For example, trial_index may index trials within a block.
BDM follows the convention of 1-based indexing: always starting counting/indexing from 1 rather than 0.
Avoid the use of *_number because it is ambiguous.
As with index and id, repetition assumes a context which must be clarified when ambiguous.
Repetition is 0-based: it starts “counting” at 0 rather than 1; *_iteration instead of *_repetition would make it 1-based like indices, but it is less explicit and thus less preferred.
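The difference between the 1-based index and the 0-based repetition can be sketched on a short stimulus sequence. The function and column names below are hypothetical and the sequence is made up; the context for both variables is assumed to be a single block.

```python
def add_index_and_repetition(stimulus_names):
    """Annotate a sequence with a 1-based index and a 0-based repetition counter."""
    times_seen = {}
    rows = []
    for trial_index, name in enumerate(stimulus_names, start=1):
        repetition = times_seen.get(name, 0)  # 0 on the first occurrence
        rows.append({
            "trial_index": trial_index,
            "stimulus_name": name,
            "stimulus_repetition": repetition,
        })
        times_seen[name] = repetition + 1
    return rows


rows = add_index_and_repetition(["A", "B", "A", "A"])
for row in rows:
    print(row)
```

The first trial has `trial_index = 1`, while the first occurrence of each stimulus has `stimulus_repetition = 0`.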
