Glossary
Controlled vocabulary for behavioral data terms and BDM concepts
-
accuracy
float
-
Refers to a measure of performance. In many behavioral tasks, it reflects the percentage (0-100%) or fraction (0-1) of correct responses. Always use accuracy to refer to a performance measure that is a real number (float) and bounded to the [0-1] range.
Range
0 to 1 (inclusive)
-
correct
boolean
-
A boolean which indicates whether a response in a given trial was correct or not. When no response was given although one was required (i.e., a timeout), correct evaluates to FALSE rather than N/A. This is to avoid the case where subjects would be given a high performance score when in fact they avoided all difficult trials and responded correctly only to easy trials.
-
response_time
float
-
The meaning of response time or reaction time (and its unit) is not consistent across studies. In BDM, response_time is the duration in seconds between a) the moment the subjects fully completed their response on a given trial, and b) the moment that the earliest possible correct response could have been completed by a hypothetical agent with perfect knowledge of the task and ability to instantaneously execute the response.
Range
In seconds
-
timed_out
boolean
-
Indicates whether the subject failed to respond within the allocated time period.
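As a minimal sketch of how accuracy, correct, and timed_out fit together, the following Python snippet computes accuracy from per-trial records in which timed-out trials are kept with correct = False instead of being dropped. The trial records are made-up illustrative data and the helper name accuracy is hypothetical; neither comes from BDM itself.

```python
# Hypothetical trial records; timed-out trials keep correct = False (not missing).
trials = [
    {"correct": True, "timed_out": False},
    {"correct": False, "timed_out": False},
    {"correct": False, "timed_out": True},  # no response given: counted as incorrect
    {"correct": True, "timed_out": False},
]


def accuracy(trials):
    """Fraction of correct trials, a float bounded to the [0, 1] range."""
    if not trials:
        raise ValueError("accuracy is undefined for an empty list of trials")
    return sum(t["correct"] for t in trials) / len(trials)


print(accuracy(trials))  # 0.5
```

Had the timed-out trial been excluded instead, the same subject would score 2 out of 3, which is the inflation the FALSE-on-timeout rule is meant to prevent.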
-
age
float
-
Age is typically expressed in years. However, we don’t recommend rounding “age” to get integer values, as rounding loses information. It is better to leave variables as real numbers (floats) when they are floats and let the data analysts decide whether rounding is necessary for their specific use case.
-
gender/sex
enum
-
Gender and sex are not exactly the same. Sex refers to biological sex, while gender is a more complex construct. A person may have a male biological sex but identify as a woman, for example. Depending on the question asked, the variable should therefore be either sex or gender. For example, “What sex were you assigned at birth, such as on an original birth certificate?” is a question about biological sex and should be coded as sex. The possible values for sex are:
Range
female: female (girl, woman)
male: male (boy, man)
other: other non-binary
skip: prefer not to say
-
length
float
-
Refers to the length in centimeters of a physical object. When possible use a more specific word (e.g., height, width, distance).
-
height
float
-
Refers to the height of a physical object in centimeters.
-
width
float
-
Refers to the width of a physical object in centimeters.
-
weight
float
-
Refers to the weight of a physical object in kilograms.
-
*_count
integer
-
Refers to the cardinality of that entity. A variable named page_count indicates the number of pages. Or, if an observation/row has car_count = 5, this means that this particular observation involves a total count of 5 cars; this 5 is unrelated to other rows in the table.
-
type
enum
-
Type is always an enum with known values. The meaning of the particular enum value needs to be explained in a codebook.
-
description
string
-
Description is always a text (string) for human consumption. While it is not strictly necessary, a textual description can greatly facilitate the understanding and processing of the data by humans.
-
mean
float
-
The average of a numeric variable.
-
median
float
-
The median of the variable.
-
mode
-
The mode of a variable.
-
min
-
The minimum value of a variable.
-
max
-
The maximum value of a variable.
-
sd
-
The standard deviation of a variable.
-
var
-
The variance of a variable.
-
iqr
-
The interquartile range of a variable.
-
sum
float
-
The sum of all values of a variable (e.g., item_price_sum = sum(item_price)).
-
quantile*
float
-
Quantile is similar to percentile, as both refer to the value of a parameter Q that splits the data such that a given fraction of the data is smaller than Q. Quantile expresses that fraction as a number between 0 and 1 while percentiles express it as a percentage (between 0 and 100).
-
rank
integer
-
Rank of a value in a set (ascending or first to last).
-
log
float
-
Natural log.
-
log2
float
-
Log of base 2.
-
log10
float
-
Log of base 10.
-
sqrt
-
Square root.
-
pow2
-
Power of 2.
-
floor
integer
-
Flooring of a number (e.g., 3.6 becomes 3).
-
ceil
integer
-
Ceiling of a number (e.g., 3.6 becomes 4).
-
round
integer
-
Rounding of a number to the closest integer (e.g., 3.6 becomes 4).
-
*_id
string
integer
-
If a column or variable name is suffixed with _id (e.g., participant_id, task_id), it is expected that there exists a supplementary table which has the same name (“participant”, “task”), with a primary key named id, such that a value in the first (participant_id = 215) refers to an entry in the second (a row in the participant table where id = 215). It is expected that the values in a variable postfixed with _id are unique within a “local scope” of the source table; however, it is not expected that they are unique globally. For such purposes one should use _uuid.
Range
Unique within a table or within an explicit context
-
*_name
string
-
Sometimes “name” is used in a way that is similar to a unique id (e.g., study_name or task_name). The difference between “id” and “name” is that “name” is expected to be a readable text (e.g., n-back versus f346-r23v). As with “id”, it is expected that it refers to other tables and that it is unique within a certain context (contrary to, for example, “label”).
-
*_uuid
string
-
A Universally Unique Identifier (UUID) is a 128-bit label, usually written as 32 hexadecimal digits, that can be generated on the fly and will most likely be unique across computer systems. A UUID can be used to assign a record a unique identifier without having to ensure that the identifier is not already used by some other records or tables.
Range
UUIDv7 or later
-
*_hash
string
-
It is sometimes useful to create reproducible keys based on some data. A hash is not strictly necessary as it can be recreated using different data, but it can be convenient for data processing.
-
*_index
integer
-
Indices should be favored over labels and ids when a variable is used for referencing and when order is important (often, but not always, the chronological order). For example, a variable named stimulus_position_index implies its value points to an entry in a list of possible stimulus positions.
Range
1-based indices
-
*_repetition
integer
-
Repetition counts the number of times the same “thing” occurred, e.g., a participant completes the same test twice, the same stimulus appears multiple times.
Range
0-based
-
*_label
string
-
A text that is attached to a variable and identifies it. It is expected to be human readable, but not always unique.
General
Other measures of durations exist and may be useful to describe subjects’ responses. If such additional measures are needed, they should be specified explicitly; for example: response_onset, response_offset, or response_duration.
Units for response times are not consistent across papers and publicly available datasets. One can find them expressed in either seconds or milliseconds. BDM uses seconds as the default unit for response times to:
- avoid “exceptions” by always using seconds as the temporal unit;
- avoid additional computation by keeping the units as they currently are in our raw data and task specifications;
- avoid the temptation to round times to integers when expressed in milliseconds;
- take advantage of the fact that many popular packages to analyse response time seem to be using seconds as the default unit;
- be consistent with what seems to be the default unit in fMRI data standards (e.g., BIDS or DICOMs).
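When a source dataset stores response times in milliseconds, the conversion to the BDM unit could look like the sketch below. The column name rt_ms and the helper milliseconds_to_seconds are hypothetical names chosen for illustration; the point is only that the value is divided by 1000 and kept as a float, without rounding.

```python
def milliseconds_to_seconds(duration_ms):
    """Convert a duration in milliseconds to seconds, keeping it a float (no rounding)."""
    return duration_ms / 1000


# Hypothetical raw rows that use an abbreviated, millisecond-based column.
raw_rows = [{"rt_ms": 1500}, {"rt_ms": 250}]

# Re-express each row with the explicit BDM name and unit (seconds).
rows = [{"response_time": milliseconds_to_seconds(r["rt_ms"])} for r in raw_rows]
print(rows)  # [{'response_time': 1.5}, {'response_time': 0.25}]
```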
It is tempting to abbreviate response_time as rt. However, there are several other variables prefixed with response_ which do not have abbreviations. Spelling the names out, while making the name longer, makes the overall data structure more consistent and explicit.
Demographics
Generic Suffixes
Don’t use length to mean count or size. This is contrary to the terms used in arrays/lists in programming languages.
Avoid the use of size as this term is ambiguous; it could refer to the height of a person, the screen width \(\times\) height dimensions, or a level within a Likert scale (e.g., “Medium”).
Note that “count” is different from “sum” (e.g., one can sum negative float values while count involves non-negative integers only) and from “index” (e.g., “this is the second” versus “there are two”).
Avoid the use of n to refer to counts. While using n to refer to counts is much shorter and might be standard in some circles, count is more explicit and less error-prone than n, which may mean different things in other contexts (e.g., the length of the variable, an iterator).
It can be tempting to use synonyms of “type”, in particular when “type” is already used for something else. Such synonyms include things like “category”, “kind” or “set”. When those terms are not required, they should be avoided and replaced by “type”.
Aggregation Suffixes
Don’t use avg or average to refer to the mean value.
Don’t use med to refer to the median.
Don’t use std or SD to refer to the standard deviation.
Don’t use IQR to refer to the interquartile range.
Don’t use total to designate the result of a sum operation.
Use quantiles rather than percentiles because they allow naming the resulting variables in a simpler way. BDM uses the following convention to name the resulting variable:
- quantile(x, q = 0.23) -> quantile23
- quantile(x, q = 0.145) -> quantile145
Note that quantile(x, q = 1) cannot be expressed using this convention. However, quantile(x, q = 1) is in fact equivalent to max(x), which is the preferred expression.
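The naming convention above can be sketched in Python as follows. The function name quantile_name is hypothetical, and the sketch assumes that the suffix is simply the decimal digits of q after the leading "0."; the source only documents the two examples shown, so the treatment of other values (for instance q = 0.5 or q = 0) is an assumption here.

```python
from decimal import Decimal


def quantile_name(q):
    """Build a name such as 'quantile23' from q = 0.23 (digits after the '0.')."""
    d = Decimal(str(q))  # go through str() to avoid binary floating-point noise
    if not (0 < d < 1):
        # q = 1 cannot be expressed with this convention; the glossary prefers max(x).
        raise ValueError("q must lie strictly between 0 and 1")
    digits = format(d, "f").split(".")[1]
    return "quantile" + digits


print(quantile_name(0.23))   # quantile23
print(quantile_name(0.145))  # quantile145
```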
Variables can be sorted (for example from the smallest to the largest values) and some values can be tied (in which case the rank may no longer be represented by integers). Also, it might not be clear if the ranks are descending or ascending (e.g., age_rank). If such confusion arises, it is preferred to use a more explicit name (e.g., youngest_to_oldest or youngest_first_rank).
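To illustrate why tied values can produce non-integer ranks, here is a small sketch that assigns ascending, 1-based ranks and gives tied values the mean of the positions they occupy. Averaging ties is one common choice and is an assumption here; the glossary does not prescribe a tie-breaking method, and the helper name average_ranks and the ages are made up for the example.

```python
def average_ranks(values):
    """Ascending 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


ages = [31.5, 24.0, 31.5, 19.2]
youngest_first_rank = average_ranks(ages)
print(youngest_first_rank)  # [3.5, 2.0, 3.5, 1.0]
```

The two participants aged 31.5 share positions 3 and 4, so both receive rank 3.5, and the explicit name youngest_first_rank removes any doubt about the direction of the ordering.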
Transformation Suffixes
Always specify the base when using the log except for the natural log.
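As a brief illustration, the three log suffixes map onto Python's math module as shown below. Appending the suffix to a variable name (response_time_log and so on) is an assumption about how the suffixes compose; the value 0.8 is arbitrary.

```python
import math

response_time = 0.8  # seconds, arbitrary example value

response_time_log = math.log(response_time)      # natural log: no base in the name
response_time_log2 = math.log2(response_time)    # base 2 is spelled out
response_time_log10 = math.log10(response_time)  # base 10 is spelled out
```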
Referencing
Note that “id” typically implies a context within which the “id” is unique. That context must be made explicit. For example, trial_id may identify trials within a trial table for one activity completed by one subject.
If there is a column named id (i.e., without prefix), it is expected to be a primary key and that other tables or files refer to this column; if such a link between tables does not exist, use index or name instead.
The postfix _id does not imply a particular data type: both integers and strings are valid.
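The relationship between a *_id column and the id primary key of the referenced table can be sketched with plain Python data structures. The participant and sessions tables, their columns, and the helper resolve_participant are all hypothetical and serve only to show the lookup.

```python
# Hypothetical "participant" table, keyed by its primary key (the id column).
participant = {
    215: {"id": 215, "age": 24.5},
    216: {"id": 216, "age": 31.2},
}

# Hypothetical source table: participant_id refers to participant.id.
sessions = [
    {"session_index": 1, "participant_id": 215},
    {"session_index": 2, "participant_id": 216},
    {"session_index": 3, "participant_id": 215},
]


def resolve_participant(row):
    """Follow participant_id to the matching row of the participant table."""
    return participant[row["participant_id"]]


ages = [resolve_participant(row)["age"] for row in sessions]
print(ages)  # [24.5, 31.2, 24.5]
```

The same lookup works unchanged if the ids are strings rather than integers, since only equality between participant_id and id matters.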
*_uuids are expected to be globally unique.
*_uuids are not expected to be human interpretable.
Avoid using the _uid suffix to refer to a UUID variable.
Within BDM, string-formatted Version 7 UUIDs are preferred over older versions or corresponding 128-bit integers. For example: 01934efd-35d5-79db-9aca-fc29b0451cd1.
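One way to check that a value is a string-formatted Version 7 UUID is to parse it with Python's standard uuid module and inspect the version field, as in the sketch below. The helper name is_uuid7_string is hypothetical. Note that uuid.UUID also accepts a few alternative spellings (for example without hyphens), so this checks the version rather than the exact formatting.

```python
import uuid


def is_uuid7_string(value):
    """Return True if value parses as an RFC-variant UUID whose version field is 7."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return parsed.variant == uuid.RFC_4122 and parsed.version == 7


print(is_uuid7_string("01934efd-35d5-79db-9aca-fc29b0451cd1"))  # True
print(is_uuid7_string(str(uuid.uuid4())))                       # False (version 4)
```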
There is no single widespread standard for hashing; rather there are multiple algorithms that can be used depending on the use case. You can use either CRC32 (8 hexadecimal characters; e.g., “d87f7e0c” for the input “test”) or SHA256 (64 hexadecimal characters; e.g., “9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08” for the same input) depending on the probability of collision (i.e., two hashes for different data being identical). When that collision probability is deemed high, use SHA256.
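Both algorithms are available in the Python standard library. The sketch below hashes the same bytes with each; the helper names crc32_hash and sha256_hash are hypothetical. A CRC32 checksum is 32 bits, so it prints as 8 hexadecimal characters, while a SHA256 digest is 256 bits and prints as 64.

```python
import hashlib
import zlib


def crc32_hash(data: bytes) -> str:
    """CRC32 checksum as 8 lowercase hexadecimal characters."""
    return format(zlib.crc32(data), "08x")


def sha256_hash(data: bytes) -> str:
    """SHA256 digest as 64 lowercase hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


record = b"test"
print(crc32_hash(record))   # d87f7e0c
print(sha256_hash(record))  # 9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
```

With only 32 bits, CRC32 collisions become likely once a table holds tens of thousands of distinct records, which is when the longer SHA256 digest is the safer choice.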
Note that “index” typically implies a context within which the indexing occurs, and that context must be made explicit. For example, trial_index may index trials within a block.
BDM follows the convention of 1-based indexing: always starting counting/indexing from 1 rather than 0.
Avoid the use of *_number because it is ambiguous.
As with index and id, repetition assumes a context which must be clarified when ambiguous.
Repetition is 0-based: it starts “counting” at 0 rather than 1; *_iteration instead of *_repetition would make it 1-based like indices, but it is less explicit and thus less preferred.
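The difference between 1-based indices and 0-based repetitions can be seen side by side in the following sketch, which annotates a sequence of stimuli. The function name add_index_and_repetition, the column names, and the stimulus names are hypothetical; the context assumed here is a single block of trials.

```python
from collections import defaultdict


def add_index_and_repetition(stimuli):
    """Annotate each stimulus with a 1-based trial_index and a 0-based repetition."""
    seen = defaultdict(int)  # how many times each stimulus has already occurred
    rows = []
    for position, stimulus in enumerate(stimuli, start=1):
        rows.append({
            "trial_index": position,               # 1-based: first trial is 1
            "stimulus_name": stimulus,
            "stimulus_repetition": seen[stimulus],  # 0-based: first occurrence is 0
        })
        seen[stimulus] += 1
    return rows


rows = add_index_and_repetition(["A", "B", "A", "A"])
print([r["trial_index"] for r in rows])          # [1, 2, 3, 4]
print([r["stimulus_repetition"] for r in rows])  # [0, 0, 1, 2]
```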