Units and values

The ‘root cause’ of the loss of the spacecraft was the failed translation of English units into metric units in a segment of ground-based, navigation-related mission software, …

Arthur Stephenson
Chairman of the NASA Mars Climate Orbiter Mission Failure Investigation Board
From CMIXF-12


Some variables have measurement units. Variables should use the standard measurement unit for their type (e.g., duration in seconds; defined below), or the default unit assigned to them specifically within this document. Variables that use a non-standard, non-default measurement unit MUST specify the units explicitly using the format <variable>_in_<unit>.

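Example: response_time_in_milliseconds (a duration expressed in milliseconds rather than in the default unit, seconds).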

When the units are a single word it’s OK to use the complete word; if it has multiple words, it’s better to abbreviate them unless the abbreviation can lead to confusion or ambiguity. For example, prefer seconds to s but mpg to miles_per_gallon.

Note

Each measurement dimension has a default unit; these defaults are highlighted in bold below. In addition, some specific variables have their own default units. If a “standard” variable has a default unit, that default overrides the dimension default in this list. For example, the default for duration is seconds; age is a duration, but its default is years rather than seconds.
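As a minimal sketch of the naming rule above, the following Python helper appends a unit suffix only when the unit differs from the applicable default. The function and table names are hypothetical, and the lookup tables contain only the two defaults stated in this section (seconds for duration, years for age).

```python
# Hypothetical lookup tables: only these two defaults are stated in the text.
DIMENSION_DEFAULTS = {"duration": "seconds"}
VARIABLE_DEFAULTS = {"age": "years"}


def default_unit(variable, dimension):
    """Return the default unit; a variable-specific default wins over the dimension default."""
    return VARIABLE_DEFAULTS.get(variable, DIMENSION_DEFAULTS.get(dimension))


def column_name(variable, dimension, unit):
    """Build a variable name, adding the _in_<unit> suffix only for non-default units."""
    if unit == default_unit(variable, dimension):
        return variable
    return f"{variable}_in_{unit}"


print(column_name("response_time", "duration", "seconds"))       # response_time
print(column_name("response_time", "duration", "milliseconds"))  # response_time_in_milliseconds
print(column_name("age", "duration", "years"))                   # age
print(column_name("age", "duration", "seconds"))                 # age_in_seconds
```

Keeping the defaults in one table means the override rule (variable default before dimension default) is applied in a single place.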

This document does not cover the subject of units exhaustively and will be updated as new use cases arise.

Formats

Numeric values

It can sometimes be tempting to floor, ceil or round a float variable into an integer. For example, one might assume that age expressed in years should be an integer, or that response times expressed in milliseconds should be integers as well.

Such data transformations should be avoided by default because they discard information. A data analyst who wishes to round a variable for a particular purpose can still perform that operation later on and name the resulting variable accordingly (e.g., age_floor).
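The point can be illustrated with a small Python sketch: the stored value stays a float, and the rounded version is derived later under its own name. The record and its values are made up for illustration.

```python
import math

# Hypothetical record: age is stored with full precision, in years (its default unit).
record = {"age": 27.83}

# An analyst who needs whole years derives a new, explicitly named variable
# instead of overwriting the original.
record["age_floor"] = math.floor(record["age"])

print(record)  # {'age': 27.83, 'age_floor': 27}
```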

Date, date time and timezones

Variables that express a “datetime” must be formatted using ISO 8601. Whenever possible, encode time at millisecond resolution (or finer) and include the timezone corresponding to participants’ local date and time of day (and not the timezone of the server collecting the data).

To make this last point clear, consider an online study with participants all over the world. At 6pm UTC (in summer, when daylight saving time applies), it is 11am in Los Angeles, 2pm in New York, 8pm in Paris and 3am (of the next day) in Tokyo. One would clearly expect a cognitive test completed at 6pm UTC to yield very different results depending on participants’ local timezone (e.g., 11am versus 3am). From a psychological point of view, then, it is participants’ local time that matters (for a system administrator, on the other hand, UTC time would be more useful). Hence, we should express datetime variables in participants’ local timezone.
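A minimal Python sketch of this conversion, using only the standard library, is given below. The date and the fixed UTC offsets (-07:00 for Los Angeles in summer, +09:00 for Tokyo) are assumptions made for illustration; a real pipeline would more likely take the offset from the participant's device or a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical server-side timestamp: 6pm UTC, with millisecond precision.
server_time = datetime(2009, 7, 15, 18, 0, 0, 512000, tzinfo=timezone.utc)

# Assumed fixed offsets for illustration only.
los_angeles = timezone(timedelta(hours=-7))
tokyo = timezone(timedelta(hours=9))

# Same instant, expressed in each participant's local time as ISO 8601.
local_la = server_time.astimezone(los_angeles).isoformat(timespec="milliseconds")
local_tokyo = server_time.astimezone(tokyo).isoformat(timespec="milliseconds")

print(local_la)     # 2009-07-15T11:00:00.512-07:00
print(local_tokyo)  # 2009-07-16T03:00:00.512+09:00
```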

There are cases where recording local timezones may infringe privacy. In such cases there are several possible solutions:

  1. remove timezones (11am Los Angeles -> 11am, 2pm New York -> 2pm);
  2. replace timezones while keeping date and time of day (11am Los Angeles -> 11am UTC, 2pm New York -> 2pm UTC);
  3. keep timezones but shift days (cf. BIDS);
  4. remove timezones and shift days.

We recommend solution 1, where timezones are “missing”, rather than the other solutions, which inject “lies” into the data. Hence, when timezones are not specified we assume (contrary to BIDS) that date_times are expressed in participants’ local (but unknown) timezone.
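Solution 1 can be sketched in Python as follows: the local date and time of day are kept unchanged and only the timezone information is dropped. The function name is hypothetical, and the input is assumed to be an ISO 8601 string with a numeric UTC offset.

```python
from datetime import datetime


def drop_timezone(local_iso_datetime):
    """Keep the local date and time of day, but remove the UTC offset (solution 1)."""
    parsed = datetime.fromisoformat(local_iso_datetime)
    # replace(tzinfo=None) discards the offset without shifting the clock time.
    return parsed.replace(tzinfo=None).isoformat(timespec="milliseconds")


print(drop_timezone("2009-07-15T11:00:00.512-07:00"))  # 2009-07-15T11:00:00.512
```

Note that the clock time is not converted to UTC first; doing so would turn this into solution 2 and misrepresent the participant's local time.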

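Example: “2009-10-31T01:48:52.512Z” (the trailing Z denotes UTC; a participant-local offset would instead be written as, e.g., -07:00, and is omitted entirely when the timezone is unknown).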

Note: the “year-month-day-hour-minute-second-ms” ordering is consistent with the hierarchy principle and provides a better view of the raw data than the format that would display day-month-year.

Languages

When referring to particular languages, we use the ISO 639-1 standard (two-letter codes, e.g., en for English).
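As a rough illustration, the Python sketch below checks that a value has the shape of an ISO 639-1 code (two lowercase letters). It is only a shape check, not a lookup against the official code list, and the small mapping is an illustrative subset rather than part of this guide.

```python
import re

# ISO 639-1 codes consist of exactly two letters; we assume lowercase here.
ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")

# Illustrative subset of language codes.
LANGUAGE_CODES = {"English": "en", "French": "fr", "German": "de", "Japanese": "ja"}


def looks_like_iso_639_1(code):
    """Return True if the code has the two-letter shape of an ISO 639-1 code."""
    return ISO_639_1_SHAPE.fullmatch(code) is not None


print(looks_like_iso_639_1("en"))   # True
print(looks_like_iso_639_1("eng"))  # False (three letters, as used by ISO 639-2)
```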

Colors

Colors can be expressed in many different ways, including hex (e.g., #A9A9A9), RGB (e.g., (169, 169, 169)) or CMYK (e.g., (0, 0, 0, 33.73)). We use two conventions to describe colors depending on the use case.

If the exact color value is not relevant and the information is destined for human readers, we refer to colors using color words (e.g., “red”, “green”, “blue”).

If, however, colors need to be specified exactly, we use the RGBA format, which defines a color using 4 floating point values, each within the [0, 1] range, where the first three numbers define the red, green and blue components of the color and the last number its opacity (0 = transparent, 1 = opaque).
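A minimal Python sketch of converting a hex color into this RGBA representation is shown below. The function name is hypothetical, and the input is assumed to be a 6-digit hex string with an optional leading #.

```python
def hex_to_rgba(hex_color, opacity=1.0):
    """Convert a 6-digit hex color into four floats in the [0, 1] range."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    if not 0.0 <= opacity <= 1.0:
        raise ValueError("opacity must lie within [0, 1]")
    # Each pair of hex digits encodes one channel on a 0-255 scale.
    red, green, blue = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return (red, green, blue, opacity)


# The gray used in the examples above: each channel is 169 / 255, fully opaque.
print(hex_to_rgba("#A9A9A9"))
```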

Format of data values

Sometimes it is necessary to spread a table by one of its variables, meaning that the values of a particular variable become the names of new columns. In order for those new columns to be named in a way that is consistent with this style guide (i.e., lower_snake_case) it is necessary that variable values follow that same standard.
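To make this concrete, here is a small pure-Python sketch of spreading a long table into a wide one. The helper names and the example rows are hypothetical. The sketch also normalizes values such as "No-Go" to lower_snake_case, which would be unnecessary if the values already followed the standard.

```python
import re


def to_lower_snake_case(value):
    """Turn a value such as 'No-Go' into a lower_snake_case name ('no_go')."""
    words = re.findall(r"[A-Za-z0-9]+", str(value))
    return "_".join(word.lower() for word in words)


def spread(rows, key, value, index):
    """Spread `rows` so that each distinct value of `key` becomes a column."""
    wide = {}
    for row in rows:
        entry = wide.setdefault(row[index], {index: row[index]})
        entry[to_lower_snake_case(row[key])] = row[value]
    return list(wide.values())


# Hypothetical long-format data: one row per participant and condition.
rows = [
    {"participant": "p01", "condition": "No-Go", "accuracy": 0.9},
    {"participant": "p01", "condition": "Go", "accuracy": 0.8},
]

print(spread(rows, key="condition", value="accuracy", index="participant"))
# [{'participant': 'p01', 'no_go': 0.9, 'go': 0.8}]
```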

Missing values and errors

There is considerable variability in what is considered a missing value and in how such values are dealt with. For example, in a go/no-go task where the correct response to a no-go stimulus is to NOT respond, should a non-response be considered a missing value? Can a missing value evaluate to a correct response on a trial? In a different test, if a person had to choose between two options within a certain time interval and failed to respond within that time, should the accuracy of the trial be set to NA, as the person did not in fact make a choice between the two options?

First of all, it is useful to have a variable that explicitly indicates whether the participant timed out or not (rather than having to infer this information by looking at the presence of missing values within a particular task context).

Second, missing values should be applied only to the data that is in fact missing. If a person did not press a key in the go/no-go task, the response time would be missing but that would be considered to reflect the participant’s choice for a “no-go” which in that context is a valid response that could evaluate to correct.
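The go/no-go case can be sketched in Python as follows. The function and argument names are hypothetical, and a missing response time is represented by None.

```python
def score_go_no_go(stimulus, response_time):
    """Evaluate a go/no-go trial; a missing response time means no key was pressed."""
    responded = response_time is not None
    # Correct means: responded on a "go" trial, or withheld on a "no_go" trial.
    return responded == (stimulus == "go")


print(score_go_no_go("no_go", None))   # True: withholding is the correct response
print(score_go_no_go("go", None))      # False: a response was required
print(score_go_no_go("go", 0.412))     # True
print(score_go_no_go("no_go", 0.305))  # False
```

Only the response time is missing here; the correctness of the trial is always defined.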

Third, a response evaluates to correct if a specific set of conditions is met and evaluates to incorrect in all other cases. In particular, one should not simply ignore responses that are missing and evaluate average performance on only those trials where a response was given: a weak student who answers (correctly) only the easy questions in a test should not get a higher score than a better student who responded to all questions but made some errors on the hard questions.
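The student example can be worked through in a short Python sketch. The answer key and the answers are made up; unanswered questions are represented by None.

```python
def proportion_correct(answers, answer_key):
    """Score over all questions: an unanswered question (None) counts as incorrect."""
    correct = sum(1 for given, expected in zip(answers, answer_key) if given == expected)
    return correct / len(answer_key)


def proportion_correct_ignoring_missing(answers, answer_key):
    """The discouraged alternative: score only the questions that were answered."""
    answered = [(g, e) for g, e in zip(answers, answer_key) if g is not None]
    return sum(1 for g, e in answered if g == e) / len(answered)


answer_key = ["a", "b", "c", "d"]
weak_student = ["a", "b", None, None]   # answered only the two easy questions
better_student = ["a", "b", "c", "a"]   # answered everything, one error

print(proportion_correct(weak_student, answer_key))                   # 0.5
print(proportion_correct(better_student, answer_key))                 # 0.75
print(proportion_correct_ignoring_missing(weak_student, answer_key))  # 1.0
```

Ignoring missing answers would rank the weak student (1.0) above the better student (0.75), which is the outcome the rule is meant to prevent.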

Regarding the coding of missing values, some people use various numeric codes to “tag” missing values (e.g., -9, -99 and -999). This can lead to errors down the line (e.g., missing values are not noticed as missing, or actual values look similar to the codes). This practice should be avoided, and missing values should be coded explicitly as NA, with information in other variables providing further context about the type of “missingness” (e.g., timed_out=True).
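A minimal Python sketch of recoding such a legacy dataset is shown below. It assumes, purely for illustration, that every numeric code in the hypothetical dataset marks a timeout; None plays the role of NA.

```python
# Assumption: in this hypothetical legacy dataset, all three codes mean "timed out".
LEGACY_MISSING_CODES = {-9, -99, -999}


def recode_trial(trial):
    """Replace numeric missing-value codes with None and make the reason explicit."""
    cleaned = dict(trial)  # leave the original record untouched
    was_coded_missing = trial["response_time"] in LEGACY_MISSING_CODES
    if was_coded_missing:
        cleaned["response_time"] = None
    cleaned["timed_out"] = was_coded_missing
    return cleaned


print(recode_trial({"trial": 1, "response_time": -999}))
# {'trial': 1, 'response_time': None, 'timed_out': True}
print(recode_trial({"trial": 2, "response_time": 0.431}))
# {'trial': 2, 'response_time': 0.431, 'timed_out': False}
```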

Default units

Below we list the names we use for different measurement units by measurement dimension (e.g., duration); these are the names that should be used when suffixing variables (to explicitly indicate units; e.g., response_time_in_milliseconds). Default units for each measurement dimension are highlighted in bold; those units need not be explicit in the variable names. Note that some specific variables listed in this document have their own default units (e.g., age in years); the variable specific default unit overrides the dimension default (i.e., duration is typically expressed in seconds, but for age we use years).

Duration

  • milliseconds
  • seconds
  • minutes
  • hours
  • days
  • months
  • years
  • decades
  • centuries
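As an illustration, the Python sketch below converts a duration from an explicit unit to seconds, the default stated above for this dimension. The function and table names are hypothetical. Only fixed-length units are covered, because months, years and longer units vary in length and need a calendar-aware conversion.

```python
# Fixed-length duration units only; months and years are deliberately left out.
SECONDS_PER_UNIT = {
    "milliseconds": 1 / 1000,
    "seconds": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
}


def to_seconds(value, unit):
    """Convert a duration expressed in `unit` to seconds (the default unit)."""
    try:
        factor = SECONDS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"no fixed conversion to seconds for unit {unit!r}") from None
    return value * factor


# response_time_in_milliseconds = 412 becomes response_time of about 0.412 seconds.
response_time = to_seconds(412, "milliseconds")
print(to_seconds(2, "hours"))  # 7200
```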

Weight

  • mg: milligrams
  • g: grams
  • kg: kilograms
  • pounds

Length

  • mm: millimeters
  • cm: centimeters
  • m: meters
  • km: kilometers
  • inches
  • feet
  • feet_inch: mix of feet and inches (e.g., 5’3”)
  • pixels: number of pixels; the pixel size in cm depends on the subject’s screen size and resolution.
  • screen_fraction: fraction of subject’s screen diagonal expressed between 0 and 1, from bottom-left to top-right. For example, if the diagonal of the screen is 40cm long, a screen_fraction of 0.25 corresponds to 10cm.
Note

For more information on viewports, see CSS viewport-percentage length on MDN Web Docs.
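The screen_fraction and pixels units from the length list above can be related to centimeters with a short Python sketch. The function names are hypothetical, and the pixel conversion assumes square pixels and a known physical screen diagonal.

```python
import math


def screen_fraction_to_cm(fraction, diagonal_cm):
    """Convert a fraction of the screen diagonal (0 to 1) into centimeters."""
    if not 0 <= fraction <= 1:
        raise ValueError("screen_fraction must lie between 0 and 1")
    return fraction * diagonal_cm


def pixels_to_cm(pixels, width_px, height_px, diagonal_cm):
    """Convert a length in pixels into centimeters, assuming square pixels."""
    diagonal_px = math.hypot(width_px, height_px)
    return pixels * diagonal_cm / diagonal_px


# The example from the list: a quarter of a 40cm diagonal is 10cm.
print(screen_fraction_to_cm(0.25, 40))  # 10.0
```

For instance, pixels_to_cm(100, 1920, 1080, 40) gives the physical length of 100 pixels on a 1920 x 1080 screen with a 40cm diagonal.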

Frequency

  • hertz: count per second