Design process

How BDM is being developed

Here we discuss the guiding principles and how they are applied to the development of the BDM.

BDM is similar to the domain-specific data models used in various domains, specifically designed to address the unique diversity of data in cognitive and behavioral sciences. The development of BDM is an iterative process guided by two design patterns: Domain-Driven Design and Data Lakehouse.

Domain-Driven Design aims to develop a shared understanding of data that works for both domain experts and technical experts.

And Data Lakehouse helps with harmonizing a diverse se of data without sacrificing flexibility or speed (cognitive data can be thought as multi-modal, even at the level of tasks).

These two patterns address the shortcomings of metadata-centric methods, which are often costly and ineffective at achieving reusability 1,2, and emphasize the need for a shift from ad-hoc metadata to robust data structures. They encourage a ubiquitous language for many cognitive tasks, crucial for developing tools and fostering collaboration in research.

Moreover, BDM aims to address data management during all the steps of a research workflow, from the start to its completion:

Main use cases of the BDM, from the start of a research project to its completion

Domain-Driven Design

Behavioral data is inherently complex due to the variety of data types, the diversity of experimental designs, and the need for precise data interpretation. Traditional data management approaches often struggle with the variability and context-dependence of behavioral data. Such domain-specific complexity are best addressed in software engineering by Domain-Driven Design (DDD), which is a structured approach that aligns technical design with the specific needs of the domain, which in our case is behavioral science.

One of the primary benefits of applying DDD to behavioral data is that it serves as the foundation for all study design, data collection, pre-processing, and analysis phases. It also ensures consistency across different studies while accommodating the unique aspects of different experiments. DDD’s focus on bounded contexts allows for the clear definition of key concepts like “Trial,” ensuring that these concepts are used consistently across different settings.

Unlike rigid standards, flexibility of DDD allows for cost-effective and optional opt-in for some of its features. Ontologies and standards that are purely based on domain knowledge often lead to unstructured, raw knowledge. DDD, on the other hand, addresses the need for developing tools to conduct research, and provides a language that bridges the gap between domain experts and technical experts.

Events

Most of the time, raw data is collected as a series of events. These events are often recorded in a transactional way, meaning that they are stored as they occur, without waiting for some final state. This approach is more convenient for software engineers, as most code is written to control user interactions and store the corresponding data.

Trial: aggregate root

Trial is the main aggregate root of the BDM. It represents a single instance of an agent performing a task, including all the data collected during that instance. The Trial aggregate includes various components, such as: Responses, Stimuli, Options, Outcomes, Evaluation, etc. These components are essential for understanding the context and results of the trial.

Here is a diagram of the trial concept. The diagram shows the relationships between the domain-specific concepts and their attributes.

Core domain concepts

  • Response
  • Stimulus
  • Option

Supporting domain concepts

  • Agent
  • Task projection (task-specific events-to-trials mapping/transformation)
  • Experimental design (e.g., studyflow, adaptive parameters, conditions, etc)
  • Instruments (i.e., the environment in which the task is performed)

Data Lakehouse

A missing but important aspect in behavioral datasets is how to make analysis of data easy, cost-effective, and consistent while keeping the harmonizing/federation of data straightforward. For this reason, BDM relies on data lakehouse architecture which, unlike previous approaches, provides a lost-cost ingestion/storage/consumption based on a schema-on-read. Lakehouse allows, during study design and data collection, to only focus on the ubiquitous language of trials and events and later one, during the analysis and modeling, to apply loosely-coupled techniques to standardize datasets.

Overall, considering events and trials as the ubiquitous domain concepts and data lakehouse as the harmonization approach make it possible to scale and manage behavioral analysis systems. New challenges involve storage, debugging, increased traffic, and extending the data architecture to new tasks and modalities.

Data partitioning

Identical data structure

Levels of data

Behavioral experiments require and produce a variety of interrelated data, from instructions shown to the subjects, task parameters, and recorded interactions, to aggregated trials data that are interpretable in accord with the scientific question. If we consider cognitive tasks as a set of interactions that participants do, trial only captures the final state of those interactions. Events on the other hand allow us to reconstruct the history of interactions. However, events are not commonly used during the analysis and most researchers are only interested in the final states of events, i.e., trials.

For software engineers however, it’s more convenient to collect transactional events rather than waiting for some final states; most codes are written as controlling some user interactions and storing the corresponding data. Trial data is an implicit post-processing of those event data. For a full reconstruction of what happened during the task, events data is more useful and it also leads to datasets that can be used in more modern disciplines such as RL where we are interested in learning how an agent interacts with the environment, rather than the final decisions.

Data governance

  • Versioning
  • Metadata
  • MLOps
Back to top

Footnotes

  1. N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and L. M. Aroyo, “Everyone wants to do the model work, not the data work: Data Cascades in High-Stakes AI,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, in CHI’21. New York, NY, USA: Association for Computing Machinery, May 2021. doi: 10.1145/3411764.3445518.↩︎

  2. M. Hulsebos, W. Lin, S. Shankar, and A. Parameswaran, “It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?,” in Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, in HILDA 24. New York, NY, USA: Association for Computing Machinery, June 2024. doi: 10.1145/3665939.3665959.↩︎