Model a preprocessing pipeline
Studyflow’s data operations let you describe what each task in a pipeline does, not just that data flows through it. This guide covers the common operations and the patterns they compose into.
The operations
Each operation is a marker (a small f badge) on an otherwise normal BPMN task. The task remains executable as a script or service task; the marker tells readers and tooling what kind of transformation it performs.
| Operation | Cardinality | When to use |
|---|---|---|
| Map | 1 → 1 | Element-wise transform (rescale, compute a column, parse a field). |
| Filter | N → subset(N) | Drop rows that fail a criterion (RT < 100 ms, missing data). |
| FlatMap | 1 → N | Unnest, expand events per trial, explode arrays. |
| Reduce | N → 1 (per group) | Aggregate to summary statistics (mean RT, accuracy). |
| Group | N → G groups | Partition by participant, condition, or block. |
| Transform | generic | Any pure function not better described by the above. |
| Compose | n/a | Bundle several primitive operations into one logical step. |
In the modeler palette these elements use Title Case names (Map, Filter, Reduce, …) while the underlying operationType values serialized in the .studyflow file are lowercase (map, filter, reduce, …) – see Data.
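To make the cardinalities in the table concrete, here is a minimal sketch in plain Python. It is not Studyflow API: the `trials` records and their field names (`participant`, `rt_ms`, `keys`) are invented for illustration, and the code only shows how many records each kind of operation takes in and gives back.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial records; field names are assumptions for this sketch.
trials = [
    {"participant": "p1", "rt_ms": 420, "keys": ["a", "b"]},
    {"participant": "p1", "rt_ms": 80, "keys": ["a"]},
    {"participant": "p2", "rt_ms": 510, "keys": []},
]

# Map (1 -> 1): every input row yields exactly one output row.
mapped = [{**t, "rt_s": t["rt_ms"] / 1000} for t in trials]

# Filter (N -> subset(N)): rows failing the criterion are dropped.
kept = [t for t in trials if t["rt_ms"] >= 100]

# FlatMap (1 -> N): each row expands to zero or more rows (one per key press).
events = [
    {"participant": t["participant"], "key": k}
    for t in trials
    for k in t["keys"]
]

# Group (N -> G groups): partition rows by participant.
groups = defaultdict(list)
for t in trials:
    groups[t["participant"]].append(t)

# Reduce (N -> 1 per group): one summary value per participant.
mean_rt = {p: mean(t["rt_ms"] for t in rows) for p, rows in groups.items()}
```

Map keeps the row count, Filter can only shrink it, FlatMap can grow or shrink it, and Reduce collapses each group to a single value.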
Filter changes the dataset by removing rows. A data-driven gateway changes the control flow. They are not interchangeable – see Data for the distinction.
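The difference can be sketched in plain Python. The function names, the 100 ms cutoff, and the `min_trials` threshold are hypothetical and are not part of Studyflow or any BPMN engine; the point is only that a filter returns a different dataset, while a gateway-style decision returns a path and leaves the data untouched.

```python
# Hypothetical trial records for illustration.
trials = [{"rt_ms": 420}, {"rt_ms": 80}, {"rt_ms": 510}]


def filter_fast_responses(rows, min_rt_ms=100):
    """Filter: returns a smaller dataset; execution stays on the same path."""
    return [r for r in rows if r["rt_ms"] >= min_rt_ms]


def choose_branch(rows, min_trials=3):
    """Gateway-style decision: inspects the data and names the next path.

    The rows themselves are not modified.
    """
    return "analyze" if len(rows) >= min_trials else "exclude_participant"


kept = filter_fast_responses(trials)   # dataset changes: 3 rows -> 2 rows
branch = choose_branch(kept)           # control flow changes: which task runs next
```

Here the filter removes the 80 ms trial, and the decision function then routes the participant to an exclusion path because too few trials remain.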
Ready-made preprocessing templates – an fMRIPrep task (PreprocessfMRI) and an EEGPrep subprocess (PreprocessEEG) – ship with the optional omniprocess schema; see Extensions.
A typical pipeline
A behavioral RT analysis pipeline usually has this shape:
- Read raw data – a normal task with a data input from a Dataset or Table.
- Filter invalid trials (RT outliers, missing responses).
- Map to compute derived columns (log RT, condition labels).
- Group by participant and condition.
- Reduce to per-group summary statistics.
- Write output to a new Dataset or Table.
Modeled in Studyflow, each step is a task with the corresponding marker, connected by sequence flows. Data flows are dashed edges from/to the data stores.
The canonical RT pipeline figure should live at docs/assets/img/guides/preprocessing-pipeline.svg.
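The six steps above could be bound to functions along these lines. This is a sketch under stated assumptions, not the Studyflow runtime: the column names (`participant`, `condition`, `rt_ms`), the inline CSV standing in for a data store, the 100 ms cutoff, and all function names are invented for illustration.

```python
import csv
import io
import math
from collections import defaultdict
from statistics import mean

# Hypothetical raw data; in a real pipeline this would come from a data store.
RAW = """participant,condition,rt_ms
p1,congruent,400
p1,congruent,500
p1,congruent,50
p1,incongruent,600
p2,congruent,500
p2,incongruent,
"""


def read_raw(text):
    """Read: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def is_valid(row, min_rt_ms=100.0):
    """Filter criterion: a response exists and is not implausibly fast."""
    return row["rt_ms"] != "" and float(row["rt_ms"]) >= min_rt_ms


def add_log_rt(row):
    """Map: convert RT to a number and add a derived log RT column."""
    rt = float(row["rt_ms"])
    return {**row, "rt_ms": rt, "log_rt": math.log(rt)}


def group_by(rows, keys):
    """Group: partition rows by the given key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return groups


def summarize(groups):
    """Reduce: one summary row per (participant, condition) group."""
    return [
        {
            "participant": p,
            "condition": c,
            "n": len(rows),
            "mean_rt_ms": mean(r["rt_ms"] for r in rows),
        }
        for (p, c), rows in sorted(groups.items())
    ]


def write_output(summary):
    """Write: serialize the summary rows back to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["participant", "condition", "n", "mean_rt_ms"]
    )
    writer.writeheader()
    writer.writerows(summary)
    return buf.getvalue()


rows = read_raw(RAW)
valid = [r for r in rows if is_valid(r)]
mapped = [add_log_rt(r) for r in valid]
groups = group_by(mapped, ["participant", "condition"])
summary = summarize(groups)
output = write_output(summary)
```

Each function corresponds to one marked task in the diagram, so the code and the model can be reviewed side by side.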
Why this matters
Three concrete benefits over describing the pipeline only in prose or only in code:
- Reviewers can verify the analysis plan without reading the code.
- The diagram itself is the pre-registered analysis plan, so there is no drift between “what we planned” and “what we did”.
- Reuse is straightforward – copy the Filter -> Map -> Group -> Reduce sub-process into the next study.
Compose vs primitive
For exploratory work, primitive markers (Map, Filter, Reduce) keep the diagram explicit. For published pipelines, Compose lets you encapsulate a multi-step operation as one task with a name like outlier-trimmed RT summary. Use Compose when the operation is a unit of meaning (you’d give it a name in a paper), and primitives when the steps are themselves the story.
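In code terms, Compose resembles function composition: several primitive steps are chained and exposed under one name. The sketch below is an analogy in plain Python, not how Studyflow implements Compose; the `compose` helper, the step functions, and the 100–2000 ms trimming bounds are all assumptions made for illustration.

```python
from functools import reduce
from statistics import mean


def compose(*steps):
    """Chain steps left to right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def drop_outliers(rts, low_ms=100, high_ms=2000):
    """Primitive Filter step: keep RTs inside the (assumed) plausible range."""
    return [rt for rt in rts if low_ms <= rt <= high_ms]


def summarize(rts):
    """Primitive Reduce step: collapse the RTs to summary statistics."""
    return {"n": len(rts), "mean_rt_ms": mean(rts)}


# One named unit of meaning, built from two primitives.
outlier_trimmed_rt_summary = compose(drop_outliers, summarize)

result = outlier_trimmed_rt_summary([80, 400, 600, 5000])
```

The caller sees a single operation with a meaningful name, while the primitive steps stay available for the exploratory version of the diagram.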
Streaming and batch
Map and Filter are stateless and work on either batch or streaming data. Reduce and Group are stateful – for streaming, they require windowing or accumulator semantics, typically passed as arguments to the bound function.
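The difference can be sketched with Python generators. The names (`stream_filter`, `RunningMean`, `running_means`) and the record fields are hypothetical, and the accumulator shown is only one possible choice; a windowed variant would reset or expire state instead of keeping it indefinitely.

```python
from collections import defaultdict


def stream_filter(rows, min_rt_ms=100):
    """Stateless Filter: each row is judged on its own as it arrives."""
    for row in rows:
        if row["rt_ms"] >= min_rt_ms:
            yield row


class RunningMean:
    """Accumulator for a streaming Reduce: keeps a count and a total."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count


def running_means(rows):
    """Stateful Group + Reduce: emit the updated per-participant mean per row."""
    accumulators = defaultdict(RunningMean)  # state kept across rows
    for row in rows:
        acc = accumulators[row["participant"]]
        acc.update(row["rt_ms"])
        yield row["participant"], acc.mean


# Hypothetical incoming stream of trial records.
stream = iter([
    {"participant": "p1", "rt_ms": 400},
    {"participant": "p1", "rt_ms": 80},
    {"participant": "p2", "rt_ms": 500},
    {"participant": "p1", "rt_ms": 600},
])

updates = list(running_means(stream_filter(stream)))
```

`stream_filter` needs no memory of earlier rows, whereas `running_means` cannot produce a result without the per-group state it carries between rows.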