Evidence-accumulation models (EAMs) are powerful tools for understanding human and animal decision-making (Donkin & Brown, 2018; Evans & Wagenmakers, 2019; Gold & Shadlen, 2007; P. L. Smith & Ratcliff, 2024). They enable quantitative measurement of latent decision processes that are confounded in typical (e.g., linear model) analyses of response time (RT) and error rate (Lerche & Voss, 2020). EAMs explain key benchmark phenomena that arise in decision-making tasks (e.g., speed/accuracy trade-offs, asymmetries in the speed of correct and incorrect responses, and the characteristic positive skew of RT distributions; Ratcliff & McKoon, 2008). Since their introduction in the 1960s and 1970s (Audley & Pike, 1965; Laming, 1968; Link & Heath, 1975; M. Stone, 1960; Vickers, 1970), EAMs have become one of the most successful theoretical frameworks in cognitive psychology (Evans & Wagenmakers, 2019; Ratcliff et al., 2016; Ratcliff & McKoon, 2008; P. L. Smith & Ratcliff, 2024) and cognitive neuroscience (Forstmann et al., 2016; Forstmann, Wagenmakers, et al., 2011; Gold & Shadlen, 2007; Mulder et al., 2014; Schall, 2019; P. L. Smith & Ratcliff, 2004). Furthermore, they are increasingly being used to answer questions in domains such as behavioral economics (Busemeyer et al., 2019; Krajbich et al., 2014; Krajbich & Rangel, 2011) and human factors/ergonomics (Boag et al., 2023) and in clinical/health-care settings (Copeland et al., 2023; Ratcliff et al., 2022; White et al., 2010).
Obtaining valid inferences from EAMs relies on achieving a close match between model assumptions and features of the task and data to which the model is applied. Failing to achieve an appropriate task-model match can lead to misleading or spurious conclusions (e.g., Cassey et al., 2014; Ratcliff & Kang, 2021). However, the EAM literature lacks a comprehensive articulation of how to achieve a good task-model match. In this article, we provide practical guidance for designing tasks appropriate for EAMs, relating experimental manipulations to EAM parameters, planning for sample size, collecting and preparing data, and conducting and reporting an EAM analysis. We point out problems that can arise if the models are used without sufficient regard for the factors that determine their validity. Sometimes, there is no one-size-fits-all answer, and finding an appropriate design may require careful judgment and consideration of trade-offs (e.g., collecting more trials vs. maintaining participant engagement). To aid this process, we highlight the key issues and potential pitfalls affecting EAM analyses so that readers can better plan experiments for reliable EAM analysis. Our advice is grounded in prior methodological studies and our years of collective experience using EAMs to understand human and animal decision-making.
By encouraging good task-design practices, we hope to improve the quality and trustworthiness of future EAM research and applications. To make our advice as broadly applicable as possible, we do not focus on the details of specific EAMs. Instead, we focus on the common properties and design considerations shared by the most prominent basic EAM architectures (i.e., relative-evidence models, e.g., Ratcliff, 1978; Wagenmakers et al., 2007; Wagenmakers, van der Maas, et al., 2008; racing-accumulator models, e.g., Brown & Heathcote, 2008; Tillman et al., 2020; Usher & McClelland, 2001). Our advice is intended for researchers and students who wish to apply an existing “off-the-shelf” EAM to an experimental task to measure the cognitive processes driving decision-making behavior. Although our recommendations are intended for EAMs, many also apply more broadly to other cognitive-modeling approaches (e.g., reinforcement learning, R. C. Wilson & Collins, 2019).
In the next section, we outline the general features and assumptions of EAMs. The remainder of the article is structured according to a typical EAM-study workflow, illustrated in Figure 1. We first consider whether an EAM is the appropriate tool for a given research question. Next, we look at how to design EAM-appropriate experimental tasks and strategies for collecting informative data. We cover sample-size planning and discuss best practices for experimental procedure, assessing the quality of collected data, and fitting and evaluating models to obtain valid and reliable inferences. We discuss interpreting and reporting the results of an EAM analysis and close with advice on what to do when the standard models fail.

Figure 1. Overview of an idealized EAM workflow. Key steps in the workflow (bold) are shown with a selection of important methodological considerations and potential pitfalls (discussed in detail throughout the main text). Ignoring these considerations can compromise the robustness and informativeness of an EAM analysis. EAM = evidence-accumulation model.
The Architecture of Standard EAMs
EAMs assume that when presented with a stimulus (e.g., a left- or right-facing arrow), the decision maker samples evidence for the available actions or choice options (e.g., “Should I press the left or right arrow key?”) until a threshold amount of evidence is reached. Many prominent models assume within-trials noise in this accumulation process (Ratcliff, 1978; Tillman et al., 2020; Usher & McClelland, 2001), although it is possible to capture key RT phenomena assuming only (nonsystematic) between-trials noise (Brown & Heathcote, 2008). Reaching a threshold immediately triggers the motor movement for the overt response (e.g., pressing the left arrow key). Total RT is assumed to be the sum of three strictly sequential processing stages: (a) stimulus encoding, (b) decision-making (evidence accumulation), and (c) motor-response execution¹ (Bompas et al., 2023; Kelly et al., 2021; Servant et al., 2021; Weindel, Gajdos, et al., 2021). As we show, this places constraints on the timing and structure of decision-making tasks appropriate for use with EAMs.
Figure 2 depicts the two prominent classes of EAM architectures. In relative-evidence models, decisions are based on accumulating the difference in evidence between response options (e.g., Ratcliff, 1978; Ratcliff & McKoon, 2008; Ratcliff & Rouder, 1998; van Ravenzwaaij et al., 2017; Wagenmakers et al., 2007; Wagenmakers, van der Maas, et al., 2008). Relative-evidence models have historically been limited to decisions involving two choice options (but see Churchland et al., 2008; Ditterich, 2010; Kvam, 2019a; Niwa & Ditterich, 2008; P. L. Smith et al., 2020). By contrast, in racing-accumulator models, decisions are based on accumulating the absolute evidence for response options in separate modular accumulators (e.g., Bogacz et al., 2007; Brown & Heathcote, 2008; Heathcote & Love, 2012; Kirkpatrick et al., 2021; Rouder et al., 2015; Teodorescu & Usher, 2013; Tillman et al., 2020; Tsetsos et al., 2011; Usher et al., 2002; Usher & McClelland, 2001). Racing-accumulator models can accommodate any number of choice options, typically with an accumulator per choice. Although relative- and absolute-evidence models differ regarding how they conceptualize evidence, they have similar requirements for achieving a good task-model match and often arrive at the same substantive conclusions (Donkin, Brown, Heathcote, & Wagenmakers, 2011). In both architectures, decision-making is governed by the same three or four parameters, which are interpreted similarly across models (Voss et al., 2004). Moreover, both architectures have similar data-quality requirements and often give convergent results when applied to the same data (Donkin, Brown, Heathcote, & Wagenmakers, 2011; Dutilh et al., 2019).

Figure 2. Illustration of two standard EAM architectures. In (a) relative-evidence models, decisions are based on accumulating the difference in evidence between response options. The first threshold to be reached determines the overt response and RT. In (b) racing-accumulator models, decisions are based on accumulating the absolute evidence for each response option in a separate accumulator.
A comprehensive overview of key model parameters and their uses is given in the section Mapping Experimental Manipulations to EAM Parameters. Briefly, the models contain parameters controlling the evidence starting point (allowing for a priori biases), accumulation rate (controlling the speed of processing), threshold/boundary separation (controlling the amount of evidence required to make a response), and nondecision time (the sum of time taken for stimulus encoding and motor-response production). The basic frameworks also allow for nonsystematic across-trials variability in accumulation rate, starting point, and nondecision time, which accounts for commonly observed differences in the speed of correct and incorrect responses (Ratcliff & Rouder, 1998; Ratcliff & Tuerlinckx, 2002).
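To make the roles of these parameters concrete, the following is a minimal simulation sketch of a relative-evidence (diffusion-type) process. It is an illustration under stated assumptions rather than a fitting routine or the implementation of any particular EAM package: the function name `simulate_diffusion`, the argument names, the Euler time step, the within-trial noise scale of 1, and the parameter values are all hypothetical choices made for this example.

```python
import numpy as np

def simulate_diffusion(n_trials, drift, boundary, start_frac, ndt,
                       dt=0.001, noise_sd=1.0, max_time=5.0, seed=0):
    """Simulate choices and RTs from a basic two-boundary diffusion process.

    drift      : mean accumulation rate (evidence units per second)
    boundary   : threshold separation; lower bound is 0, upper bound is `boundary`
    start_frac : starting point as a fraction of `boundary` (0.5 = unbiased)
    ndt        : nondecision time in seconds (encoding + motor execution)
    Returns (choices, rts); choice is 1 (upper), 0 (lower), or -1 if no
    boundary was reached within `max_time` seconds of decision time.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(max_time / dt))
    choices = np.full(n_trials, -1, dtype=int)
    rts = np.full(n_trials, np.nan)
    for i in range(n_trials):
        # Evidence path: starting point plus the cumulative sum of noisy increments.
        increments = drift * dt + noise_sd * np.sqrt(dt) * rng.standard_normal(n_steps)
        path = start_frac * boundary + np.cumsum(increments)
        crossed = np.flatnonzero((path >= boundary) | (path <= 0.0))
        if crossed.size:
            step = crossed[0]
            choices[i] = int(path[step] >= boundary)
            # Total RT = decision time + nondecision time.
            rts[i] = (step + 1) * dt + ndt
    return choices, rts

choices, rts = simulate_diffusion(500, drift=1.5, boundary=1.2, start_frac=0.5, ndt=0.3)
done = choices >= 0
print("P(upper):", choices[done].mean(), " mean RT (s):", rts[done].mean())
```

In this sketch, raising `boundary` should slow responses and increase the proportion of upper-boundary choices for a positive `drift`, whereas changing `ndt` shifts every RT by a constant without affecting choices.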
As will be discussed (see section Going Beyond the Standard Models), the basic architecture has been extended to include additional mechanisms (e.g., Fific et al., 2010; McDougle & Collins, 2021; Miletić et al., 2021; Nosofsky & Palmeri, 1997, 2015; Pedersen et al., 2017) and to account for tasks/situations that violate various processing assumptions of the standard models (e.g., Diederich, 2024; Diederich & Trueblood, 2018; Hawkins et al., 2015; Holmes et al., 2016; Holmes & Trueblood, 2018; P.-S. Lee & Sewell, 2024; Little et al., 2018; P. L. Smith & Ratcliff, 2022; Ulrich et al., 2015; Voss et al., 2019; White et al., 2011; Zhang et al., 2014; for a review, see Evans & Wagenmakers, 2019). Most of the advice in this article will apply when working with these models. However, researchers should be aware that extended models may use a different set of processing assumptions and thus have idiosyncratic (mechanism-specific) design constraints.
Processing Assumptions of Standard EAMs
Here, we outline the core assumptions of the basic EAM framework that have implications for the design of tasks suitable for EAMs (summarized in Table 1). For data from an experimental task to be suitable, the task must satisfy the assumptions of the model. The core structural assumption of the models is that each decision is the result of a single, continuous (uninterrupted) evidence-accumulation process and culminates in a single discrete response. In short, the models apply to tasks in which one decision is followed by one response (Brown & Heathcote, 2008; Busemeyer & Townsend, 1993; Ratcliff, 1978; Usher & McClelland, 2001). Misapplying the models to decisions/tasks with different processing assumptions undermines their interpretability.
Table 1. Standard EAM Assumptions and Implications for Task Design
Note: EAM = evidence-accumulation model; RT = response time.
During a trial/decision, the models assume within-trials stationarity, which refers to the assumption that model parameters (e.g., accumulation rates and thresholds) do not change systematically while a decision is in progress (Ratcliff, 1978). For accumulation rates, this means that evidence accumulates at a constant average rate² (although potentially with substantial nonsystematic noise) for the duration of the trial (i.e., from stimulus onset to response onset; Brown & Heathcote, 2008; Ratcliff, 1978; for alternatives, see Stine et al., 2020). In practice, this means that stimuli should provide a constant input to the evidence-accumulation process (i.e., the stimulus representation should not change in strength or sign over the course of a trial; P.-S. Lee & Sewell, 2024; P. L. Smith & Lilburn, 2020). For thresholds, within-trials stationarity means that thresholds are set before stimulus onset and do not change in value during a trial. This means that individuals are assumed to keep the same cognitive control/speed-accuracy trade-off settings throughout a decision and to not increase or decrease in caution during a trial (for models that allow dynamic thresholds, see Hawkins et al., 2015; P. L. Smith & Ratcliff, 2022; Voskuilen et al., 2016). Misapplying the models to tasks involving nonstationary evidence or thresholds can lead to biased or misleading parameter estimates.
Across trials, the standard application of EAMs assumes within-conditions stationarity, which refers to the assumption that model parameters do not change systematically across trials (of the same type) within a condition. This assumption is important for model fitting, which relies on pooling information across trials of the same type. Theoretically, the assumption is that trials of the same type are independent measurements of the same underlying process (generated from the same cognitive settings), which can include random (nonsystematic) trial-to-trial variability. Empirically, the expectation is that participant performance is stable for the duration of the experiment³ (e.g., RT distributions do not systematically change in shape or scale over time). As with within-trials stationarity, failing to account for systematic across-trials trends can compromise model inferences and interpretability.
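One simple way to look for systematic across-trials trends is to split a condition's trials into consecutive blocks and compare RT quantiles and accuracy across blocks. The sketch below illustrates this descriptive check; the function name `blockwise_summary`, the choice of four blocks, the quantiles, and the simulated practice effect are assumptions made for illustration and are not a formal test of stationarity.

```python
import numpy as np

def blockwise_summary(rt, correct, n_blocks=4, quantiles=(0.1, 0.5, 0.9)):
    """Summarize RT quantiles and accuracy in consecutive blocks of trials.

    `rt` and `correct` are assumed to be in presentation order for ONE
    participant and ONE condition. Returns a list of dicts, one per block.
    """
    rt = np.asarray(rt, dtype=float)
    correct = np.asarray(correct, dtype=float)
    rows = []
    for b, idx in enumerate(np.array_split(np.arange(rt.size), n_blocks)):
        rows.append({
            "block": b + 1,
            "n": int(idx.size),
            "accuracy": float(correct[idx].mean()),
            "rt_quantiles": np.quantile(rt[idx], quantiles).round(3).tolist(),
        })
    return rows

# Hypothetical data with a practice effect: RTs speed up across 400 trials.
rng = np.random.default_rng(1)
trial = np.arange(400)
rt = 0.3 + rng.gamma(shape=2.0, scale=0.15, size=400) + 0.2 * np.exp(-trial / 80)
correct = rng.random(400) < 0.85
summary = blockwise_summary(rt, correct)
for row in summary:
    print(row)
```

If the quantiles or accuracy drift steadily from the first block to the last, pooling all trials as if they shared one set of parameters may be unwarranted.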
The reviewed EAM assumptions have implications for the (choice-RT) data to which they are applied. For one, the standard models can predict only positively skewed RT distributions. This follows from the geometry of EAMs with constant (flat) response thresholds, whereby equal differences in accumulation rate are projected as unequal differences in decision time (see Fig. 2; Ratcliff & McKoon, 2008). In practice, this means that the models can fit only empirical RT distributions with characteristic positive skewness and cannot fit RT distributions that are normal or negatively skewed in shape (Evans, Hawkins, & Brown, 2020). For example, short response deadlines can induce collapsing bounds (thresholds that decrease with the passage of time), which produce more normally distributed RTs. Ignoring issues of skewness can lead to biases in parameter estimation (Verdonck & Tuerlinckx, 2016). The section Planning Tasks That Meet EAM Assumptions contains advice on ensuring data satisfy this assumption.
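The geometric argument can be illustrated with a deliberately simplified case: a noise-free linear accumulator with rate v and flat threshold b reaches threshold at time b/v. The sketch below shows that equally spaced rates yield unequally spaced decision times and that symmetric variability in rate yields right-skewed times. The threshold, rate values, and the cutoff used to keep rates positive are arbitrary assumptions, and within-trial noise is ignored.

```python
import numpy as np

threshold = 1.0                               # arbitrary evidence units
rates = np.array([1.0, 2.0, 3.0, 4.0])        # equally spaced accumulation rates
times = threshold / rates                     # time for a linear accumulator to hit threshold
print("decision times:", times)               # 1.0, 0.5, 0.333..., 0.25
print("successive gaps:", -np.diff(times))    # 0.5, 0.1667, 0.0833: unequal

def skewness(x):
    """Sample skewness (third standardized moment)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Symmetric trial-to-trial variability in rate yields right-skewed times.
rng = np.random.default_rng(2)
v = rng.normal(loc=2.5, scale=0.5, size=20000)
v = v[v > 0.5]                                # keep clearly positive rates (illustrative cutoff)
t = threshold / v
print("skew of rates:", round(skewness(v), 2), " skew of times:", round(skewness(t), 2))
```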
Finally, EAMs assume the data are free of contaminant processes. That is, data come from an evidence-accumulation process and not some other process, such as random guessing or nonresponding (Ratcliff, 1993; Ratcliff & Tuerlinckx, 2002), that can lead to biased parameter estimates if ignored. Strategies for identifying and accounting for contaminants are discussed throughout the article.
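One generic way to express the idea of accounting for contaminants is a mixture likelihood in which each RT comes from the decision model with high probability and otherwise from a broad contaminant distribution. The sketch below assumes a uniform contaminant distribution and, purely as a stand-in, a shifted-lognormal density in place of a real EAM likelihood; `mixture_loglik`, `placeholder_density`, and all parameter values are hypothetical.

```python
import numpy as np

def placeholder_density(rt, shift=0.2, mu=-1.0, sigma=0.5):
    """Stand-in shifted-lognormal RT density; NOT an EAM likelihood."""
    rt = np.asarray(rt, dtype=float)
    x = np.maximum(rt - shift, 1e-12)          # avoid log(0) at or below the shift
    pdf = np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (x * sigma * np.sqrt(2 * np.pi))
    return np.where(rt > shift, pdf, 0.0)

def mixture_loglik(rt, model_density, p_contaminant, rt_min, rt_max):
    """Log-likelihood when each RT comes from the decision model with
    probability (1 - p_contaminant) and otherwise from a uniform
    contaminant distribution on [rt_min, rt_max]."""
    rt = np.asarray(rt, dtype=float)
    uniform = np.where((rt >= rt_min) & (rt <= rt_max), 1.0 / (rt_max - rt_min), 0.0)
    dens = (1.0 - p_contaminant) * model_density(rt) + p_contaminant * uniform
    return float(np.sum(np.log(np.maximum(dens, 1e-300))))

# Hypothetical data: 200 model-generated RTs plus three fast guesses and one slow lapse.
rng = np.random.default_rng(3)
clean = 0.2 + rng.lognormal(mean=-1.0, sigma=0.5, size=200)
rt = np.concatenate([clean, [0.08, 0.10, 0.12, 2.8]])

ll_pure = mixture_loglik(rt, placeholder_density, 0.0, 0.0, 3.0)
ll_mix = mixture_loglik(rt, placeholder_density, 0.02, 0.0, 3.0)
print("log-likelihood without contaminant process:", round(ll_pure, 1))
print("log-likelihood with 2% contaminant process:", round(ll_mix, 1))
```

Without a contaminant component, the three responses faster than the shift have zero density under the stand-in model and dominate the fit; with the mixture, they are absorbed by the contaminant process instead.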
With this background in place, in the remainder of this article, we step through the components of a typical EAM-study workflow, giving advice on how to plan and conduct a robust study. In doing so, we regularly refer back to the model assumptions outlined in this section.
Planning Research Questions for EAM Analysis
Before the task-design and modeling process can begin, the researcher must first decide whether an EAM analysis is the appropriate tool to answer the research question. Although EAMs have many uses (Crüwell, Stefan, & Evans, 2019), our present focus is on using EAMs as a cognitive-measurement model (Donkin & Brown, 2018; M. D. Lee et al., 2019; see also, Batchelder, 2010, 2016; Batchelder & Riefer, 1999; J. B. Smith & Batchelder, 2010). Measurement studies typically focus on interpreting the parameters of an existing “off-the-shelf” EAM that is taken a priori to adequately characterize the processes individuals use to perform the target task (e.g., Huang-Pollock et al., 2017; Janczyk & Lerche, 2019; Klauer et al., 2007; Ratcliff & Rouder, 2000; Ratcliff, Thapar, & McKoon, 2004). To understand what kinds of research questions are suitable for EAM analysis, it is helpful to consider the output of an EAM that has been fit to participant data. For each participant, the model provides parameters that represent measurements of that individual’s latent cognitive settings (e.g., accumulation rate, threshold, bias, and nondecision time). Additional population-level parameters characterizing group differences can be obtained using hierarchical-modeling approaches (e.g., Chávez De la Peña & Vandekerckhove, 2023; Gunawan et al., 2020; Heathcote et al., 2019; Stevenson, Innes, et al., 2024; Wiecki et al., 2013). Changes in cognitive processes are quantified by changes in the values of this set of model parameters. 
Therefore, suitable research questions involve assessing how model parameters differ within or between groups (e.g., Ratcliff et al., 2003; Steyvers et al., 2019), individuals (e.g., Evans, Steyvers, & Brown, 2018), or experimental conditions/treatments (e.g., Heathcote, Loft, & Remington, 2015; Ratcliff et al., 2003; Strickland et al., 2023) and how parameters relate to other individual-level covariates (e.g., eye tracking, Cavanagh et al., 2014; Fiedler & Glöckner, 2012; Krajbich & Rangel, 2011; neurophysiological measures, e.g., electroencephalogram, magnetoencephalography, and functional MRI, Forstmann, Tittgemeyer, et al., 2011; Harris & Hutcherson, 2022; Nunez et al., 2023, 2024; Turner et al., 2013; Turner, Forstmann, & Steyvers, 2019; Turner, Palestro, et al., 2019). EAMs allow multiple data sources to be analyzed under a common model and results interpreted in terms of well-supported cognitive theory (Forstmann, Wagenmakers, et al., 2011).
For an EAM analysis to be useful, questions must map to the cognitive processes represented by EAM parameters (i.e., accumulation rate, threshold, bias, and nondecision time). Questions are typically posed in a similar manner to traditional confirmatory experimental research, in which the goal is to understand the effect of particular experimental manipulations, treatments/interventions, or clinical disorders on some measured outcome variable (Donkin & Brown, 2018). For example, in a series of studies, Ratcliff and collaborators asked whether age-related slowing is due to slower evidence accumulation (cognitive-impairment hypothesis), higher thresholds (conservative-responding hypothesis), or longer nondecision time (physical-slowing hypothesis; Ratcliff et al., 2003, 2006; Ratcliff, Thapar, Gomez, & McKoon, 2004; Ratcliff, Thapar, & McKoon, 2004; Thapar et al., 2003). This question presents a clear test of three competing hypotheses that can be instantiated in EAMs and evaluated. To give an example involving a subject-level covariate, Forstmann et al. (2008) asked whether cue-induced threshold adjustments (a measure of top-down cognitive control) are correlated with fMRI blood oxygen level dependent (BOLD) signal in the striatum and presupplementary motor area (two structures hypothesized to be involved in such adaptive control). This question, posed in terms of individual-differences correlations, presents a clear test of the relationship between the model-based measure (magnitude of threshold adjustment) and the hypothesized neural covariates (striatal and presupplementary motor area BOLD signal). Operationalizing questions in this way is necessary to develop clear, testable hypotheses, that is, hypotheses that can be instantiated in an EAM and subjected to model comparison and evaluation. We explore this topic further in the section Mapping Experimental Manipulations to EAM Parameters.
Unsuitable questions for standard EAMs are those that involve violations of their assumptions. For example, asking questions about how parameters change from trial to trial (violating within-conditions stationarity) requires extended models/methods that allow trial-wise parameter estimation (Boehm et al., 2014; Ho et al., 2012; Van Maanen et al., 2011) or the ability to specify systematic across-trials trends (e.g., by linking parameters to trial-wise covariates; Stevenson, Donzallaz, et al., 2024; Wiecki et al., 2013). Likewise, asking questions about how parameters change within a trial (e.g., “How does late-presented evidence affect the accumulation process during a decision?” or “Do thresholds decrease as the response deadline approaches?”) requires (computationally expensive) models with dynamic evidence or threshold mechanisms (Diederich, 2024; Hawkins, Forstmann, et al., 2015; Holmes & Trueblood, 2018; P. L. Smith & Ratcliff, 2022). Formulating good research questions requires a sound understanding of both EAM theory and the target domain. The EAM literature, especially measurement studies in which the focus is on interpreting parameter effects (e.g., Boag et al., 2023; Evans, Steyvers, & Brown, 2018; Huang-Pollock et al., 2017; Ratcliff & Rouder, 2000; Weigard et al., 2018), can be a rich source of ideas and help build intuition for developing suitable research questions. Getting the research question right is important because it ultimately dictates many experimental design and analysis choices (e.g., sample-size planning and whether to use hierarchical or independent-subjects approaches).
Planning Tasks That Meet EAM Assumptions
Having formulated a research question, focus turns to designing an experimental task that will be informative for the research question and that meets the processing assumptions of EAMs. In this section, we discuss EAM-specific constraints on task design, relating each back to the relevant EAM assumptions. Our advice is intended to assist researchers in designing tasks that satisfy the assumptions of the basic EAM framework but allows for judicious deviations, such as when the focus is on developing a new model (Crüwell, Stefan, & Evans, 2019).
One decision, one response
As noted earlier, EAMs assume decisions involve a single, uninterrupted evidence-accumulation stage, culminating in a discrete response. Evidence is assumed to accumulate continuously from stimulus onset to the response. EAM-appropriate tasks need clearly defined stimulus and response onsets that do not overlap with processes outside of the response window. Stimulus evidence should be of fixed strength within a trial. Ideally, stimuli should be presented for the entire duration of the response window (from stimulus onset to response initiation) to ensure there is a consistent input to the decision process until a response is initiated. Stimuli can be briefly flashed (e.g., as occurs in visual-signal-detection paradigms), provided it can be assumed that a durable representation of the stimulus is maintained in visual short-term memory for the time needed to make a decision (Ratcliff & Rouder, 2000; P. L. Smith & Ratcliff, 2009). Ultimately, the primary concern here is to ensure one can assume a consistent (stationary) input to the evidence-accumulation process for the duration of the decision.
Furthermore, each decision should culminate in a single, discrete response chosen from a set of two or more choice options. This is because in standard EAMs, evidence always terminates at a single, discrete response threshold. Consequently, tasks that involve open-ended response options (e.g., free-recall tasks) or the possibility of submitting more than one response during a single trial (e.g., change-of-mind tasks, C. Stone et al., 2022; double-response paradigms, Evans, Dutilh, et al., 2020) require extensions beyond standard EAMs.
Within-trials stationarity
EAMs assume that the parameter settings of the model do not change systematically during a decision. Specifically, EAMs assume that threshold and bias settings are unaltered in response to stimulus features used to make a decision, and most assume that evidence accumulates at a constant average rate from stimulus onset to response onset. When designing an experiment, researchers should be aware that any information intended to affect threshold or bias settings must be presented before the onset of the stimulus. Likewise, any information not intended to affect decision-making and cognitive-control settings should be kept outside of the response window. With regard to experimental design, this means that the evidence input to the decision process should not change systematically during a trial, meaning that decision-relevant stimulus features (or their representation in visual short-term memory) should be constant throughout a trial (P. L. Smith & Lilburn, 2020). For example, stimuli in a perceptual decision-making task should not change in brightness or contrast partway through a trial because this would require a corresponding change in accumulation rate. Tasks involving dynamic evidence can be modeled using (computationally expensive) extensions to the basic EAMs (e.g., Diederich, 2024; Diederich & Trueblood, 2018; Holmes et al., 2016; Holmes & Trueblood, 2018).
Within-conditions stationarity
EAMs also assume stationarity across trials of the same type within a condition. This is because model fitting requires trials of the same type to be treated as independent observations of the same latent cognitive settings. Aside from nonsystematic trial-to-trial variation accounted for in the model’s across-trials variability parameters, there should be no systematic changes in threshold or mean accumulation rate across trials of the same type. This assumption is important for statistical power and measurement precision, which rely on information pooled across many observations (trials; P. L. Smith & Little, 2018). When designing experiments, researchers should attempt to minimize factors that could cause parameters to change systematically across trials. For example, accumulation rates are known to increase with learning, initially rising steeply before tapering off to a stable asymptotic level (e.g., Fontanesi et al., 2019; Miletić et al., 2021; Pedersen et al., 2017; Sewell et al., 2019). Rates can also decrease with fatigue or inattention/task disengagement (Huang-Pollock et al., 2020; Ratcliff & Van Dongen, 2011; Walsh et al., 2017). Thresholds may also decrease over the course of an experiment because of participants becoming impatient and trading accuracy for speed in an effort to complete the experiment sooner (Hawkins et al., 2012; Larson & Hawkins, 2023).
Trial-to-trial variability is unavoidable (Aschenbrenner et al., 2018; Rouder et al., 2023) because of noise at many levels, including the noise inherent in neural systems (Faisal et al., 2008; P. L. Smith, 2010, 2023) and dynamic fluctuations in cognitive and affective states (Miletić et al., 2024; Schurr et al., 2024). Standard EAMs account for such noise sources via their across-trials variability parameters. Nevertheless, researchers should take reasonable measures to ensure such variability is kept as nonsystematic as possible.
Stimuli
Stimuli provide the critical input to the decision-making process. Stimuli supply the evidence on which decisions are based and largely determine the cognitive domain engaged by a task. For example, in a psychophysics task, evidence might be based on the objective luminance values of stimuli (e.g., Sewell & Smith, 2012; van Ravenzwaaij et al., 2020). By contrast, evidence in a preferential-choice task could be subjective value elicited by viewing images of food items (e.g., Huseynov & Palma, 2021; Milosavljevic et al., 2010). In working-memory and categorization tasks, evidence may derive from the strength with which items are activated in memory (Ratcliff, 1978; Shadlen & Shohamy, 2016) or the strength of learned associations between stimuli and expected response outcomes (Dutilh et al., 2009; Dutilh, Krypotos, & Wagenmakers, 2011; Miletić et al., 2021; Sewell et al., 2019). As noted, the evidence supplied by stimuli should be fixed within a trial (i.e., unchanging in strength for the duration of the trial) to provide a consistent (stationary) input to the decision process.
Across trials or blocks, stimuli are often the target of manipulations designed to affect the signal-to-noise ratio of the evidence entering the decision process (e.g., discriminability, difficulty). When designing experiments, it is important to calibrate stimuli to be of an appropriate difficulty level. This is because EAMs can struggle to fit floor effects⁴ (chance-level accuracy) and ceiling effects (e.g., near-perfect accuracy with too few errors; Dutilh, Wagenmakers, et al., 2011). Floor effects occur when a task is too difficult and usually mean that participants cannot discriminate between choice options. Consequently, participants may be using a guessing strategy rather than sampling evidence, as assumed in EAMs. By contrast, ceiling effects occur when a task is too easy, causing very few incorrect responses to be observed. As we discuss in the section on sample-size planning, it is important to elicit enough error observations for reliable model estimation (Lüken et al., 2025). We recommend calibrating stimuli to produce error rates of 5% to 35% (Dutilh, Wagenmakers, et al., 2011; Lüken et al., 2025; Ratcliff & Childers, 2015). Calibration can be achieved through pilot testing or via more advanced optimization methods that perform individualized calibration based on task performance (e.g., methods based on “adaptive staircase” algorithms; Myung et al., 2009, 2013; J. Yang et al., 2021). Individual calibration is especially important in individual-differences research because floor/ceiling effects compress the observed across-persons variability (Draheim et al., 2021). To prevent the calibration scheme from introducing undesirable nonstationarities across trials (e.g., because of increasing/decreasing difficulty), calibration can be done in a pretest training phase before the experimental trials proper.
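During pilot testing, the recommended error-rate range can be checked with a few lines of code. The sketch below flags conditions whose error rate falls outside 5% to 35%; the function name `error_rate_flags`, the condition labels, and the pilot data are hypothetical and serve only to illustrate the check.

```python
import numpy as np

def error_rate_flags(condition, correct, lower=0.05, upper=0.35):
    """Return {condition: (error_rate, within_range)} for each condition label."""
    condition = np.asarray(condition)
    correct = np.asarray(correct, dtype=bool)
    out = {}
    for label in np.unique(condition):
        err = float(1.0 - correct[condition == label].mean())
        out[str(label)] = (round(err, 3), lower <= err <= upper)
    return out

# Hypothetical pilot data: 200 trials per difficulty level.
condition = np.repeat(["easy", "medium", "hard"], 200)
correct = np.concatenate([
    np.r_[np.ones(198), np.zeros(2)],     # easy: 1% errors (ceiling)
    np.r_[np.ones(170), np.zeros(30)],    # medium: 15% errors
    np.r_[np.ones(104), np.zeros(96)],    # hard: 48% errors (near chance)
])
flags = error_rate_flags(condition, correct)
print(flags)
```

In this example, only the medium condition falls inside the range; the easy and hard conditions would need recalibration before the main experiment.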
Response modality
Standard EAMs assume that the onset of the response coincides with termination of the evidence-accumulation process (Fig. 2). That is, the decision and motor-response processes occur sequentially (i.e., a motor response is initiated only once a decision has been reached). Thus, we recommend using response modalities with a sharp, clearly defined response onset and short execution times, such as manual key presses.
The most critical consideration here is that the chosen modality should enable the precise measurement of RT. For most purposes, a standard computer keyboard provides sufficiently precise RT measurements (up to the limit of the internal refresh rate). However, highly precise (i.e., to the millisecond) timing can be obtained with specialized computer systems and precision-timing software/apparatus (Bridges et al., 2020; Plant et al., 2002).
Mapping Experimental Manipulations to EAM Parameters
It is important to establish clear theoretical links between experimental manipulations (e.g., speed vs. accuracy instructions, task difficulty, or working-memory load) and their expected effects on EAM parameters and data. Understanding the behavioral signatures of experimental manipulations can give confidence that a manipulation is working as intended. Becoming familiar with EAM theory and reading published EAM studies can help build intuition for which model parameters are likely to be affected by a given manipulation. Much of the key theoretical EAM literature and a variety of application studies are cited in this article.
Not all EAM parameters will be relevant to every analysis. For example, a researcher studying consumer-choice preferences (e.g., preference for one product over another) may be uninterested in nondecision time but be highly interested in using accumulation rates to measure preference strength and starting point (or thresholds) to measure choice biases (Busemeyer & Townsend, 1993; Cerracchio et al., 2023; Krajbich et al., 2012, 2015). In addition, it is common practice to not estimate variability parameters (e.g., by fixing them to zero) unless they are needed to account for certain data features (e.g., fast guesses; Lerche & Voss, 2016; Ratcliff & Rouder, 1998).
Below, we briefly review common manipulations that have been used to selectively influence each standard EAM parameter (see Box 1). The primary uses of each model parameter, common mappings to experimental manipulations, and expected effects on behavior are summarized in Table 2.
Box 1. Selective Influence
Table 2. Mapping Experimental Manipulations to Evidence-Accumulation-Model Parameters
Note: RT = response time.
Stimulus-response (decision outcome) mapping
Some tasks will have stimulus-response mappings that naturally correspond to objectively correct or incorrect decision outcomes (e.g., pressing the left arrow key in response to a predominantly left-moving stimulus). However, standard EAMs can easily accommodate tasks with subjective or probabilistic stimulus-response mappings (e.g., preferential-choice tasks, probabilistic-categorization tasks, and tasks with probabilistic rewards/payoffs; D. G. Lee & Usher, 2023; Milosavljevic et al., 2010; Sewell & Stallman, 2020). In relative-evidence models (e.g., Ratcliff, 1978; Wagenmakers et al., 2007), which are limited to two-choice tasks, each threshold is mapped to one of the possible response options, and a single accumulation rate measures the difference in evidence between options. However, in race models (e.g., Brown & Heathcote, 2008; Tillman et al., 2020), which can accommodate an arbitrary number of response options, each latent response is assigned an accumulator with its own threshold and an accumulation rate representing the absolute evidence for that response. Race models can also instantiate more complex decision rules (e.g., AND and OR rules) used for combining multiple stimulus attributes into a final decision (e.g., Fific et al., 2010; Little et al., 2018; van Ravenzwaaij et al., 2020). Thresholds should be mapped to the latent response options in the task (e.g., “left/right” or “bright/dark”) rather than to the observed outcome of decisions (e.g., “correct/incorrect”).
Accumulation rate
Accumulation rates measure the strength (signal-to-noise ratio) of evidence extracted from the stimulus (e.g., salience, preference strength, or discriminability relative to other choice options; Gold & Shadlen, 2007; Palmer et al., 2005; Ratcliff & McKoon, 2008). Rates are sensitive to the processing abilities of the decision maker (Schmiedek et al., 2007) and the amount of attention or cognitive resources deployed to the task (i.e., the degree to which the participant is paying attention; Boag et al., 2023; Castro et al., 2019; Eidels et al., 2010). Holding one constant allows measurement of the other (e.g., for equivalent stimuli, different rates reflect differences in attention/capacity).
In a typical experiment, rates are used to account for manipulations of evidence strength (e.g., low- vs. high-discriminability stimuli), attention or processing capacity, and task difficulty, that is, manipulations affecting how easily stimuli are perceived and/or processed (Mulder et al., 2014; Palmer et al., 2005; Ratcliff & McKoon, 2008; P. L. Smith et al., 2015; P. L. Smith & Sewell, 2013). This is accomplished by estimating a different accumulation rate for each difficulty level (Ratcliff & Rouder, 1998). Behaviorally, a faster accumulation rate predicts faster responses and fewer errors, and a slower rate predicts the converse (Ratcliff & McKoon, 2008). Accumulation is typically faster for easier decisions (Ratcliff & Rouder, 1998) and faster for responses associated with higher reward or subjective value (Busemeyer & Townsend, 1993; Krajbich et al., 2012, 2015). Rates track the strength of associative relationships learned via feedback (e.g., Fontanesi et al., 2019; Miletić et al., 2021; Pedersen et al., 2017; Sewell et al., 2019) and the activation strength of items retrieved from memory (Ratcliff, 1978; Ratcliff & McKoon, 1988). Rates are also the locus of attentional or processing biases (sometimes called “stimulus bias”; White & Poldrack, 2014), that is, differences in accumulation between stimuli matched in perceptual discriminability. Furthermore, these mappings hold in more complex naturalistic tasks (for a review, see Boag et al., 2023).
Threshold
Thresholds are a locus of proactive cognitive control (Strickland et al., 2018). Thresholds control the amount of evidence needed to trigger a response and thus measure response caution or speed-accuracy settings. As noted earlier, EAMs assume thresholds are set in advance of stimulus onset (i.e., not adjusted based on features of the current stimulus because it would be circular for the threshold used to identify a stimulus to depend on knowing the identity of that stimulus). In other words, thresholds cannot be altered based on information that was unknown before the trial began (Donkin, Averell, et al., 2009). Consequently, manipulations intended to affect threshold settings must be presented before the onset of a trial/stimulus. This is typically achieved using pretrial cues or blocked instructions (e.g., Forstmann et al., 2008; Katsimpokis et al., 2020), the aim of which is to allow participants to make strategic adjustments (e.g., adopt different threshold settings) before encountering the upcoming stimulus.
In a typical experiment, thresholds are used to explain speed-accuracy trade-off effects whereby individuals set lower thresholds when less time is available and higher thresholds when more time is available (Bogacz et al., 2010; Evans, Hawkins, & Brown, 2020; Forstmann et al., 2008; Frazier & Yu, 2007; Heitz & Schall, 2012; Katsimpokis et al., 2020; Rae et al., 2014; Ratcliff & McKoon, 2008). Behaviorally, higher thresholds predict slower, more accurate decisions, and lower thresholds predict faster, less accurate decisions (Ratcliff & Rouder, 1998). Thresholds are further implicated in posterror slowing (Damaso, Williams, & Heathcote, 2022), a kind of trial-to-trial speed-accuracy trade-off (Larson & Hawkins, 2023).
Response biases
Racing-accumulator models measure biases for one response over another by allowing competing response options to have different thresholds. For example, participants set lower thresholds for prioritized/more rewarding/higher frequency responses and higher thresholds for nonprioritized/less rewarding/lower frequency responses (Boag, Strickland, Loft, & Heathcote, 2019; Mulder et al., 2012; Strickland et al., 2018; Trueblood et al., 2021; for a review, see Cerracchio et al., 2023). By contrast, relative-evidence models measure response biases by assessing how the starting point of the evidence-accumulation process deviates from the neutral midpoint between the two response boundaries (Leite & Ratcliff, 2011; Ratcliff & McKoon, 2008; see also, Edwards, 1965). These mechanisms are mathematically equivalent in some models (e.g., Brown & Heathcote, 2008). Like thresholds, the evidence starting point is assumed to be under the control of the decision maker, and manipulations intended to affect starting point must be presented before stimulus onset. Behaviorally, deviating from the neutral midpoint makes responses for the favored (closer) threshold faster and more accurate while making responses for the nonfavored (further) threshold slower and less accurate (Ratcliff & McKoon, 2008; for a review, see Cerracchio et al., 2023). In experiments, starting-point biases have been used to measure biases in police officers’ decisions to shoot lighter-skinned versus darker-skinned suspects (Johnson et al., 2018, 2021; Pleskac et al., 2018) and to quantify individuals’ tendency to identify items as weapons versus nonweapons (Todd et al., 2021). Starting point has also been used to understand how various response biases are affected by factors such as heightened time pressure (Chen & Krajbich, 2018), changes in stimulus prevalence (Trueblood et al., 2021; see also, Leite & Ratcliff, 2011), and payoff structure (Leite & Ratcliff, 2011).
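As an illustrative sketch (not taken from the studies cited above), the effect of a starting-point bias on choice proportions in a relative-evidence model can be computed from the textbook absorption probability of a simple diffusion process. The sketch assumes no across-trials variability, and the function name `p_upper` and all parameter values are hypothetical choices made only for illustration.

```python
import math

def p_upper(v, a, z, s=1.0):
    """Probability that a simple diffusion process is absorbed at the upper
    boundary a, given drift v, start point z (0 < z < a), and within-trial
    noise s. Assumes no across-trials variability (a simplification)."""
    if v == 0:
        return z / a  # driftless case: probability equals the relative start point
    k = 2.0 * v / s**2
    return (1.0 - math.exp(-k * z)) / (1.0 - math.exp(-k * a))

if __name__ == "__main__":
    a = 1.0  # hypothetical boundary separation
    for v in (1.0, -1.0):       # stimulus favors the upper (v > 0) or lower (v < 0) response
        for rel_z in (0.5, 0.7):  # neutral start vs. start biased toward the upper boundary
            p = p_upper(v, a, rel_z * a)
            print(f"v = {v:+.1f}, z/a = {rel_z}: P(upper response) = {p:.3f}")
```

Under these assumptions, moving the start point toward the upper boundary raises the proportion of upper-boundary responses regardless of which response the stimulus favors, which is the choice-proportion signature of a response bias. The sketch does not show the accompanying RT effects.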
Nondecision time
Nondecision time measures the sum of the time taken to encode the stimulus (at stimulus onset) and time to produce the motor response (at response onset; Bompas et al., 2023). Nondecision time is sensitive to the difficulty of both the encoding and motor-responding stages. For example, it is sensitive to changes in low-level visual features of stimuli and the complexity or force required to produce the motor response (Bompas et al., 2023; Gomez et al., 2015; Ho et al., 2009; Sandry & Ricker, 2022; Servant et al., 2016; Voss et al., 2004; Weindel, Gajdos, et al., 2021). Although encoding and motor RT cannot be separately identified in standard EAMs, they may be disentangled experimentally (e.g., by holding stimulus properties constant while manipulating response modality or vice versa). Empirically, nondecision time shifts RT distributions in time without affecting accuracy or the shape or scale of the distribution (Ratcliff & McKoon, 2008).
In experimental settings, nondecision time has been used to measure potential differences in encoding or motor-response production (Ratcliff, Thapar, Gomez, & McKoon, 2004; Van Maanen et al., 2016). For example, Ratcliff, Thapar, and McKoon (2004) found that older participants produced reliably slower nondecision times than did younger participants (see also, Van Maanen et al., 2016). Saccadic eye movements have been found to elicit reliably shorter nondecision times than manual-key-press responses (Bompas et al., 2023; Ho et al., 2009). Nondecision time has also been found to be shorter under conditions of heightened time pressure (e.g., Rae et al., 2014; Ratcliff, 2006), potentially reflecting a tendency to encode stimuli less deeply when under time pressure (e.g., Palada et al., 2018, 2019). However, we caution that nondecision time is sometimes estimated less reliably than other EAM parameters (Lerche & Voss, 2018) and can be highly variable across individuals, conditions, and tasks (Bompas et al., 2023). Refining EAMs’ account of nondecision time is a topic of ongoing model-development work (Bompas et al., 2023; Kelly et al., 2021; Servant et al., 2021).
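The behavioral signatures described above for accumulation rate, threshold, and nondecision time can be illustrated with a minimal sketch. It assumes an unbiased simple diffusion model without across-trials variability, for which accuracy and mean RT have closed forms (the expressions used in EZ-diffusion; Wagenmakers et al., 2007). The function name `predict` and the parameter values are hypothetical.

```python
import math

def predict(v, a, t0, s=1.0):
    """Closed-form accuracy and mean RT for an unbiased simple diffusion model
    (start point a/2, no across-trials variability). Requires v != 0."""
    accuracy = 1.0 / (1.0 + math.exp(-v * a / s**2))
    mean_decision_time = (a / (2.0 * v)) * math.tanh(v * a / (2.0 * s**2))
    return accuracy, t0 + mean_decision_time

if __name__ == "__main__":
    # Hypothetical parameter sets: each changes one parameter from the baseline.
    conditions = {
        "baseline":             dict(v=1.5, a=1.2, t0=0.3),
        "higher rate":          dict(v=2.5, a=1.2, t0=0.3),
        "higher threshold":     dict(v=1.5, a=1.8, t0=0.3),
        "longer nondecision":   dict(v=1.5, a=1.2, t0=0.4),
    }
    for label, pars in conditions.items():
        acc, mrt = predict(**pars)
        print(f"{label:>20}: accuracy = {acc:.3f}, mean RT = {mrt:.3f} s")
```

In this simplified setting, a higher rate yields faster and more accurate responses, a higher threshold yields slower and more accurate responses, and a longer nondecision time shifts mean RT by the same amount while leaving accuracy unchanged.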
Variability parameters
The across-trials variability parameters (i.e., in accumulation rate, starting point, and nondecision time) are less frequently used for measurement or inference. Rather, they allow the model to account for a number of commonly observed features of behavioral data, such as crossovers in the speed of correct and incorrect responses (Ratcliff, 2013; Ratcliff & Rouder, 1998; Ratcliff & Smith, 2004). Variability is a ubiquitous feature of human cognitive systems, which continuously update attention, memory, and executive-control settings in response to incoming information (Braver et al., 2021; Damaso et al., 2020; Miletić et al., 2024). Such adaptation occurs at multiple timescales, including seconds (e.g., conflict resolution and reactive control over individual decisions), minutes (e.g., short-term learning and proactive cognitive control), and hours/days (e.g., longer-term learning and memory consolidation, fluctuations in attentional and affective state), and is the focus of ongoing model-development work (e.g., Aschenbrenner et al., 2018; Miletić et al., 2024; Steyvers et al., 2019; Wientjes & Holroyd, 2025). In the standard models, some of this variability is (nonsystematically) accounted for in across-trials variability parameters.
Across-trials variability in accumulation rate can account for slow errors (Ratcliff, 1978). This is because trials with faster than average accumulation produce fast responses with very few errors. By contrast, trials with slower than average accumulation produce slow, error-prone responses, which together results in disproportionately many slow errors (Lerche & Voss, 2016). In experiments, across-trials rate variability can be used to account for manipulations affecting variability in evidence extracted from the stimulus (Starns, 2014; Yap et al., 2012) and to identify factors that lead to increased uncertainty (greater variability) in decision-making (Palada et al., 2020; Starns, 2014).
Across-trials variability in starting point can account for fast errors (Laming, 1968). This is because when the accumulation process starts closer to the threshold for the incorrect latent response, errors become both faster and more frequent. By contrast, when accumulation starts closer to the threshold for the correct latent response, errors become slower and less frequent, resulting in disproportionately many fast errors (Lerche & Voss, 2016). Including starting-point variability alongside rate variability allows the model to account for interactions (crossovers or reversals) between correct and incorrect RTs (e.g., fast errors in some cells and slow errors in others; Ratcliff et al., 1999; Ratcliff & Rouder, 1998; Wagenmakers, Ratcliff, et al., 2008). Starting-point variability may be used to account for factors affecting uncertainty (variability) in prior beliefs or expectations (Mulder et al., 2012).
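The two mechanisms just described can be checked by simulation. The following is a minimal sketch, assuming a two-boundary diffusion process approximated with a simple Euler scheme, normally distributed across-trials rate variability (`sv`), and uniformly distributed starting-point variability (`sz`). All function names and parameter values are hypothetical, and the step size trades accuracy for speed.

```python
import numpy as np

def simulate(n, v, a, t0, sv=0.0, sz=0.0, s=1.0, dt=0.001, max_t=5.0, seed=0):
    """Euler simulation of n diffusion trials; upper boundary = correct response."""
    rng = np.random.default_rng(seed)
    drift = v + sv * rng.standard_normal(n)      # across-trials rate variability
    x = a / 2.0 + sz * (rng.random(n) - 0.5)     # uniform start-point variability
    rt = np.full(n, np.nan)                      # NaN marks trials that never finished
    correct = np.zeros(n, dtype=bool)
    active = np.ones(n, dtype=bool)
    t = 0.0
    while active.any() and t < max_t:
        t += dt
        idx = np.flatnonzero(active)
        x[idx] += drift[idx] * dt + s * np.sqrt(dt) * rng.standard_normal(idx.size)
        up = x[idx] >= a
        low = x[idx] <= 0.0
        done = idx[up | low]
        rt[done] = t0 + t
        correct[idx[up]] = True
        active[done] = False
    return rt, correct

def mean_rts(rt, correct):
    """Mean RT of correct and of error responses among finished trials."""
    finished = ~np.isnan(rt)
    return rt[finished & correct].mean(), rt[finished & ~correct].mean()

if __name__ == "__main__":
    for label, kwargs in [("rate variability", dict(sv=1.0)),
                          ("start-point variability", dict(sz=0.8))]:
        rt, correct = simulate(20000, v=1.5, a=1.2, t0=0.3, seed=1, **kwargs)
        m_correct, m_error = mean_rts(rt, correct)
        print(f"{label}: mean correct RT = {m_correct:.3f} s, "
              f"mean error RT = {m_error:.3f} s")
```

With these illustrative settings, rate variability should produce errors that are slower than correct responses on average, and starting-point variability should produce errors that are faster, matching the qualitative account above.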
Across-trials variability in nondecision time can account for changes in the leading edge (e.g., the 0.1 quantile) of RT distributions (e.g., Ratcliff, Thapar, Gomez, & McKoon, 2004; Ratcliff & Tuerlinckx, 2002), including those caused by contaminant processes, such as fast guesses (Ratcliff & Tuerlinckx, 2002). This is because nondecision-time variability fattens the tails (i.e., decreases skew) of RT distributions (Lerche & Voss, 2016), making the model more robust to fast contaminants. Models with nondecision-time variability predict a shallower onset of responding than models without. Empirically, nondecision-time variability accounts for variability in encoding and motor-response production (Bompas et al., 2023).
We reiterate that across-trials variability parameters tend to be estimated less reliably than other parameters (Boehm et al., 2018; Lerche et al., 2017; Lerche & Voss, 2016; van Ravenzwaaij & Oberauer, 2009; Vandekerckhove & Tuerlinckx, 2007; Yap et al., 2012). Moreover, at least one rate-variability parameter is typically held fixed in at least one design cell to satisfy the scaling property of EAMs (Donkin, Brown, & Heathcote, 2009). In racing-accumulator models, a common choice is to set across-trials rate variability to 0.1 or 1. Although some work suggests that differences in across-trials variability in accumulation rate and/or nondecision time can be recovered reasonably reliably in some cases (e.g., Boehm et al., 2018; Starns & Ratcliff, 2014), there is evidence suggesting variability parameters trade off with other model parameters and can exhibit nonstationarity over the course of an experiment (e.g., Dutilh, Krypotos, & Wagenmakers, 2011; Evans & Hawkins, 2019; Evans, Steyvers, & Brown, 2018). Estimation and reliability issues with variability parameters can be mitigated by constraining the parameters (e.g., by constraining variability parameters to a single estimated value or removing them entirely by setting variability to zero; Boehm et al., 2018; Lerche & Voss, 2016; van Ravenzwaaij et al., 2017). Moreover, some EAM software simply does not allow for the estimation of across-trials variability (e.g., EZ-diffusion; Dutilh et al., 2013; Grasman et al., 2009; Schmiedek et al., 2007; Souza & Frischkorn, 2023; van Ravenzwaaij et al., 2012, 2017; Wagenmakers et al., 2007; Wagenmakers, van der Maas, et al., 2008) or requires variability to be fixed across participants (e.g., HDDM; Wiecki et al., 2013).
Overall, researchers should exercise caution if answering the research question relies on inferences based on potentially unreliable variability parameters (or turn to extended models that explicitly account for systematic across-trials trends; Miletić et al., 2024; Wientjes & Holroyd, 2025).
In the next section, we outline the elements of a single trial in a typical EAM experiment and considerations for task design.
Trial Structure and Event Timing
One of the most important design considerations for model plausibility is how trials are structured in terms of the timing of events within a trial (e.g., cue and stimulus presentation). For an EAM to be a plausible model of the true decision process, the sequence and timing of events within a trial must match the processing assumptions of the model. A typical trial structure/sequence of a standard EAM is illustrated in Figure 3. In the following subsections, we discuss the components that make up a typical trial, their purpose, and common pitfalls surrounding their implementation. Note that the advice presented here allows for judicious deviations, such as when developing a model or using an extended EAM with different processing assumptions.

Fig. 3. Structure of a typical decision trial for an EAM-appropriate task. The trial begins with a cue (e.g., instructing the participant to emphasize response speed or accuracy), followed by a fixation interval of variable (unpredictable) duration. Next, a stimulus is presented (stimulus onset) continuously until either the participant makes a response (response onset) or the trial time limit expires (which produces a nonresponse that is truncated from the RT distribution). Feedback indicating that the participant responded too slowly is then displayed. Finally, an intertrial interval gives the participant time to prepare for the next trial. The theoretical accumulation process is illustrated by the dotted arrow. Observing the outcome of many such decision trials produces a distribution of RTs with a characteristic positive skew (the density of which is illustrated in gray at the top of the figure). The presentation durations shown are suggestions only and should be calibrated to the specific task. EAM = evidence-accumulation model; RT = response time.
Cue
In some studies, trials begin with a cue that indicates how participants should perform the upcoming trial (Fig. 3). The cue interval is an opportunity to present information intended to affect the decision maker’s processing and cognitive-control settings (e.g., thresholds and response biases) before the decision. For example, presenting the text “Fast!” or “Accurate!” may signal that participants should respond either quickly or accurately, respectively (e.g., Forstmann et al., 2008; Katsimpokis et al., 2020). Other kinds of cues may direct participants’ gaze to a particular item or spatial location (allowing comparison of attended vs. unattended performance; e.g., Liu et al., 2009; Logan et al., 2023; P. L. Smith et al., 2015) or provide prior information intended to set up biases in the decision maker before encountering the stimulus (Karayanidis et al., 2009; Mulder et al., 2012; Trueblood et al., 2021).
Fixation
Fixation intervals serve the twofold purpose of concentrating participants’ eye gaze/attention on the location of the upcoming stimulus (usually at the center of the display) and allowing time for residual processes (e.g., those stemming from the preceding cue or trial) to complete and return to baseline to avoid process overlap (Pashler, 1994). In a typical fixation interval, participants fixate their gaze on a centrally presented fixation cross while awaiting the stimulus. One issue that can arise with fixed-duration fixation intervals is that participants learn to anticipate the onset of the upcoming stimulus. Participants’ expectation of the onset of the next trial increases over time according to a hazard function (Luce, 1991). This can lead some participants to prematurely sample evidence in anticipation of the stimulus, resulting in disproportionate anticipatory responses for longer intervals (Oswal et al., 2007), which produces biased estimates of nondecision time (Jepma et al., 2012). To avoid this problem, we recommend sampling the duration of fixation intervals from an exponential (or pseudoexponential) distribution (e.g., with mean around 0.7 s), whose approximately constant hazard rate makes stimulus onset hard to anticipate, and restricting the sampled durations to a bounded range (e.g., about 0.2–5 s) to rule out implausibly short intervals and excessively long waiting times (e.g., Evans & Hawkins, 2019).
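The recommendation above can be sketched in a few lines. The example below is one possible implementation, assuming a shifted exponential distribution truncated to the suggested range and sampled by inverting its cumulative distribution function. The function name `sample_fixation_durations` and the default values (a 0.2-s minimum plus an exponential component with a 0.5-s scale, giving a mean near 0.7 s) are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np

def sample_fixation_durations(n, lower=0.2, upper=5.0, scale=0.5, seed=None):
    """Draw n fixation durations (in seconds) from an exponential distribution
    shifted to start at `lower` and truncated at `upper`.

    Uses inverse-CDF sampling, so no draws are rejected. The defaults are
    illustrative assumptions, not values fixed by any standard.
    """
    rng = np.random.default_rng(seed)
    u = rng.random(n)  # uniform draws on [0, 1)
    # Probability mass of the untruncated exponential that lies below `upper`.
    max_cdf = 1.0 - np.exp(-(upper - lower) / scale)
    return lower - scale * np.log1p(-u * max_cdf)

if __name__ == "__main__":
    durations = sample_fixation_durations(10_000, seed=2024)
    print(f"min = {durations.min():.3f} s, max = {durations.max():.3f} s, "
          f"mean = {durations.mean():.3f} s")
```

Pregenerating the full list of durations before the session begins (rather than sampling online) makes it easy to check the realized range and mean and to store the exact schedule alongside the data.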
Stimulus onset
Following the fixation interval, the stimulus is presented. EAMs assume that stimulus onset represents the beginning of the evidence-accumulation process (plus the time taken to encode the stimulus; Bompas et al., 2023). This structural constraint makes certain tasks unsuitable for EAMs. For example, interrogation paradigms are inappropriate for standard EAMs because the decision maker first views (and presumably accumulates evidence about) the stimulus but must wait until prompted to give a response (Bogacz et al., 2006; Ratcliff, 2006). One reason this is problematic is that the evidence-accumulation process may terminate before the response prompt is presented, making it unclear what cognitive processes might have occurred in the intervening time (or what the observed RT is measuring). In sum, for the standard framework, it is crucial that the evidence-accumulation process runs uninterrupted from the onset of the stimulus until the response.
Response window
The onset of the stimulus marks the beginning of the response window, which ends either when a response is submitted or upon expiry of a predefined deadline. The response window should allow enough time for participants to process and respond to the stimuli and thus should be calibrated to the RT (and RT variability) of actual participants performing the proposed task. An inappropriately calibrated response window can lead participants to adopt undesirable/contaminant response strategies that are not accounted for in standard EAMs. For example, an excessively short response window can lead to a high proportion of fast guesses, cause slower responses to be truncated from the tail of RT distributions (responses that fall outside of the response window, as illustrated in Fig. 3), or induce collapsing bounds (response thresholds that decrease as the deadline approaches). These processes can produce RT distributions that lack the characteristic positive skew and thus cannot be fit by standard EAMs (Evans, Hawkins, & Brown, 2020). Ignoring these issues can compromise parameter estimation (Verdonck & Tuerlinckx, 2016). We recommend pilot testing novel tasks to find an appropriate response window because the optimal window will depend on the task.
Another consideration is whether the average duration of decisions in the experimental task is appropriate for EAMs. Participants making perceptual decisions about simple psychophysical stimuli can usually respond within a 1.5-s response window. By contrast, tasks typical of cognitive psychology (e.g., lexical decision, preferential choice) may require up to 4 s to respond (Glickman & Usher, 2019), and more complex naturalistic tasks can take even longer (e.g., up to 10 s; Boag et al., 2023; Boehm et al., 2021). It is sometimes advised that standard EAMs be applied only to relatively rapid choice tasks (e.g., mean RT < 1.5 s; Ratcliff & McKoon, 2008; Ratcliff, Thapar, Gomez, & McKoon, 2004). This is intended to ensure that the assumption of a single continuous evidence-accumulation process is upheld because violations of the single-stage assumption become increasingly plausible for decisions that unfold over longer timescales. If longer decisions do in fact involve different underlying processes, such as multiple processing stages, then they may not be accurately represented by a standard single-stage EAM, rendering the model difficult to interpret (Heathcote, Brown, & Wagenmakers, 2015).
Nevertheless, some work suggests that standard EAMs can be a valid measurement model of more complex or naturalistic decisions that unfold over longer timescales (Aschenbrenner et al., 2016, Experiment 2; Boag et al., 2023; Boehm et al., 2021; Glickman & Usher, 2019; Lerche & Voss, 2019). This work found that standard models provided good fits and that experimental manipulations affected model parameters in the same way as in studies with shorter RTs (e.g., task difficulty and stimulus discriminability effects mapped to accumulation rates; speed-accuracy trade-off, cognitive control, and bias effects mapped to thresholds and starting point).
When designing a novel task, researchers should consider whether the assumption of a single uninterrupted accumulation process is appropriate, especially in tasks with longer RTs. If not, the researcher may turn to extended EAMs designed to account for phenomena associated with longer RTs, such as models that allow for slow contaminant processes (e.g., Dolan et al., 2002; Ratcliff & Tuerlinckx, 2002), randomly slow or nonterminating accumulation processes (Damaso, Castro, et al., 2022; Howard et al., 2020; Tillman et al., 2017), off-task mind wandering (Hawkins et al., 2019; Hawkins, Mittner, et al., 2015), and multiple processing stages (Little, 2012; Provost & Heathcote, 2015; Shahar et al., 2019). Overall, researchers should be guided by what makes sense in terms of cognitive theory (scientific judgment) and the model’s ability to capture important features of the data (model fit and selection; Navarro, 2019).
Postresponse interval
The postresponse interval signals that the trial has ended and a response recorded. The postresponse interval provides an opportunity to display corrective feedback. For example, excessively fast or slow responding can be discouraged by displaying a warning message (e.g., “Too fast/slow!”) following such responses. Warning messages can be accompanied by a timeout interval that delays the onset of the next trial (e.g., by 1–5 s) to further encourage compliance (e.g., Evans & Hawkins, 2019). Such feedback can help to keep mean RT within the response window.
Providing feedback on performance (e.g., accuracy or points/rewards for correct responses) on experimental trials may introduce nonstationarities (e.g., posterror speeding/slowing and learning effects) that are not accounted for in the standard EAM framework (Miletić et al., 2020, 2021). Aside from during training (see section Task Training), we advise against providing performance feedback for experimental trials unless explicitly modeling learning with an extended EAM (e.g., Fontanesi et al., 2019; Miletić et al., 2021; Pedersen et al., 2017). However, because providing no feedback at all may cause participants to become disengaged from the task, it is possible to give summarized performance feedback (e.g., mean accuracy or overall points scored) following each block of trials. “Gamifying” experiments in this way can increase participant engagement (Lumsden et al., 2016) while avoiding introducing undesirable nonstationarities associated with trial-to-trial feedback (e.g., systematic learning and adaptation effects). Moreover, such performance summaries can double as an intermittent check that participants are paying attention and complying with task instructions.
Intertrial interval
“Intertrial interval” refers to the time between trials. The intertrial interval gives participants time to “reset” and concentrate their attention on the upcoming trial. The intertrial interval is designed to prevent process overlap (Pashler, 1994) and minimize other potential sources of proactive interference, such as sequential or carryover effects stemming from events that occurred on previous trials (e.g., Aschenbrenner et al., 2018; Balota et al., 2018; Jones et al., 2013). Avoiding such interference is important for preserving stationarity both within and across trials (i.e., for treating all trials within a condition as independent observations of the same underlying process). Intertrial intervals can be open-ended (e.g., such that the participant must press a key to initiate the next trial), allowing for self-paced breaks, or can automatically progress to the next trial after some delay.
Sample-Size Planning
Trial numbers
Researchers should plan to collect enough observations (trials) per participant in each experimental condition for reliable modeling. Doing so is important because sufficient data are required to obtain precise and unbiased individual measurement of the EAM parameters representing each participant’s latent decision processes (P. L. Smith & Little, 2018).
Much methodological work has explored how the number of trials used in fitting affects the reliability (e.g., bias, variability, and recoverability) of EAM parameters (Alexandrowicz & Gula, 2020; Lerche et al., 2017; Lerche & Voss, 2016; Lüken et al., 2025; Ratcliff & Childers, 2015; Ratcliff & Tuerlinckx, 2002; van Ravenzwaaij & Oberauer, 2009; Vandekerckhove & Tuerlinckx, 2007; Visser & Poessé, 2017; Wagenmakers et al., 2007; Wiecki et al., 2013). These studies broadly agree that around 200 trials per condition is sufficient to achieve reasonably precise and unbiased individual-level measurement. In general, more trials afford greater measurement precision and thus greater power to detect effects because (Gaussian) measurement error (the standard error) decreases in proportion to the square root of the number of measurements (trials; Ratcliff & Tuerlinckx, 2002). However, there are diminishing returns; simulations suggest there is little to gain from collecting more than about 500 trials per condition (Lerche et al., 2017).
When determining the number of trials to collect, a critical question is whether there will be sufficient observations of the least frequently occurring trial type in the data (Donkin, Brown, & Heathcote, 2011). In most designs, the rarest kind of trial is incorrect responses to the most easily discriminable stimuli (i.e., incorrect responses to decisions typically made with high accuracy). However, other infrequent stimulus-response combinations are possible, such as those that arise in paradigms involving the presentation of a rare stimulus or event on a small subset of trials (e.g., Einstein & McDaniel, 1990; Loughnane et al., 2019; Strickland et al., 2018). Lüken et al. (2025) recommended obtaining error rates of at least 5% to ensure reliable parameter estimation with the standard diffusion (Ratcliff, 1978) and linear ballistic-accumulator models (Brown & Heathcote, 2008). With 200 trials, a 5% error rate corresponds to 10 observations of incorrect responses. This number should be taken as a minimum: 10 error observations provided just enough information about the shape of the error RT distribution to identify the model. Fitting to data with smaller error rates (e.g., data with ceiling effects) is risky because the greater estimation uncertainty can make some parameters (e.g., rates and thresholds) unidentifiable (Lüken et al., 2025).
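The arithmetic above concerns the expected number of errors. Because the observed number varies from participant to participant, a design that expects 10 errors will often deliver fewer. The following sketch quantifies this under the simplifying assumption (ours, for illustration) that errors occur independently with a constant probability, so that error counts are binomially distributed. The function names and the 95% assurance level are hypothetical.

```python
import math

def prob_at_least(n_trials, error_rate, min_errors):
    """Binomial probability of observing at least `min_errors` errors in
    `n_trials` independent trials with a constant error probability."""
    p_fewer = sum(
        math.comb(n_trials, k) * error_rate**k * (1.0 - error_rate) ** (n_trials - k)
        for k in range(min_errors)
    )
    return 1.0 - p_fewer

def trials_needed(error_rate, min_errors=10, assurance=0.95, max_trials=100_000):
    """Smallest trial count for which at least `min_errors` errors are observed
    with probability >= `assurance` (both targets are illustrative choices)."""
    for n in range(min_errors, max_trials + 1):
        if prob_at_least(n, error_rate, min_errors) >= assurance:
            return n
    raise ValueError("no trial count up to max_trials meets the target")

if __name__ == "__main__":
    print(f"Expected errors with 200 trials at 5%: {200 * 0.05:.0f}")
    print(f"P(at least 10 errors): {prob_at_least(200, 0.05, 10):.2f}")
    print(f"Trials needed for 95% assurance: {trials_needed(0.05)}")
```

A calculation of this kind is only a first check; it says nothing about whether the resulting error RT distribution is informative enough for the parameters of interest.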
We caution that although 10 error observations may provide the bare minimum constraint needed to identify the models (e.g., by locating the mean of the incorrect RT distribution), many more observations are needed to make reliable inferences about parameters that rely on information about the variance and skewness of the error RT distribution (e.g., the starting-point and rate-variability parameters for the incorrect latent response). Parameter-recovery simulations can help determine how many trials (and participants) are needed to reliably measure a given effect (Heathcote, Brown, & Wagenmakers, 2015; White et al., 2018; R. C. Wilson & Collins, 2019). The simulation procedure is as follows: (a) Set model parameters to values representative of the effect of interest, (b) simulate many synthetic participants (data sets), (c) fit the model to the synthetic data, and (d) assess how well the recovered parameters match the known data-generating values. Doing this for a range of effect sizes and different numbers of trials and participants can help determine the most appropriate design for achieving a desired level of measurement precision (see section Parameter Recovery).
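Steps (a) through (d) can be sketched as follows. To keep the example short and fast, it simulates an unbiased simple diffusion process with an Euler approximation and swaps in the closed-form EZ-diffusion estimator (Wagenmakers et al., 2007) for a full model fit. A real recovery study should use the model and estimation method planned for the actual analysis. The function names, the parameter values, and the scaling choice `s = 1` are assumptions made for illustration.

```python
import numpy as np

def simulate_ddm(n, v, a, t0, rng, s=1.0, dt=0.001, max_t=10.0):
    """Euler approximation of an unbiased diffusion process (start point a/2)."""
    x = np.full(n, a / 2.0)
    rt = np.full(n, np.nan)                  # NaN marks trials that never finished
    correct = np.zeros(n, dtype=bool)
    active = np.ones(n, dtype=bool)
    t = 0.0
    while active.any() and t < max_t:
        t += dt
        idx = np.flatnonzero(active)
        x[idx] += v * dt + s * np.sqrt(dt) * rng.standard_normal(idx.size)
        up = x[idx] >= a
        low = x[idx] <= 0.0
        done = idx[up | low]
        rt[done] = t0 + t
        correct[idx[up]] = True
        active[done] = False
    return rt, correct

def ez_diffusion(pc, vrt, mrt, s=1.0):
    """EZ-diffusion estimates of rate, threshold separation, and nondecision time.
    Requires 0.5 < pc < 1; no edge correction is applied in this sketch."""
    logit = np.log(pc / (1.0 - pc))
    x = logit * (logit * pc**2 - logit * pc + pc - 0.5) / vrt
    v = s * x**0.25
    a = s**2 * logit / v
    y = -v * a / s**2
    mean_decision_time = (a / (2.0 * v)) * (1.0 - np.exp(y)) / (1.0 + np.exp(y))
    return v, a, mrt - mean_decision_time

def recovery_study(n_participants, n_trials, v=1.5, a=1.2, t0=0.3, seed=0):
    """Return an (n_participants, 3) array of recovered (v, a, t0) estimates."""
    rng = np.random.default_rng(seed)
    # Steps (a) and (b): fix known parameters and simulate synthetic participants.
    rt, correct = simulate_ddm(n_participants * n_trials, v, a, t0, rng)
    rt = rt.reshape(n_participants, n_trials)
    correct = correct.reshape(n_participants, n_trials)
    # Step (c): estimate parameters separately for each synthetic participant,
    # using accuracy plus the mean and variance of correct RTs.
    estimates = [
        ez_diffusion(ok.mean(), rts[ok].var(ddof=1), rts[ok].mean())
        for rts, ok in zip(rt, correct)
    ]
    return np.array(estimates)

if __name__ == "__main__":
    truth = np.array([1.5, 1.2, 0.3])
    for n_trials in (100, 200, 500):
        est = recovery_study(n_participants=50, n_trials=n_trials, seed=n_trials)
        # Step (d): compare recovered values with the data-generating values.
        bias = est.mean(axis=0) - truth
        spread = est.std(axis=0, ddof=1)
        print(f"{n_trials} trials: bias (v, a, t0) = {np.round(bias, 3)}, "
              f"SD = {np.round(spread, 3)}")
```

Repeating the loop over candidate trial and participant numbers (and over plausible effect sizes) gives a rough picture of how estimation bias and variability change with the design. Note that the Euler step size introduces a small discretization bias of its own, which should not be mistaken for estimator bias.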
Clearly, there is no one-size-fits-all solution to trial-number planning because it depends on the goals of the researcher, the size of the target effect, and properties of the model. Several thousand observations may be needed to make reliable inferences about across-trials variability parameters or parameters associated with rare responses (e.g., the accumulation rate of the incorrect latent response). By contrast, for simple models (e.g., in which only one parameter varies over conditions and all others are fixed), reliable estimation can be achieved with fewer trials per condition (e.g., 50–100 trials). In general, we recommend researchers use parameter-recovery simulations to guide trial-number planning (Heathcote, Brown, & Wagenmakers, 2015).
When thousands of trials are required, the experiment may need to be spread across multiple testing sessions. Long-duration experiments have several pitfalls that, if ignored, can compromise an EAM analysis. For example, participants tend to become less engaged (e.g., because of fatigue or boredom) the longer a task goes on (Cunningham et al., 2000; Krimsky et al., 2017). Disengaged or impatient participants may “satisfice” by processing stimuli less deeply or lowering their response criteria over time to get through an experiment more quickly (Boehm et al., 2016; Evans et al., 2019; Hawkins et al., 2012). Disengagement can introduce speeding trends and other autocorrelation effects in the data (Gong & Huskey, 2023). In addition, longer experiments that span multiple days tend to have higher rates of participant attrition and may exacerbate already high day-to-day variability in individuals’ cognitive and affective state (Schurr et al., 2024; Stevenson, Innes, et al., 2024). Such effects are problematic because standard EAMs assume data are free of such nonstationarities. These issues can be mitigated by giving participants frequent breaks and using appropriate counterbalancing and trial-randomization schemes to experimentally control for time-on-task effects, such as learning and fatigue.
Finally, we note that collecting a large number of trials is not always feasible. This is true for fMRI research (in which scanner time is costly and scarce; Basten et al., 2010; Forstmann et al., 2008), when studying certain clinical populations (Matzke, Hughes, et al., 2017), or when reanalyzing existing data. If the use of sparse data is unavoidable, there are several techniques that can improve EAM estimation properties. These include using hierarchical models (e.g., Stevenson, Donzallaz, et al., 2024), using more informative priors (i.e., for Bayesian analyses, see M. D. Lee & Vanpaemel, 2018; Matzke et al., 2020; Tran et al., 2021), constructing simpler models (e.g., by not estimating across-trials variability parameters; Boehm et al., 2018; Lerche & Voss, 2016; Ratcliff & Childers, 2015), holding some parameters constant over conditions (Donkin, Brown, & Heathcote, 2011), and using alternative (simpler) model formulations that require only information about error proportions rather than error RT (e.g., Ludwig et al., 2009). We recommend checking the results obtained from simpler models against those obtained from a model in which the constraints are not applied (Vandekerckhove & Tuerlinckx, 2007). If both approaches arrive at the same conclusions, this provides evidence it is safe to interpret the simpler model. If not, one may need to adjust the experimental design and sampling plan until reliable model estimation is achieved.
Participant numbers
A further consideration concerning data suitability is how many participants to include in the sample. The number of participants determines how well findings generalize to the wider population and contributes to power and measurement precision in certain analyses (e.g., individual-differences correlations; Button et al., 2013; Rouder & Haaf, 2019). Studies investigating individual differences (e.g., examining correlations between EAM parameters and individual-level covariates) typically need many participants (e.g., 80 or more), each performing at least a moderate number of trials (e.g., around 200), to obtain sufficiently low measurement noise to reliably characterize potentially subtle individual differences (Rouder et al., 2023; Rouder & Haaf, 2018). Between-subjects and mixed designs also typically require many participants for sufficiently powered between-groups contrasts (e.g., Boag, Strickland, Loft, & Heathcote, 2019; Steyvers et al., 2019) and to precisely characterize the distribution of population-level parameters in hierarchical Bayesian analyses (M. D. Lee, 2011).
By contrast, studies seeking to reliably measure within-subjects effects without assessing individual differences (e.g., comparing parameters for the same individual between different conditions) typically use fewer participants (e.g., Ratcliff & Rouder, 1998), who each perform a large number (typically thousands) of trials to ensure high individual-measurement precision (Kolossa & Kopp, 2018; P. L. Smith & Little, 2018). An advantage of fully within-subjects designs is that the unit of replication is the individual participant rather than the whole study, meaning that each participant serves as an independent replication (validation) of the target effects (P. L. Smith & Little, 2018). Replication increases confidence that obtained effects are real and meaningful.
As with trial-number planning, we recommend conducting parameter-recovery simulations (based on different numbers of synthetic participants) to understand how many participants are needed to obtain a desired level of power or measurement precision for a proposed analysis (White et al., 2018).
Procedural Considerations
In this section, we discuss procedural considerations that can help bring participants (and the data they produce) in line with EAM assumptions. We consider task instructions, task training, and the testing environment.
Task instructions
Task instructions should be designed to maximize participant compliance with the task and minimize undesirable behaviors that may produce data unsuitable for EAMs. Undesirable behaviors may include fast guessing, mind wandering and inattention, waiting/delayed start-ups, random responding, and nonresponding (e.g., Cassey et al., 2014; Hawkins et al., 2019; Ratcliff & Kang, 2021). The foremost goal of instructions is to ensure that participants understand how to perform the task as intended by the researcher. This may involve explaining how a typical trial is structured and showing examples of different possible decision outcomes. Instructions should also explain key features of the task display, experiment-presentation software, and response apparatus.
It is good practice to confirm that participants understand the task instructions and provide reminders of key instructions before each testing block and following breaks or interruptions. Participant compliance/understanding can be assessed through verbal confirmation or by having participants demonstrate that they meet some performance criterion. As a generic strategy, we recommend instructing participants to respond to each trial as quickly and accurately as possible. This instruction is designed to ensure that decisions stem from a pure (uninterrupted) evidence-accumulation process, as assumed in the models. If using a manual-response modality, such as a computer keyboard, we suggest instructing participants to keep their fingers positioned directly above the response keys. This serves to reduce across-trials variability in nondecision time (potentially justifying its removal from the model) and ensures motor RT is as similar as possible for all participants (potentially justifying estimating a common nondecision time across participants). We recommend inviting participants to clarify any outstanding questions before commencing the experiment. Doing so may reduce the amount of data lost because of misunderstanding or noncompliance.
Task training
It is good practice to have participants perform practice/training trials before starting the experiment. Practice serves the twofold purpose of helping participants understand the task and stabilizing performance before the experimental trials. Reaching a stable level of performance is important for preserving within-conditions stationarity (i.e., that latent decision settings do not show systematic trends across trials). Identifying the point of stable performance is difficult because learning and adaptation may continue indefinitely for some tasks. Nevertheless, a common approach is to have participants practice until they reach some performance criterion (e.g., >80% accuracy). Providing performance feedback following training trials (e.g., indicating whether the response was correct or incorrect) can help to speed up the learning/performance-stabilization process. Nonstationarities and carryover effects (e.g., across trials and conditions) can be further minimized using appropriate randomization (e.g., randomizing the presentation of trials within a condition) and counterbalancing regimes (e.g., balancing the order of conditions within an experiment; Brooks, 2012; Lewis, 1989; Zeelenberg & Pecher, 2015).
The testing environment
The testing environment should encourage participants to perform the experimental task in the manner intended by the researcher. For most purposes, this means that participants are seated at a desk with a computer keyboard (or other response apparatus) and a display monitor positioned at a comfortable viewing distance. In application studies, which use various high-fidelity simulated and virtual-reality environments (e.g., Castro et al., 2022; Ratcliff & Strayer, 2014; Tillman et al., 2017; Vanunu & Ratcliff, 2023), participants should be positioned appropriately for the simulator environment. To facilitate engaged and attentive task performance, testing should be conducted in a quiet, comfortable space, free from distractions and interruptions. This is important for the EAM assumptions of model plausibility (i.e., that responses are generated by a single continuous evidence-accumulation process) and stationarity (i.e., that latent cognitive settings are stable over time).
Ideally, all participants would be tested in a single in-person session under identical conditions. However, if testing must be conducted across multiple sessions or in different locations, then conditions should be kept as consistent as possible between each session and testing location. Consistency of context is important because individuals are known to use different decision-making strategies in different contexts, such as when performing a task inside versus outside of an fMRI scanner (Forstmann et al., 2008; Van Maanen et al., 2016). Inside the scanner, participants adopted more conservative (higher) response thresholds and had longer nondecision times than they did in the out-of-scanner testing context (Van Maanen et al., 2016; see also, Forstmann et al., 2008; Gunawan et al., 2020). Ignoring or aggregating over such context effects may introduce undesirable data features (e.g., bimodal RT distributions) that may cause failures to fit and produce misleading or meaningless parameter estimates.
Online testing platforms (e.g., Mechanical Turk, Prolific, CloudResearch) give researchers the potential to collect data more quickly and affordably than is possible offline (Barbosa et al., 2023; Birnbaum, 2004). However, there are concerns that unsupervised online participants may generate poor-quality data (e.g., data that are noisy, nonstationary, or generated by contaminant processes; Douglas et al., 2023; Peer et al., 2021). These concerns arise because, lacking supervision, online participants may misunderstand task instructions or be inattentive/careless (Albert & Smilek, 2023; Aruguete et al., 2019) and because the remote online context makes it difficult for experimenters to identify and correct such problems (Reips, 2002). Ratcliff and Hendrickson (2021) conducted an online replication of several classic EAM studies and found that almost half of the participants in one experiment made a significant number of fast guesses (i.e., premature responses with chance accuracy) and/or produced RTs that were unstable (nonstationary) across the testing session. Nevertheless, inferences based on diffusion-model parameters were largely consistent with the prior in-person studies (Ratcliff & Hendrickson, 2021). We recommend approaching online testing with appropriate caution and avoiding mixed samples of online and in-person participants. For more detailed advice about constructing an online testing pipeline for EAM analyses, we refer readers to Gong and Huskey (2023).
If context effects are suspected, we recommend accounting for these effects in the EAM analyses. This can be done in most EAM software by including a “session” or “testing context” factor, allowing parameters to vary by context; fitting the model to data from each context separately; or building the additional contextual structure into a hierarchical model (e.g., Schurr et al., 2024; Stevenson, Innes, et al., 2024; Wall et al., 2021). Finding a close agreement across contexts may justify pooling data.
Collecting and Recording Data
EAM analysis requires certain information about each trial to be recorded. Such information is typically recorded by the software used to present the experiment and is saved in the form of a data table or comma-separated values file, in which each row represents a trial and each column represents an experimental or measured variable. At minimum, each row of the data should record the participant identifier, experimental condition, presented stimulus, submitted response, and RT.
Data should include the testing session (if more than one) and trial number, and it is good practice to record the timing of events, including stimulus and response onsets, and events such as cues, feedback/reward screens, and intertrial intervals. Although not everything will be used in modeling, the raw data should ideally allow one to reconstruct the trial composition and timing of the original experiment. Most EAM software will require as input a data frame of this approximate form (e.g., Heathcote et al., 2019; Stevenson, Donzallaz, et al., 2024). However, specific data- and file-formatting requirements will differ depending on the software/fitting routine used.
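As a rough illustration of the trial-level format described above, the following Python sketch writes and reads a small comma-separated values file with one row per trial. The column names and the example values are hypothetical and chosen only for illustration; a given EAM package may expect different names, codings, or units.

```python
import csv
import io

# Hypothetical column names; real EAM software may require different ones.
COLUMNS = ["participant", "session", "trial", "condition",
           "stimulus", "response", "rt"]

# Made-up example trials (RT in seconds), one dictionary per trial.
rows = [
    {"participant": "p01", "session": 1, "trial": 1, "condition": "speed",
     "stimulus": "left", "response": "left", "rt": 0.412},
    {"participant": "p01", "session": 1, "trial": 2, "condition": "speed",
     "stimulus": "right", "response": "left", "rt": 0.365},
    {"participant": "p01", "session": 1, "trial": 3, "condition": "accuracy",
     "stimulus": "right", "response": "right", "rt": 0.688},
]


def write_trials(rows, handle):
    """Write one row per trial with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_trials(handle):
    """Read trials back, restoring numeric types and deriving accuracy."""
    trials = []
    for row in csv.DictReader(handle):
        row["session"] = int(row["session"])
        row["trial"] = int(row["trial"])
        row["rt"] = float(row["rt"])
        # Accuracy is derived here by comparing response with stimulus.
        row["correct"] = row["response"] == row["stimulus"]
        trials.append(row)
    return trials


# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_trials(rows, buffer)
buffer.seek(0)
trials = read_trials(buffer)
```

Storing the raw stimulus and response, and deriving accuracy from them at read time, keeps the saved file close to what was actually presented and recorded.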
Screening Data Before EAM Analysis
Before EAM analysis, it is important to screen data for potentially undesirable features or distributional properties that may violate EAM assumptions. Undesirable data features can include outliers (excessively fast or slow RTs), nonresponses, truncated or misshapen RT distributions, and data from participants who did not comply with task instructions. These contaminant processes can compromise the validity of an EAM analysis. Specifically, failure to ensure data fidelity can introduce bias and uncertainty into parameter estimates (Ratcliff, 1993; Ratcliff & Tuerlinckx, 2002; Vandekerckhove & Tuerlinckx, 2007).
Outliers
Outliers are contaminant RTs that are generated by processes other than those that the researcher is interested in and that often lie outside the range of normal observations (Berger & Kiefer, 2021; Miller, 2023). Outliers can be the result of fast guesses (e.g., guesses made without properly inspecting the stimulus), slow guesses (e.g., guesses based on a failure to reach a decision), and delayed or failed start-ups (e.g., because of attentional lapses or “trigger failures”; Matzke, Love, & Heathcote, 2017; Vandekerckhove et al., 2008) or from the participant executing multiple runs of the process of interest (e.g., making multiple assessments before committing to a final response; Ratcliff, 1993; Vandekerckhove & Tuerlinckx, 2007).
The simplest and most common method for removing outliers is to define a range of acceptable RTs and remove any observations outside of this range. For fast outliers, it is common practice to remove RTs faster than about 150 to 300 ms (e.g., McVay & Kane, 2012; Rae et al., 2014; White et al., 2010). This practice is motivated by the argument that because nondecision time (for manual key presses) is typically on the order of 150 to 250 ms (Bompas et al., 2023), responses executed sooner than this are psychologically implausible because they allow too little time for the accumulation of evidence. A more principled method for removing fast guesses is motivated by the fact that fast guesses tend to have very short RTs and chance-level accuracy (Ratcliff & Kang, 2021; Ratcliff & Tuerlinckx, 2002; Vandekerckhove & Tuerlinckx, 2007). Consequently, one can sort RTs from fastest to slowest, find the RT at which accuracy rises above chance, and discard all RTs below the chance-performance point (Vandekerckhove & Tuerlinckx, 2007). The latter method is preferable, although differences between approaches will likely be small unless there is a significant proportion (e.g., >5%) of fast contaminants distorting the leading edges of the RT distributions (Ratcliff, 1993, 2013; Ratcliff & Tuerlinckx, 2002).
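One simple way to operationalize the accuracy-based cutoff is sketched below in Python. It is not the specific procedure of the cited authors. As assumptions made only for illustration, the sorted trials are split into consecutive blocks of a fixed size (`block`), and the cutoff is the fastest RT in the first block whose accuracy exceeds chance by a fixed `margin`. The function name, the default values, and the synthetic data are hypothetical.

```python
import numpy as np


def fast_guess_cutoff(rt, correct, block=50, chance=0.5, margin=0.1):
    """Return the RT at which block-wise accuracy first rises above chance.

    Trials are sorted from fastest to slowest and split into consecutive
    blocks of `block` trials (an incomplete final block is ignored). The
    cutoff is the fastest RT in the first block whose accuracy exceeds
    chance + margin. Block size and margin are illustrative assumptions.
    """
    rt = np.asarray(rt, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(rt)
    rt_sorted, acc_sorted = rt[order], correct[order]
    for b in range(rt_sorted.size // block):
        start = b * block
        if acc_sorted[start:start + block].mean() > chance + margin:
            return rt_sorted[start]
    return np.inf  # accuracy never rose above chance


# Synthetic data: 100 fast guesses at chance plus 900 regular trials.
guess_rt = np.linspace(0.05, 0.20, 100)
guess_correct = np.tile([1, 0], 50)        # exactly 50% correct
regular_rt = np.linspace(0.30, 1.20, 900)
regular_correct = np.ones(900)
regular_correct[::10] = 0                  # 90% correct

rt = np.concatenate([guess_rt, regular_rt])
correct = np.concatenate([guess_correct, regular_correct])

cutoff = fast_guess_cutoff(rt, correct)
keep = rt >= cutoff                        # trials retained for modeling
```

In real data, the resulting cutoff will depend on the block size and margin, so it is worth checking that conclusions do not hinge on these choices.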
For slow outliers, it is more common to define an upper cutoff based on some measure of observed RT variability or to simply not censor slow outliers unless there is clear evidence of their presence. For example, some researchers censor RTs beyond 3 times the interquartile range/1.349 above the mean (a measure of standard deviation that is robust to skew; e.g., Strickland et al., 2018). Because RT variability differs between individuals, the process of defining and removing slow outliers should be conducted separately for each participant (Miller, 2023). Furthermore, slow contaminants can be more difficult to detect than fast guesses, or even impossible, because they may be hidden within the range of normal RTs (Ratcliff, 1993; Ulrich & Miller, 1994; see also, Berger & Kiefer, 2021). For this reason, we urge caution when deciding whether to remove slow outliers.
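The cutoff described above (3 times the interquartile range/1.349 above the mean, computed separately for each participant) can be sketched in Python as follows. The function name, the participant labels, and the RT values are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np


def slow_cutoff(rt, n_sd=3.0):
    """Upper RT cutoff: mean + n_sd * (interquartile range / 1.349)."""
    rt = np.asarray(rt, dtype=float)
    q25, q75 = np.percentile(rt, [25, 75])
    robust_sd = (q75 - q25) / 1.349  # IQR-based estimate of the SD
    return rt.mean() + n_sd * robust_sd


# Made-up RTs (in seconds) for two hypothetical participants.
rt_by_participant = {
    "p01": [0.4, 0.5, 0.6, 0.7, 0.8],
    "p02": [0.4, 0.5, 0.6, 0.7, 0.8, 5.0],
}

# The cutoff is computed and applied separately for each participant.
flagged = {}
for pid, rts in rt_by_participant.items():
    rts = np.asarray(rts, dtype=float)
    flagged[pid] = rts[rts > slow_cutoff(rts)]
```

Here only the 5.0-s response of the second participant exceeds that participant's cutoff. Flagged trials are candidates for inspection rather than automatic removal, in keeping with the caution urged above.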
Nonresponses
Nonresponses occur when a participant fails to submit a response (e.g., because of missing the response deadline). Because nonresponses result in missing values for choice and RT, standard EAM likelihood functions cannot be evaluated for nonresponses. Nonresponses are thus uninformative in fitting standard EAMs and should be excluded before fitting the model. Some kinds of nonresponses, such as trigger failures (i.e., failures to run the evidence-accumulation process; Matzke, Love, & Heathcote, 2017), can be incorporated into standard EAMs via mixture modeling (Heathcote et al., 2019) or with the aid of specialized experimental designs (Verbruggen et al., 2019).
Misshapen or nonstationary RT distributions
The geometry of standard EAMs predicts positively skewed, stationary RT distributions free of truncation (i.e., without censorship of the leading or trailing edge of an RT distribution). EAMs struggle to capture the shape of truncated distributions because the truncation process is not accounted for in the model (for extended models that can handle truncated data, see Damaso, Castro, et al., 2022; Evans, Steyvers, & Brown, 2018). Likewise, standard EAMs cannot predict normally distributed or negatively skewed RT distributions (Evans, Hawkins, & Brown, 2020) or nonstationary distributions that change in shape or scale over time (Miletić et al., 2021; Walsh et al., 2017). We recommend checking that RT distributions are positively skewed, stationary, and free of truncation. Nonstationarity can be checked by testing the correlation between RT and trial number or dividing the RTs into sequential bins and testing for changes in mean RT/variance/skewness. Significant correlations or systematic between-bins differences suggest nonstationarity.
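The two descriptive checks mentioned above can be sketched in Python as follows: the correlation between RT and trial number, and summary statistics in sequential bins. This is a minimal sketch that assumes trials are stored in presentation order. The function name and the synthetic data are hypothetical, only bin means and standard deviations are computed, and a formal significance test would need to be added separately.

```python
import numpy as np


def stationarity_summary(rt, n_bins=4):
    """Descriptive checks for trends in RT over the course of a session."""
    rt = np.asarray(rt, dtype=float)
    trial = np.arange(1, rt.size + 1)
    bins = np.array_split(rt, n_bins)  # sequential, roughly equal-sized bins
    return {
        "r_trial_rt": np.corrcoef(trial, rt)[0, 1],
        "bin_means": np.array([b.mean() for b in bins]),
        "bin_sds": np.array([b.std(ddof=1) for b in bins]),
    }


# Synthetic examples: one stable series and one with a speeding trend.
trial = np.arange(1, 401)
rt_stable = 0.5 + 0.02 * np.sin(trial)
rt_speeding = 0.7 - 0.0005 * trial + 0.02 * np.sin(trial)

stable = stationarity_summary(rt_stable)
speeding = stationarity_summary(rt_speeding)
```

For the speeding series, the correlation is strongly negative and the bin means decrease steadily across the session, which is the pattern that would prompt a closer look at nonstationarity.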
Noncompliant participants
In addition to excluding individual contaminant trials, it is prudent to exclude data from participants who failed to comply with task instructions. The reason is that noncompliant participants are unlikely to have used the same cognitive strategies as compliant participants who performed the task as instructed. Consequently, standard EAMs may be a poor model of the unknown processes underlying noncompliant participants’ data. One indicator of noncompliance is chance-level performance. It is common practice to exclude data from participants with near-chance performance over all or part of the experiment (e.g., Stevenson, Innes, et al., 2024).
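One possible way to flag near-chance performance is an exact one-sided binomial test of each participant's number of correct responses against chance, sketched below in Python. The choice of test, the criterion `alpha`, the function name, and the example counts are assumptions made for illustration, not criteria prescribed by the cited studies.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """Exact probability of k or more successes in n Bernoulli(p) trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Made-up (number correct, number of trials) for hypothetical participants.
n_correct = {"p01": (80, 100), "p02": (52, 100), "p03": (140, 200)}

alpha = 0.05  # illustrative criterion; should be set and reported in advance

# Flag participants whose accuracy is not reliably above chance (0.5).
near_chance = [
    pid for pid, (k, n) in n_correct.items() if p_at_least(k, n) > alpha
]
```

With these counts, only the participant with 52 of 100 correct responses is flagged. The same check can be applied to blocks or sessions to detect near-chance performance over part of the experiment.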
Manipulation check
It is important to check that experimental manipulations produced the expected effects on accuracy and mean RT because it may not be worth modeling data that lack convincing behavioral effects (Palminteri et al., 2017; R. C. Wilson & Collins, 2019). Manipulation checks can be conducted by testing for differences in accuracy or mean RT using traditional or Bayesian linear models (e.g., mixed-effects regression models; Rouder et al., 2017). Bayesian approaches further allow for quantifying evidence for null effects using Bayes factors (Dienes, 2016; Lakens et al., 2020; Morey & Rouder, 2011). A lack of convincing behavioral effects could indicate that the experimental manipulations were weak or ineffective. Nevertheless, it is possible to find theoretically interesting latent effects that are masked in accuracy or RT (Lerche & Voss, 2020). We recommend pilot testing proposed tasks on a small sample of participants to ensure novel designs/manipulations are effective.
When it comes to data exclusions, it is our view that prevention is better than a cure. Good data are hard-won resources, and researchers should seek to minimize the amount lost to exclusions. We encourage researchers to take measures to minimize contaminants, such as fast guesses and nonresponses, and ensure participants comply with task instructions (e.g., by providing sufficient task training and penalizing undesirable behaviors). Encouraging compliance will help maximize the data quality and minimize the data lost to exclusions. All data exclusions and exclusion criteria should be reported transparently. Furthermore, it is good practice to check whether results are robust to exclusions (e.g., by conducting the same analysis with and without the exclusions applied).
Fitting EAMs to Data
Once satisfied the data are appropriate for an EAM, the process of model fitting can begin. There are numerous freely available software packages that enable fitting EAMs to data (e.g., Fengler et al., 2025; Heathcote et al., 2019; Innes et al., 2022; Pan et al., 2025; Stevenson, Donzallaz, et al., 2024; Vandekerckhove & Tuerlinckx, 2008; Voss et al., 2015; Voss & Voss, 2007; Wagenmakers et al., 2007; Wagenmakers, van der Maas, et al., 2008; Wiecki et al., 2013). Some fitting software takes a Bayesian approach, and some uses frequentist methods. Software differs on which models are supported and in how readily the software can be modified or extended (e.g., to support novel models). Furthermore, some software performs parameter estimation with limited additional functionality (e.g., Wagenmakers et al., 2007), whereas others offer comprehensive suites of functions for plotting model fits and evaluating critical aspects of the modeling process (e.g., parameter recovery and sampling diagnostics; Fengler et al., 2025; Heathcote et al., 2019; Stevenson, Donzallaz, et al., 2024; Wiecki et al., 2013). It is beyond the scope of this article to weigh the merits of various software packages and fitting methods. We direct interested readers to several detailed comparative studies (e.g., Alexandrowicz & Gula, 2020; Evans, 2019; Lerche et al., 2017; Ratcliff & Childers, 2015; van Ravenzwaaij & Oberauer, 2009) and existing comprehensive resources on evaluating and troubleshooting the model-fitting process (e.g., assessing convergence and diagnosing problems with sampling/fitting algorithms; Baribault & Collins, 2025; Gelman et al., 1995; Kruschke, 2014; McElreath, 2016).
We recommend fitting EAMs to the data of individuals rather than to group-aggregated data (e.g., data that have been collapsed or averaged across participants). This is because nonlinear models (e.g., EAMs) can produce misleading inferences when fit to aggregated data (Heathcote et al., 2015; see also, Averell & Heathcote, 2011; Brown & Heathcote, 2003; Heathcote et al., 2000). In some cases, one may want to fit just a single model, such as when the researcher has in mind a specific EAM and clear expectations for how model parameters should change. In this case, the researcher moves on to assessing absolute fit (i.e., how well the chosen model accounts for important data features) and then on to interpreting parameters. An alternative (and more common) situation is to have several plausible models of the data with the goal of finding the one that gives the best (e.g., most parsimonious) account of the data. Finding a good model involves assessing relative fit (i.e., how well a model accounts for data relative to other models) and absolute fit and evaluating the reliability of parameter effects. These are the topics of the next section.
Comparing and Evaluating EAMs
A thorough modeling analysis involves evaluating both relative fit (a model’s ability to account for data relative to other models) and absolute fit (a model’s absolute ability to capture the data). Model comparison enables researchers to evaluate competing cognitive theories against one another (Pitt et al., 2002), the goal being to find the simplest model that also fits the data well (Myung & Pitt, 1997). Model comparison is important because more flexible models will have an unfair advantage in fitting data more closely than a simpler model but will also tend to predict future data less well than a simpler model that captures only robust/reliable effects (Busemeyer & Wang, 2000; Cutting et al., 1992; Myung, 2000; Myung & Pitt, 1997; Roberts & Pashler, 2000; Yarkoni & Westfall, 2017).
Model comparison requires the researcher to propose a set of candidate models, each of which constitutes a different theory of decision-making, as instantiated in an EAM. For example, a researcher might be interested in whether participants’ slower RTs in one condition are due to slower accumulation, higher thresholds, or longer nondecision time (or some combination thereof). The researcher would then build models that explain the effect (i.e., slower RTs) using (the appropriate combination of) accumulation rates, thresholds, or nondecision time while holding the other parameters fixed. The proposed models may vary in complexity (e.g., the number of free parameters and how model parameters are combined in the model equations; Myung & Pitt, 1997) and which parameters are used to explain the target effects (e.g., whether a manipulation is assumed to affect accumulation rates or thresholds or both). Moreover, researchers may seek converging evidence by fitting the same theory instantiated in different EAM architectures (e.g., using relative-evidence and racing-accumulator models). Doing so helps to ensure results are not dependent on the specific choice of EAM (Singmann et al., 2018).
Relative fit
Relative fit can be assessed using model-comparison metrics (e.g., Akaike, 1974; Ando, 2007; Schwarz, 1978; Spiegelhalter et al., 2002; Watanabe & Opper, 2010) that account for both model fit and model complexity (for a review, see Evans, 2019). These metrics can identify the model that, out of the models considered, provides the most parsimonious account of the data (i.e., offers the best trade-off between fit and complexity). Methodological work indicates that even the relatively simple “parameter counting” metrics (e.g., Akaike information criterion, Akaike, 1974; Bayesian information criterion, Schwarz, 1978; deviance information criterion, Spiegelhalter et al., 2002) give similar results to “gold-standard” methods such as Bayes factors (Evans, 2019), which can be difficult to implement for complex cognitive models (Annis et al., 2019; Evans & Brown, 2018; Gronau, Heathcote, & Matzke, 2020) but are argued to give the optimal trade-off between flexibility and goodness of fit (Jeffreys, 1998; Kass & Raftery, 1995).
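To make the fit-versus-complexity trade-off concrete, the Python sketch below computes the Akaike and Bayesian information criteria from a maximized log-likelihood, the number of free parameters, and the number of observations. The model names, log-likelihoods, and parameter counts are invented for illustration; lower values indicate a more parsimonious account.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik


# Invented results for two hypothetical candidate models of one participant.
candidates = {
    "threshold_only": {"log_lik": -1000.0, "n_params": 5},
    "rate_and_threshold": {"log_lik": -998.0, "n_params": 8},
}
n_obs = 500  # number of trials contributing to the likelihood

scores = {
    name: {
        "AIC": aic(m["log_lik"], m["n_params"]),
        "BIC": bic(m["log_lik"], m["n_params"], n_obs),
    }
    for name, m in candidates.items()
}
best_aic = min(scores, key=lambda name: scores[name]["AIC"])
best_bic = min(scores, key=lambda name: scores[name]["BIC"])
```

In this invented example, the more flexible model fits slightly better (higher log-likelihood), but the improvement does not offset its three extra parameters, so both criteria prefer the simpler model.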
When multiple models are under consideration, we recommend the “bookending” strategy (M. D. Lee et al., 2019), in which the set of candidate models includes a minimally parameterized base model (in which all target effects are removed/held fixed) and a fully flexible top model (in which all target effects are included). This strategy helps establish upper and lower bounds on model complexity and find the model (from the set of candidate models) that provides the most parsimonious account of the data (Heathcote, Brown, & Wagenmakers, 2015; M. D. Lee et al., 2019; Shiffrin et al., 2008). Bookending helps to navigate the treacherous waters between underfitting (i.e., failing to capture important data features) and overfitting (i.e., capturing noise or idiosyncratic data features).
When participants have different preferred models, it can indicate the use of distinct cognitive strategies. For example, in a speed-accuracy trade-off experiment, some participants may be better fit by a model in which speed-accuracy instructions selectively influence thresholds, whereas others may prefer a model in which speed-accuracy instructions affect both rates and thresholds. In such cases, we recommend reporting the proportion of participants best represented by each model. We further encourage researchers to seek converging evidence (e.g., by comparing multiple complexity metrics) when choosing from among many possible models.
Absolute fit
One limitation of relative fit metrics is that there is no guarantee that a model selected in this manner actually provides a good account of the data (Box, 1976). The winner may be the best of a bad bunch. This limitation makes relative fit metrics inappropriate for falsifying models because they consider only the relative evidence for the winning model against (an incomplete set of) rival models while ignoring whether the winner gives an adequate account of the data (Palminteri et al., 2017). The ability to falsify models is important for scientific progress because it allows researchers to discard bad theories (models) and propose better ones that become the target of future falsification attempts (Popper, 2005). Falsification requires assessing the absolute fit of a model, that is, its ability to account for all the important trends in the data. A further reason assessing absolute fit is critical is that parameters derived from models that fail to capture important data features may be misleading or uninterpretable (Anscombe, 1973; Heathcote, Brown, & Wagenmakers, 2015).
Absolute fit is commonly assessed via visual inspection (Dutilh et al., 2019). In this method, model predictions are overlaid against empirical data (Heathcote, Brown, & Wagenmakers, 2015). At minimum, we recommend assessing model fit to both accuracy (response proportion) and RT in each cell of the design. Fit to RT should be assessed across the entire range of RTs (e.g., by plotting fits to the 0.1, 0.5, and 0.9 RT quantiles, which correspond to the leading edge, median, and tail of an RT distribution, respectively). Some researchers also check whether models capture higher moments (e.g., variance and skewness) of RT distributions (e.g., Evans, Hawkins, & Brown, 2020). Specific benchmarks for evaluating model fit include assessing whether key experimental or individual-differences effects in accuracy and RT are reproduced by the model. For example, one might check that a model captures differences in accuracy across levels of a difficulty manipulation or that it captures an increase in RT distribution skewness in a clinical group relative to control participants. Conducting a thorough evaluation of absolute fit can help diagnose potential sources of misfit and identify where a model might be mis-specified.
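The quantities that are typically overlaid in such plots can be computed as in the following Python sketch, which summarizes one design cell by its accuracy and the 0.1, 0.5, and 0.9 RT quantiles and then takes the difference between model-predicted and observed summaries. The function names are hypothetical, and the "predicted" data here are simply the observed RTs shifted by 50 ms rather than output from a fitted model.

```python
import numpy as np

QUANTILES = (0.1, 0.5, 0.9)  # leading edge, median, and tail


def cell_summary(rt, correct):
    """Accuracy and correct-RT quantiles for one cell of the design."""
    rt = np.asarray(rt, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    return {
        "accuracy": correct.mean(),
        "rt_quantiles": np.quantile(rt[correct], QUANTILES),
    }


def misfit(observed, predicted):
    """Predicted minus observed, for accuracy and each RT quantile."""
    return {
        "accuracy": predicted["accuracy"] - observed["accuracy"],
        "rt_quantiles": predicted["rt_quantiles"] - observed["rt_quantiles"],
    }


# Made-up observed data for one cell: 101 trials, every 10th an error.
obs_rt = np.linspace(0.3, 1.3, 101)
obs_correct = np.ones(101, dtype=bool)
obs_correct[::10] = False

# Stand-in for model-simulated data: same choices, RTs 50 ms slower.
pred_rt = obs_rt + 0.05

observed = cell_summary(obs_rt, obs_correct)
predicted = cell_summary(pred_rt, obs_correct)
difference = misfit(observed, predicted)
```

The same summaries can be computed for error responses and for each participant and cell, and then plotted with observed and predicted values side by side.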
We recommend visually inspecting model fits for each participant individually. Poor individual-level fits can reveal noncompliant participants (e.g., using alternate or contaminant strategies) because the EAM failed to adequately describe the processes at play. We suggest running modeling analyses with and without poorly fit participants and comparing the results of the two analyses. Convergent results increase confidence that a finding is robust and not unduly influenced by potentially noncompliant participants. By contrast, discrepancies should decrease confidence and spur additional model development and exploration of individual differences. Any divergent findings should be reported and discussed transparently. We caution that graphical assessment of fit is inherently subjective and thus subject to human error and judgment biases (Browne & Cudeck, 1992; Korteling & Toet, 2022; Kunda, 1990). Confidence can be increased by using multiple independent assessors (D’Agostino, 1986). For reporting purposes, it usually suffices to show the overall fit averaged over participants (although the model was fit individually) because it may be infeasible to display comprehensive model fits for potentially hundreds of individual participants.
Parameter recovery
Having chosen an adequate model, it is good practice to assess parameter recovery (Heathcote, Brown, & Wagenmakers, 2015). “Parameter recovery” refers to the practice of fitting a model to many synthetic data sets (simulated from known parameter values) and assessing whether the model consistently returns the known data-generating parameters. Recovery can be assessed graphically by plotting the correlation between true and recovered values. Parameter-recovery studies have utility for establishing the reliability of model inferences and identifying potentially unreliable (poorly recovered) parameters/effects. Parameter-recovery simulations are also useful for assessing a design’s suitability for modeling (in terms of trial and participant numbers) and verifying the efficacy of experimental manipulations (in terms of expected effect size; Heathcote, Brown, & Wagenmakers, 2015; Miletić et al., 2017; R. C. Wilson & Collins, 2019). To generate the synthetic data used to assess recovery, one can simulate from parameter values that have been previously reported for similar tasks (Tran et al., 2021), values (e.g., posterior means) derived from fitting the target model to prior data, or values derived from the beliefs of subject-matter experts (Gronau, Ly, & Wagenmakers, 2020; Kadane & Wolfson, 1998; Stefan et al., 2022). Parameter recovery should be assessed across a range of “true” generating values in case there are biases in specific generating-parameter ranges.
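The logic of a parameter-recovery study can be sketched in Python with a deliberately simplified stand-in model. As an assumption made for tractability, the sketch uses a single-boundary diffusion (Wald) model with threshold `b`, drift rate `v`, unit within-trial noise, and no nondecision time, because its first-passage times follow an inverse Gaussian distribution with closed-form maximum-likelihood estimates. It is not one of the standard two-choice EAMs discussed in this article, and the function names and parameter ranges are hypothetical.

```python
import numpy as np


def simulate_wald(b, v, n_trials, rng):
    """First-passage times to threshold b with drift v and unit noise.

    These times are inverse Gaussian with mean b / v and shape b ** 2.
    """
    return rng.wald(mean=b / v, scale=b**2, size=n_trials)


def fit_wald(rt):
    """Closed-form ML estimates of the inverse Gaussian, mapped to (b, v)."""
    mu_hat = rt.mean()
    shape_hat = rt.size / np.sum(1.0 / rt - 1.0 / mu_hat)
    b_hat = np.sqrt(shape_hat)
    return b_hat, b_hat / mu_hat


def recovery(n_trials, n_datasets=50, seed=1):
    """Correlation between generating and recovered parameter values."""
    rng = np.random.default_rng(seed)
    # Generating values span a range so that biases in specific
    # parameter regions have a chance to show up.
    true_b = rng.uniform(1.0, 2.0, n_datasets)
    true_v = rng.uniform(2.0, 4.0, n_datasets)
    est = np.array([
        fit_wald(simulate_wald(b, v, n_trials, rng))
        for b, v in zip(true_b, true_v)
    ])
    return {
        "r_b": np.corrcoef(true_b, est[:, 0])[0, 1],
        "r_v": np.corrcoef(true_v, est[:, 1])[0, 1],
    }


# Compare recovery for a small and a large number of trials per data set.
few_trials = recovery(n_trials=50)
many_trials = recovery(n_trials=1000)
```

Repeating the simulation for different numbers of trials (or synthetic participants) shows how recovery changes with design size. For a real analysis, the same loop would be run with the target EAM and the fitting software that will be used on the empirical data.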
Test and interpret parameter effects
Having established a reliably estimated model that is preferred based on relative and absolute fit, focus turns to testing and interpreting parameter effects (i.e., differences across conditions or correlations) contained in the preferred model. Tests can be conducted using traditional statistical approaches (e.g., analysis of variance, Ratcliff, Thapar, Gomez, & McKoon, 2004;
Interpreting parameters involves mapping parameter effects back to cognitive theory. For example, in working-memory tasks, accumulation-rate effects might be interpreted in terms of differences in item activation in memory (e.g., Donkin & Nosofsky, 2012; Ratcliff, 1978; Zhou et al., 2021). By contrast, in preferential-choice tasks, rate effects might be interpreted in terms of subjective utility or preference strength (e.g., Busemeyer et al., 2019; Konovalov & Krajbich, 2017). Likewise, in different tasks, threshold effects might be interpreted in terms of speed-accuracy settings (e.g., Evans, 2021) or the operation of adaptive cognitive control (e.g., Boag, Strickland, Heathcote, et al., 2019; Strickland et al., 2018). Linking parameters to broader cognitive theory helps readers understand and interpret the results of an EAM analysis.
These evaluation practices constitute a minimal set of checks intended to promote robust cognitive modeling (M. D. Lee et al., 2019) rather than an exhaustive list of best practices. A complete tutorial on evaluating EAMs is beyond the scope of this article. We point interested readers to a number of excellent sources on more advanced model-evaluation techniques (e.g., Evans, 2019; Heathcote, Brown, & Wagenmakers, 2015; Shiffrin et al., 2008). These techniques include model recovery and cross-fitting methods to assess mimicry between models (Donkin, Brown, Heathcote, & Wagenmakers, 2011; Evans, 2020; Hawkins, Forstmann, et al., 2015) and generalization tests to assess how well model predictions match new data and experimental contexts (Busemeyer & Wang, 2000; Vehtari et al., 2017).
Reporting an EAM Analysis
We encourage researchers to carefully report all stimuli, materials, procedures, and analysis choices. Table 3 lists essential information to include when reporting an EAM analysis. The purpose of including this information is to help readers interpret and assess the quality of the analysis and facilitate future follow-up studies, such as replications and meta-analyses of EAM results (Theisen et al., 2021; Tran et al., 2021). Providing contextual information (e.g., justifying research goals and design choices) can help readers interpret findings and determine their scope of applicability. Thoroughly describing the experimental procedure and analysis pipeline can help readers assess the trustworthiness of results. To promote transparency and openness in science (Hales et al., 2019; Nosek et al., 2016), we encourage researchers to openly report potential flaws of models and methods. To further encourage open and reproducible research (Crüwell, Van Doorn, et al., 2019; Gilmore et al., 2017; Munafò et al., 2017), we recommend researchers share anonymized raw data (Martone et al., 2018; Wilkinson et al., 2016) and modeling and analysis code (McDougal et al., 2016; M. K. Wilson et al., 2019).
Table 3. Essential Components to Include When Reporting an Evidence-Accumulation-Model Analysis
Going Beyond the Standard Models
Here, we raise the issue of what to do when a proposed task violates the processing assumptions of standard EAMs or the standard framework fails to provide an adequate account of the data. In these situations, it is prudent to first search the EAM literature to find out whether an extended EAM already exists that may account for your data. The literature is replete with EAM variants that have been adapted to account for tasks and phenomena not accounted for in the basic EAM framework. One class of extended EAMs accounts for violations of within-conditions stationarity because of learning (Fengler et al., 2022; Fontanesi et al., 2019; Mendonça et al., 2020; Miletić et al., 2021; Pedersen et al., 2017; Pedersen & Frank, 2020; Sewell et al., 2019). In these models, a learning rule allows parameters to be updated from trial to trial in response to feedback (for a review, see Miletić et al., 2020). Extensions also exist that account for various violations of within-trials stationarity. These include models that allow for within-trials changes in evidence strength (Diederich, 2024; Holmes et al., 2016; Holmes & Trueblood, 2018; Krajbich et al., 2010; Maier et al., 2020; Sepulveda et al., 2020; Sullivan et al., 2015; Weichart et al., 2022; X. Yang & Krajbich, 2023) or thresholds (Busemeyer & Rapoport, 1988; Evans, Hawkins, & Brown, 2020; Hawkins, Forstmann, et al., 2015; P. L. Smith & Ratcliff, 2022; Voskuilen et al., 2016; Voss et al., 2019; Zhang et al., 2014) and the effects of multiple, potentially conflicting sources of evidence on the accumulation process (P.-S. Lee & Sewell, 2024; Little et al., 2018; Ulrich et al., 2015; Weichart et al., 2020; White et al., 2011, 2018). Another highly active area of model-development research seeks to refine the standard account of nondecision time by titrating the sensory encoding and motor components (Bompas et al., 2023; Kelly et al., 2021; Servant et al., 2021; Weindel, Gajdos, et al., 2021).
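To make the idea of a trial-to-trial learning rule concrete, the following Python sketch shows one simple way such an extension might be set up. It is an illustrative assumption, not the specification of any of the cited models: expected values for two options are updated with a delta (prediction-error) rule, and the drift rate on each trial is taken to be a scaled difference between the two expected values. The function name `rl_drift_rates` and the parameters `alpha` (learning rate), `scale` (drift scaling), and `q_init` (initial expected value) are hypothetical.

```python
import numpy as np

def rl_drift_rates(choices, rewards, alpha=0.1, scale=2.0, q_init=0.5):
    """Trial-by-trial drift rates implied by a delta learning rule.

    choices: option chosen on each trial (0 or 1).
    rewards: feedback received on each trial.
    The drift on trial t is computed from the expected values held
    before that trial's feedback is observed.
    """
    q = np.full(2, q_init)                  # expected value of each option
    drifts = np.empty(len(choices))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        drifts[t] = scale * (q[0] - q[1])   # drift favors the higher-valued option
        q[c] += alpha * (r - q[c])          # prediction-error update for chosen option
    return drifts, q

# Illustrative feedback schedule: option 0 is rewarded on 80% of trials,
# option 1 on 20%; choices are random here purely to drive the updates.
rng = np.random.default_rng(7)
n_trials = 200
choices = rng.integers(0, 2, n_trials)
reward_prob = np.where(choices == 0, 0.8, 0.2)
rewards = (rng.random(n_trials) < reward_prob).astype(float)

drifts, q_final = rl_drift_rates(choices, rewards)
print("first drift:", drifts[0], "last drift:", round(drifts[-1], 3))
```

Because the drift rate changes across trials within a condition, a model of this kind relaxes the stationarity assumption instead of treating learning as unexplained noise in a single fixed drift rate.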
The basic framework has been extended to decisions involving more than one discrete response per trial (e.g., best-to-worst ranking tasks, Hawkins et al., 2014; double-response paradigms, Evans, Dutilh, et al., 2020; Taylor et al., 2024; Ulrich & Stapf, 1984), decisions with continuous-response spaces (e.g., color-matching and continuous-scaling tasks, Kvam, 2019a, 2019b; Kvam et al., 2023; Kvam & Turner, 2021; Qarehdaghi & Amani Rad, 2022; P. L. Smith, 2016, 2019; P. L. Smith et al., 2020; Zhou et al., 2021), and decisions that involve integrating information along multiple attributes or feature dimensions (Busemeyer et al., 2019; Busemeyer & Townsend, 1993; Fific et al., 2010; Krajbich & Rangel, 2011; Nosofsky et al., 2011; Nosofsky & Palmeri, 1997; Roe et al., 2001; Strickland et al., 2023; Trueblood et al., 2014; Tsetsos et al., 2010).
If no appropriate model exists, focus turns to model development. The goal of model development is to construct a new model that accounts for phenomena that existing models do not (Crüwell, Stefan, & Evans, 2019). This is often accomplished by adapting or extending an existing model (e.g., Brown & Heathcote, 2005; Evans, Brown, et al., 2018; Hawkins & Heathcote, 2021; Miletić et al., 2021; Ratcliff & Rouder, 1998) but can also involve constructing an entirely new model to explain the target paradigm (e.g., Ratcliff, 1978; Usher & McClelland, 2001). Model development is an iterative and exploratory process (Crüwell, Stefan, & Evans, 2019), and it may require specialized knowledge of mathematics and computer programming to successfully build and implement a new model. We refer interested readers to several excellent resources on cognitive-model development (Busemeyer & Diederich, 2010; Farrell & Lewandowsky, 2018; M. D. Lee & Wagenmakers, 2014).
One focus of model development concerns how to incorporate choice confidence ratings into the standard account of decision-making (D. G. Lee et al., 2023; M. D. Lee & Dry, 2006; Moran et al., 2015; Pleskac & Busemeyer, 2010; Ratcliff & Starns, 2009, 2013; Van Zandt & Maldonado-Molina, 2004). Confidence ratings offer a third data source (in addition to choice and RT) with which to constrain models of decision-making (Vickers, 2014). Current models make different assumptions about how confidence-rating decision trials should be structured. For example, Ratcliff and Starns (2009) measured confidence during the initial decision, whereas Pleskac and Busemeyer (2010) measured confidence during a subsequent additional decision stage (see also Moran et al., 2015). This difference is critical if confidence ratings are based on different evidence before, during, and after a decision (D. G. Lee & Pezzulo, 2022, 2023). Such structural differences make it difficult to compare models (with both other confidence models and standard EAMs), especially if eliciting the confidence rating changes how individuals perform the task. The task of refining and unifying models of choice confidence is an active area of model-development work.
Concluding Remarks
Our aim in this article was to provide practical guidance on planning experimental tasks for EAMs. To this end, we gave advice on how to design tasks that meet EAM assumptions, how to relate experimental manipulations to EAM parameters, and how to collect and prepare task data for EAM analysis. We discussed techniques for evaluating EAMs and warned of common pitfalls that can arise in EAM analyses. Some issues, such as sample-size planning, depend on the goals of the researcher and may require careful judgment. This article is intended as a resource to aid in planning experiments for reliable EAM analysis. By encouraging good task-design practices, we hope to improve the quality and trustworthiness of future EAM studies and help users obtain valid and interpretable results from EAMs.
