Introduction
A thorough approach to the delivery of behavioral services involves repeated observation and measurement of behavior before, during, and following treatment (Kazdin, 2019). Although this approach to assessment ensures that the beneficiaries of behavioral services receive effective treatment, the repeated measurement of behavior using human observers remains a process that can be challenging in practice and in research. For example, parents, teachers, and even technicians may struggle to measure behavior consistently while simultaneously implementing interventions with high integrity (Bottini et al., 2021). Similarly, tasks such as tending to other children may interfere with data collection, especially for parents and teachers. For complex behavior, a more rigorous method involves recording the behavior on video for subsequent scoring or having an observer dedicated exclusively to data collection. However, using human observers to measure behavior from video recordings remains a costly endeavor as some additional time must be reserved for data collection (Dufour et al., 2020). Moreover, researchers must hire a second observer to make the results more believable by monitoring interobserver agreement, regardless of who is collecting the data (Hausman et al., 2022). To make data collection easier to implement and less costly, practitioners and researchers may use discontinuous measures, but these methods may produce less precise results (Falligant & Vetter, 2020; Leblanc et al., 2020). Thus, collecting data presents many challenges to those who must carry it out on a daily basis.
One potential solution to improve the feasibility of measuring behavior involves the use of artificial intelligence, or more specifically, machine learning. A subfield of artificial intelligence, machine learning trains computer algorithms to identify and use patterns from different sources of data, such as video recordings (Hu et al., 2023). In supervised machine learning, an experimenter provides a series of samples to an algorithm (i.e., a set of computerized mathematical instructions). The samples represent exemplars (and nonexemplars) of what the computer should “learn” to discriminate. Each sample is comprised of two components: the features and the class labels. Features are the input provided to the algorithm whereas the class labels are the responses, or output, that the algorithm should produce based on those features.
For example, let’s assume that we want to train an algorithm to identify a target behavior on video recordings. In this case, each second of the video could be used as a sample: a 5-min video would contain 300 samples. In each sample, the features could be the video frames (images) at each second, and the class label would indicate whether the target behavior was present in that specific second. The algorithm would then train a model that uses these features (video frames) to identify, or predict, the class labels (absence or presence of the target behavior). The last step involves testing the model with new samples on which it was never trained, which is akin to the process of measuring generalization. In our example, testing would involve a different video than the one used for training.
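To make the structure of these samples concrete, a minimal Python sketch follows; the class and function names are purely illustrative and do not come from any particular study or software package.

from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    features: np.ndarray  # the video frame (image) for a given second
    label: int            # 1 = target behavior present, 0 = absent

def build_dataset(frames, labels):
    """Pair each per-second frame with its observer-scored label."""
    return [Sample(frame, label) for frame, label in zip(frames, labels)]

# A 5-min video scored second by second would yield 300 such samples; a
# different video, never used for training, would serve as the test
# (generalization) set.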
One of the costliest challenges in machine learning involves creating high-quality labeled datasets, which are required to train new models. One advantage of using machine learning with behavior science is that most practitioners and researchers typically collect rigorous data about behavior (i.e., high-quality labeled datasets). Put differently, the data required to develop novel machine learning models are already collected as part of our standard assessment and treatment procedures. Another commonality between behavior science and machine learning includes their emphasis on generalization. In machine learning, models that do not generalize to novel datasets are typically discarded. These similarities should facilitate and spur further collaboration between behavioral scientists and machine learning experts.
Due to its repetitive nature, motor stereotypy in autistic children was one of the first targets for applying machine learning to measurement in behavior science (Alcañiz Raya et al., 2020; Fasching et al., 2015; Goodwin et al., 2011; Rad et al., 2016; Shabaz et al., 2021). Researchers have defined stereotypy as repetitive and rhythmic patterns of behavior that typically persist in the absence of social consequences (Rapp & Vollmer, 2005). Examples of motor stereotypy include repetitive movements and gestures such as mouthing, hand flapping, and body rocking (Graber et al., 2023; Thakore et al., 2024; Tse et al., 2018). Several studies have already examined the measurement of motor stereotypy using machine learning (Fasching et al., 2015; Rad et al., 2016; Shabaz et al., 2021), but the current study focuses on another form of the behavior: vocal stereotypy.
Vocal stereotypy is often observed in autistic individuals and refers “to any repetitive sounds or words produced by an individual’s vocal apparatus that are maintained by nonsocial reinforcement” (Lanovaz & Sladeczek, 2012, p. 148). Contrary to other types of vocal responses, the behavior is not used for communication (i.e., social reinforcement). The main hypothesis to explain the maintenance of vocal stereotypy is that engaging in the behavior produces some type of sensory stimulation that results in automatic reinforcement. Examples of vocal stereotypy include behaviors such as unclear and repetitive sounds, acontextual grunting or screaming, repeating script lines from TV shows, and laughter that does not align with the social context (Ahrens et al., 2011; Lanovaz & Sladeczek, 2012; Martinez et al., 2016). Although the behavior is not physically harmful to the person or others, practitioners and researchers may target vocal stereotypy for reduction when its intensity or frequency interferes with learning, social interactions, academic engagement, and language development (Wang et al., 2020).
To our knowledge, only a handful of researchers have used machine learning to detect vocal stereotypy (Dufour et al., 2020; Khan et al., 2023; Min & Fetzner, 2018, 2019). For example, Min and Fetzner (2018) employed a machine learning algorithm to analyze audio files recorded from four children with autism to detect the presence of vocal stereotypy in 2- to 20-s clips. Their initial analyses produced a sensitivity ranging from 73% to 93%, but the study lacked important information to draw clear conclusions. Notably, the researchers did not describe how they tested their models on untrained data (i.e., generalization), which brings into question the external validity of their results. In the following year, Min and Fetzner (2019) trained a novel model on publicly available data. Their model detected 86% of occurrences of vocal stereotypy (i.e., number of true positives detected by the model divided by the total number of true positives) on a generalization set consisting of two children with autism. The study did not report other metrics for the generalization set (e.g., false negatives, true negatives), limiting inferences that we can draw from the models. Furthermore, both prior studies have defined vocal stereotypy as an expression of emotional frustration that resembled screaming. Vocal stereotypy is not typically defined as an emotional response (Lanovaz & Sladeczek, 2012), which may restrict the extent to which the models can detect behaviors other than screaming.
More recently, Khan et al. (2023) proposed machine learning-based software to identify vocal responses in children with autism. Using 2,575 samples from 76 different children, their objective was to classify the vocal responses into seven categories: speech, clapping, echolalia, non-speech, repetitive speech, unusual noises, and pronoun reversal. For their analyses, the researchers combined repetitive speech, echolalia, speech, and pronoun reversal into the same category (i.e., speech). Their best model to differentiate speech, non-speech, unusual sounds, and claps achieved an accuracy of 77%. Again, the study did not report other metrics such as balanced accuracy or the kappa statistic, which matters because the dataset was imbalanced (i.e., contained different numbers of each behavior). Additionally, including repetitive speech and echolalia in the same category as speech seems conceptually inconsistent, as the former are typically considered forms of vocal stereotypy. Therefore, two common limitations of the previous studies were that (a) they only explored whether vocal stereotypy was present or not in short clips and (b) their definitions of vocal stereotypy departed significantly from what is typically used in behavior science. The implementation of single-case designs to monitor behavior typically involves measuring the duration of vocal stereotypy across longer sessions. As such, the prior algorithms have limited utility for practice and research in behavior science.
To address this issue, Dufour et al. (2020) used an artificial neural network architecture (i.e., a type of machine learning algorithm) to measure the duration of vocal stereotypy in children with autism. To train their models, the study included more than 27 hr of video recordings collected from eight children with autism who engaged in different forms of vocal stereotypy. Although the session-by-session correlations between the percentages reported by human observers and those reported by the machine learning models showed promising results, the study had several limitations that need to be addressed. First, many of their models produced error rates higher than those typically tolerated by behavioral scientists when using other methods of data collection (e.g., discontinuous methods; Leblanc et al., 2020). The main cause of this error remains unknown as it was not reported by the researchers. Second, the study limited its analysis to a simple model architecture that may not have been optimal for detecting vocal stereotypy. Leveraging novel architectures, such as image recognition with vision transformers, which had not been described when Dufour et al. (2020) conducted their original study, may produce more accurate models (Dosovitskiy et al., 2021; Khan et al., 2022). Finally, their analyses relied exclusively on the Mel frequency cepstral coefficients (MFCC; see “Method” section for more details) of the audio recordings to extract the features from the sound.
Incorporating additional audio features, such as Mel spectrograms, could potentially yield models with different performance characteristics (Meghanani et al., 2021; Rawat et al., 2023; Turab et al., 2022). Building upon the work of Dufour et al. (2020), the present study aimed to address these limitations by applying enhanced methods to their original datasets. Specifically, we incorporated both MFCC and Mel spectrogram features. While MFCCs effectively summarize key information relevant to human hearing, Mel spectrograms retain crucial temporal information, allowing for the assessment of changes over time. Furthermore, we employed a novel model architecture, the Cross-Covariance Image Transformer (XCiT; Ali et al., 2021), which was not available when Dufour et al. published their original study. As XCiTs were originally designed for natural image processing, significant adaptation was required to enable them to process MFCC and Mel spectrogram features derived from audio. Thus, the purpose of our study was to extend Dufour et al. (2020) by training and testing a novel model architecture on their original dataset.
Method
Dataset
We used the same dataset as Dufour et al. (2020). The dataset included eight children with autism who engaged in vocal stereotypy and participated in a larger study on the use of mobile technology to reduce engagement in stereotypy (see Préfontaine et al., 2019; Trudel et al., 2021). Table 1 presents the age, gender, number and duration of sessions as well as the duration and topography of vocal stereotypy for each participant. Each child participated in 6 to 38 sessions for a total time ranging from 4,015 s (1.1 hr) to 27,448 s (7.6 hr). On average, each session lasted 10 to 12 min. Some sessions involved parents interacting with their child, background music, and the implementation of differential reinforcement (including verbal praise).
Table 1. Participant Characteristics.
In their study, Dufour et al. (2020) “defined vocal stereotypy as acontextual or unintelligible sounds or words produced by the vocal apparatus of the child” (p. 372). The researchers extracted the sound recordings from video files collected using a standard-definition camcorder, placed on a tripod, that filmed baseline and intervention sessions. The original authors measured the second-by-second duration of stereotypy using the audio recordings. Each second was scored using a binary approach: vocal stereotypy present (1) or no vocal stereotypy (0). Vocal stereotypy was recorded as being present as soon as it occurred during a second (as in partial interval recording with 1-s intervals). For each participant, a second observer measured interobserver agreement for at least 25% of sessions.
Feature Extraction
The first step in machine learning involves extracting features, which the algorithm uses to detect vocal stereotypy. The current study used both the Mel spectrogram and the MFCC of the audio recordings. Simply put, the Mel spectrogram represents the amplitude, or loudness, of the sound over different frequency bands (based on human hearing) across time. The MFCC involves a transformation of the Mel spectrogram using a function called the discrete cosine transform to extract a series of coefficients. Specifically, computing the MFCC involves taking the Fourier transform of a windowed excerpt of the signal, mapping the powers of the spectrum onto the Mel scale, taking the log of the power at each Mel frequency, and applying the discrete cosine transform.
Dufour et al. (2020) had used a simpler procedure wherein the values of the MFCC were kept in a numerical format. That said, the features extracted using the Mel spectrogram and the MFCC may also be treated as images of the audio recordings. Given the availability of highly accurate image classification algorithms, the current study used these images of the sounds to improve the models. To this end, we divided each recording into smaller audio files of one second each and extracted the Mel spectrogram as well as the MFCC in an image-based format. The sample rate of the audio files was 22,050 Hz, which means each second included 22,050 samples. To extract features from these audio files using the Mel spectrogram and the MFCC, we set the window size to 1,024 and the hop length to 256 because this combination produced the best accuracy. As such, each second was divided into 87 timesteps of approximately 11.6 ms each, and we extracted 128 features per timestep for the MFCC and for the Mel spectrogram. The 128 features represent the number of filters (i.e., frequency bands or bins) extracted at each timestep by the transformations. This manipulation generated two images containing 11,136 pixels each, which were combined (i.e., concatenated) into an array of shape 2 × 128 × 87. Figures 1 and 2 present samples of the extracted features from a recording using the Mel spectrogram and the MFCC, respectively.
Figure 1. Sample of a Mel spectrogram.
Figure 2. Sample of Mel frequency cepstral coefficients.
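To make this step concrete, the following Python sketch reproduces the parameters reported above using the librosa library. The library and function names are shown for illustration only and may differ from our released code (available in the repository listed under Outcomes); the sample rate, window size, hop length, and number of filters match those reported in the text.

import numpy as np
import librosa

SR = 22050      # sample rate (samples per second)
N_FFT = 1024    # window size
HOP = 256       # hop length (~11.6 ms per timestep, 87 timesteps per second)
N_MELS = 128    # number of Mel filters / coefficients

def extract_features(one_second_clip: np.ndarray) -> np.ndarray:
    """Return a 2 x 128 x 87 array (Mel spectrogram and MFCC) for a 1-s clip."""
    mel = librosa.feature.melspectrogram(
        y=one_second_clip, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel)                            # Mel spectrogram in decibels
    mfcc = librosa.feature.mfcc(S=mel_db, sr=SR, n_mfcc=N_MELS)  # discrete cosine transform of the Mel spectrogram
    return np.stack([mel_db[:, :87], mfcc[:, :87]])              # concatenated channels: 2 x 128 x 87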
Algorithm
In the current study, we applied an XCiT transformer to train our models (Ali et al., 2021). Vision transformers can extract the global context of data, which helps them learn complex relations between features (Dosovitskiy et al., 2021). A vision transformer (ViT) divides images into square patches. The patches are then linearly transformed into vectors using a learnable linear projection. These vectorized patches are called tokens. Because ViT models do not know which token belongs to which part of the image, a positional embedding is added to the tokens. Positional embedding is a technique that provides the model with information about the relative order of the patches of the input (Jiang et al., 2022).
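The PyTorch sketch below illustrates this generic tokenization step. It is not the exact XCiT implementation we used; the patch size, embedding dimension, and layer names are illustrative values only.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut the input 'image' into square patches and project each into a token."""
    def __init__(self, in_channels=2, patch_size=8, embed_dim=192):
        super().__init__()
        # A strided convolution performs the patch split and the learnable
        # linear projection in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width) -> tokens: (batch, n_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

tokens = PatchEmbedding()(torch.randn(1, 2, 128, 87))              # 16 x 10 = 160 tokens
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], tokens.shape[2]))
tokens = tokens + pos_embed   # positional embedding: informs the model of patch order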
The central component of ViTs is the encoding layer, which is where the model learns from the input tokens. It includes multi-head self-attention and feed-forward networks. Self-attention extracts the relationships between different input tokens (Voita et al., 2019). More specifically, the mechanism examines interactions between the tokens (regardless of their distance) to capture any dependencies between them. This process can be performed several times in parallel, and each parallel instance is called a head. Each head focuses on different types of information using different learned linear transformations. Finally, the results obtained from the heads are provided to a feed-forward neural network as extracted features to classify the output (Geva et al., 2020). A feed-forward network is a simple network that connects the nodes of each layer to the next layer. The information flows from the input nodes, through any hidden layers, to the output nodes.
The problem with ViT models is that they have quadratic complexity (i.e., the time required to run the algorithm increases rapidly as the number of tokens increases), which is caused by the self-attention mechanism (Zhang et al., 2024). To address this concern, researchers have replaced self-attention with cross-covariance attention (XCA; Ali et al., 2021). The difference between self-attention and XCA is that the latter operates across feature channels rather than tokens. In regular ViT models, each token must attend to all other tokens, which makes the computation expensive. In XCA, each feature channel attends to the other feature channels, with their interactions computed over all tokens. This transformation reduces the model’s complexity without compromising its accuracy.
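The condensed PyTorch sketch below captures the core of the XCA idea: the attention map has shape dim × dim (channels by channels) rather than tokens by tokens, so the cost no longer grows quadratically with the number of tokens. It simplifies the published module (Ali et al., 2021), omitting details such as multiple heads and local patch interaction layers, and is not the exact code we used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Minimal single-head cross-covariance attention."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):                                       # x: (batch, n_tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q.transpose(-2, -1), dim=-1)            # (batch, dim, n_tokens)
        k = F.normalize(k.transpose(-2, -1), dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature     # channel-by-channel map: (batch, dim, dim)
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)    # back to (batch, n_tokens, dim)
        return self.proj(out)

output = XCA(dim=192)(torch.randn(1, 160, 192))                 # 160 tokens, 192 channels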
Pre-training
To speed up convergence and improve the robustness of the models (He et al., 2019; Hendrycks et al., 2019), the current study included a pre-training phase. In pre-training, researchers rely on a separate dataset to help the model learn general features and representations. We first pre-trained the model on an audio dataset called UrbanSound8K, which contains 8,732 labeled audio files from 10 different classes, none of which included vocal stereotypy. Even though the classes differed, pre-training may help the model learn the types of features that can be extracted from the input, reducing training time and improving robustness. During the pre-training procedure, the test set included 20% of the pre-training dataset whereas the remaining pre-training data were used for training. We also tuned the hyperparameters to identify those that produced the highest accuracy. Table 2 presents the values of the best hyperparameters, which were subsequently used for training (see below). The model was implemented with the PyTorch framework (version 2.2; Paszke et al., 2019).
Table 2. Hyperparameter Values of the Algorithm.
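In schematic form, pre-training and the subsequent fine-tuning followed the standard supervised loop sketched below in PyTorch. The data loaders, optimizer, and head-replacement line are illustrative placeholders rather than excerpts from our code; the hyperparameter values actually used appear in Table 2.

import torch
import torch.nn as nn

def run_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """One training pass over a labeled dataset of (features, labels) batches."""
    model.train()
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()

# 1. Pre-train the XCiT model on UrbanSound8K (10 classes, 80/20 train/test split).
# 2. Replace the classification head with a binary output (stereotypy vs. none)
#    and fine-tune on the stereotypy recordings, e.g. (attribute name illustrative):
# model.head = nn.Linear(model.head.in_features, 2)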
Procedures
Similarly to Dufour et al. (2020), we divided our study into three analyses: between-participant, within-participant, and hybrid. All analyses used the same pre-trained model as a starting point, but varied in the cross-validation methodology used for training, validation, and testing. Each machine learning model was trained to detect stereotypy on a second-by-second basis using a binary approach: vocal stereotypy present (1) or no vocal stereotypy (0). The detailed procedures are explained in the sections below.
Between-Participant Analysis
The between-participant analysis aims to determine whether the model can identify the duration of vocal stereotypy in a participant that the model has never seen before (i.e., generalization). This analysis is the most challenging because participants typically engage in different topographies of vocal stereotypy. We trained the models on seven participants while keeping one participant for testing the model. Hence, the analysis resulted in eight outcomes (one per participant). Our analyses involved a k-fold cross-validation procedure during the training process, with one participant in the validation set and six participants in the training set (the eighth participant was in the testing set). The validation set allowed us to evaluate how well the model performed on data it had not learned from during training. As indicated in Table 2, the maximum number of epochs (iterations) was 25. At each epoch, our analyses checked the kappa value (see Outcomes) on the validation set and kept only the best model (i.e., the one that yielded the highest kappa value during validation). To measure generalization, the models were tested on the participant that had not been included in the training and validation sets. Each participant was included in the test set once. For each test set (i.e., each participant), the model was trained seven times with a different participant in the validation set, so our results present the mean outcomes for each participant.
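The Python sketch below summarizes this protocol. The participant identifiers and the train_model and evaluate callables are placeholders standing in for our training and testing routines, not code from our repository.

from statistics import mean
from typing import Callable, Sequence

def between_participant_cv(participants: Sequence[str],
                           train_model: Callable,
                           evaluate: Callable) -> dict:
    """Each participant serves once as the generalization test set; each remaining
    participant serves once as the validation set, and the resulting test
    outcomes are averaged."""
    results = {}
    for test_p in participants:
        scores = []
        for val_p in participants:
            if val_p == test_p:
                continue
            train_ps = [p for p in participants if p not in (test_p, val_p)]
            model = train_model(train_ps, val_p)    # keep the epoch with the best validation kappa
            scores.append(evaluate(model, test_p))  # test on the untrained participant
        results[test_p] = mean(scores)              # mean over the seven runs
    return results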
Within-Participant Analysis
Our second analysis focused on a single participant at a time. The purpose was to train the model on several sessions from the participant and identify the duration of vocal stereotypy in a new session that the model had never learned from. During training, we held out one session as the test set and used the remaining sessions for the training and validation sets. The analysis involved four folds (i.e., the data were divided into four subsets): one fold served as the validation set and three folds were used for the training set. Because Matt had a small number of sessions compared to the others, his data were split into three folds instead. As before, we only kept the model from the epoch that produced the highest kappa on the validation set. Each session was in the test set once. Our code also shuffled the sessions and repeated the cross-validation twice per test set to ensure the model was not biased toward any of the folds. As the analysis was repeated twice, the results present the mean outcomes for each session.
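A comparable sketch for the within-participant splits follows. The session identifiers and helper callables are again placeholders, the fold rotation shown (each fold serving once as the validation set) is one plausible reading of the procedure, and Matt’s data would use three folds rather than four.

import random
from statistics import mean
from typing import Callable, Sequence

def within_participant_cv(sessions: Sequence[str],
                          train_model: Callable,
                          evaluate: Callable,
                          n_folds: int = 4,
                          repeats: int = 2) -> dict:
    """Hold out each session once for testing; run k-fold cross-validation on the
    remaining sessions, shuffled and repeated twice."""
    results = {}
    for test_session in sessions:
        rest = [s for s in sessions if s != test_session]
        scores = []
        for _ in range(repeats):
            random.shuffle(rest)
            folds = [rest[i::n_folds] for i in range(n_folds)]
            for val_fold in folds:
                train_fold = [s for s in rest if s not in val_fold]
                model = train_model(train_fold, val_fold)    # best-kappa epoch kept
                scores.append(evaluate(model, test_session))
        results[test_session] = mean(scores)                 # mean outcome per session
    return results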
Hybrid Analysis
The third analysis combined the between- and within-participant approaches. As in the within-participant analysis, we held out one session as the test set and used the remaining sessions of that participant for the training and validation sets. The difference is that we also added data from the other seven participants to the training and validation sets. To balance the dataset, our code randomly selected the data from the other participants so that the amounts of between- and within-participant data were the same. Our cross-validation involved four folds with two repetitions. In this analysis, one fold was used for validation and three folds for training. The other procedures remained consistent with the within-participant analysis.
Outcomes
To compare our results to Dufour et al. (2020), our code computed the same outcome measures on the test set: accuracy, the kappa statistic, and the session-by-session Pearson correlation. As the rank order of the sessions is important when analyzing single-case graphs, we also included the Spearman correlation. Our code and models are available at: https://doi.org/10.17605/OSF.IO/J6D3Z.
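For reference, these outcomes can be computed from the second-by-second labels and the per-session percentages as in the sketch below. The scikit-learn and SciPy functions are shown for illustration and are assumptions about tooling, not necessarily the implementations in our repository.

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

def second_level_metrics(human_labels, model_labels):
    """Accuracy and kappa on the second-by-second binary labels of a test set."""
    return (accuracy_score(human_labels, model_labels),
            cohen_kappa_score(human_labels, model_labels))

def session_level_correlations(human_pct, model_pct):
    """Pearson and Spearman correlations between per-session percentages of stereotypy."""
    return pearsonr(human_pct, model_pct)[0], spearmanr(human_pct, model_pct)[0]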
Results
Table 3 compares the mean outcomes (i.e., accuracy, kappa statistic, correlation) of the between-participant analyses for our models and those produced by Dufour et al. (2020). For the kappa statistic, the current study outperformed the previous study for all participants, and the values were near or above 0.50 for six of eight participants. Additionally, all models produced correlations near or above 0.90, with the exception of Dave. In contrast, only three of eight participants were near or above 0.90 in Dufour et al. (2020). Figure 3 shows that the percentages detected by each of our models closely matched those recorded by the human observer. Notably, the participant with the lowest kappa value, Nate, still had a very high Pearson correlation. Conversely, Dave had the lowest Pearson correlation despite having a kappa value higher than average. These results suggest that the models may differ in their identification of molecular and molar patterns of behavior across participants.
Table 3. Between-Participant Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 3. Between-participant analysis: correlation between the percentages measured by machine learning and those measured by the human observer across all sessions for each participant.
Table 4 shows the same outcomes for the within-participant analyses. Our models outperformed those reported by Dufour et al. (2020) on all metrics for each participant. The models produced kappa values near or above 0.70 for four of eight participants, and the Pearson correlation values were equal to or greater than 0.95 for all participants. Three of the four participants with the lowest kappa values (i.e., Billy Peter, Dan, and Nate) were the same as in the between-participant analysis. In contrast, Alia’s kappa value improved considerably from 0.49 to 0.80, suggesting that her form of vocal stereotypy was consistent across sessions but differed from those of other participants. Figure 4 presents the correspondence between the machine and human measures in a graphical format for the within-participant analysis. The points remained close to the linear regression line, indicating that both measures followed similar patterns. A comparison with the between-participant analysis shows that the within-participant analysis produced greater improvements in the detection of molecular patterns (i.e., accuracy and kappa) than of molar patterns (i.e., correlations).
Table 4. Within-Participant Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 4. Within-participant analysis: correlation between the percentages measured by machine learning and those measured by the human observer across sessions for each participant.
For the hybrid analysis, Table 5 shows that the new models produced better outcomes than those reported by Dufour et al. (2020). The kappa value was above 0.50 for all participants, and the Pearson correlations were all above 0.90. The participants with the lowest accuracy and kappa values remained the same as in the prior analysis: Billy Peter, Dan, and Nate. Yet again, the molecular outcomes did not systematically match the molar outcomes. For example, Matt had one of the lowest accuracy values but still showed among the highest correlations. Figure 5 shows how the values predicted by machine learning followed patterns consistent with those produced by a human observer. A comparison with the prior analyses indicates that the hybrid analysis generally produced improvements over the between-participant analysis, especially for accuracy and kappa values. Conversely, correlations were generally higher for the within-participant analysis, but the results were mixed for kappa and accuracy. Taken together, these results suggest that the within-participant and hybrid analyses produce more accurate outcomes than the between-participant analysis.
Table 5. Hybrid Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 5. Hybrid analysis: correlation between the percentages measured by machine learning and those measured by the human observer across sessions for each participant.
Discussion
Despite using the same data for training, our new models outperformed those produced by Dufour et al. (2020) on all metrics. Moreover, the kappa values were lowest for the between-participant models, which was consistent with expectations and prior research. Hybrid and within-participant models typically produced better results, but they require more effort and expertise to train in practice and research. The accuracy of our models also compared favorably to those of studies detecting vocal stereotypy in short time intervals (Khan et al., 2023; Min & Fetzner, 2018, 2019). However, our models allow for a more comprehensive measurement of vocal stereotypy as they can detect duration on a second-by-second basis.
There are several potential explanations for why the current models produced more accurate results than those previously reported by Dufour et al. (2020). First, the inclusion of more features may have provided a richer data representation and allowed the algorithm to model the data more comprehensively. Second, the XCiT model may learn patterns and complex relationships across features more readily than the simple network used in the prior study. Third, we tuned the number of epochs and pre-trained the models in the current study, which may have further improved the outcomes.
Interestingly, nearly all our models produced correlations similar to those between continuous and discontinuous methods of measurements (Leblanc et al., 2020). That is, Leblanc et al. (2020) reported correlation coefficients between continuous and discontinuous measurements within the range of those observed in the current study. Thus, the errors of measurement produced by our models are similar to those accepted by practitioners and researchers who use discontinuous methods. Another relevant observation is that kappa values in our study were all highly correlated with those reported in Dufour et al. (2020). Put differently, the participants with the highest kappa values in Dufour et al. were also those that had the highest kappa values in our study (see Emile). Conversely, our models typically performed worst on the same participants (see Dan).
The prior correlations suggest that factors beyond algorithm selection and training may make the measurement of vocal stereotypy more challenging. For example, the models produced poor agreement for Alia in the between-participant analysis compared to the within-participant and hybrid analyses. The characteristics of her voice could potentially explain these results, as she was the only female in our sample. Hence, only males were used to train her model during the between-participant analysis (as there was no other female). With the exception of Matt (who had strong outcomes despite few sessions), the best kappa values were typically obtained for the participants with the most sessions (i.e., Dave, Emile, and Owen). These three participants also had the lowest proportion of background music during their sessions, which may have facilitated detection. Other potential variables that may have affected detection include the presence of other background noises, the topography of vocal stereotypy, and the relative percentage of the behavior during sessions.
A future direction for research is to test the current model on novel datasets to examine the generalizability of our findings. As each analytical approach has its advantages and disadvantages, more research is needed to identify which would be most beneficial to research and practice. For example, the main advantage of between-participant models is that they do not need to be re-trained for each new individual. A sufficiently well-trained model may apply to any individual exhibiting similar behavior. The problem is that training generalizable models may require data from tens, if not hundreds, of individuals who engage in vocal stereotypy. The current study only included eight participants, making the application of the model premature in practice. As an alternative, the within-participant training method has the advantage of being independently tailored to each individual, requiring only one participant. As single-case methodology is often at the core of behavior analysis, this approach has the added benefit of remaining consistent with the characteristics of the science.
Nevertheless, an issue with the within-participant approach is that the models need to be retrained for each participant, which requires some additional work compared to the between-participant approach. Furthermore, many questions remain unanswered: How many sessions of observation does it take to create an individual-level model? How does the accuracy of a model for an individual improve as more data are added? How many sessions of data collection are needed before accuracy reaches an acceptable threshold? If within-participant models were to be adopted in the future, researchers should examine these questions. The hybrid model may also prove useful as it allows adding data to within-participant models, potentially reducing the number of sessions that need to be scored at the individual level. That said, more research must be conducted to confirm this hypothesis. Regardless of the approach used by researchers, examining the generalizability of our models is crucial prior to their adoption in practice.
While our results are promising, the study has some limitations. All the data originated from the same research team using a specific protocol (see Dufour et al., 2020), which may limit generalization and may have introduced bias. Furthermore, the participants engaged in a limited number of forms of vocal stereotypy. Future research could address this issue by using our models as starting points (i.e., as pre-trained models) to train novel models with more participants and different forms of vocal stereotypy. Another limitation is that the kappa values were not as high as the correlations, which suggests the models may not be well suited to analyzing molecular (i.e., second-by-second) patterns of behavior.
In the current study, a human observer determined whether vocal stereotypy was absent or present during each second. We then compared the measure of the human observer with the one produced by the machine to calculate agreement. In other words, our study examined the concurrent validity of our new measure: to what extent does this new measure (i.e., machine learning) correlate with current practices (i.e., human observation)? To our knowledge, researchers have no way to label the data other than relying on a human observer. Nonetheless, future research could examine other forms of validity to establish the utility of machine learning. For example, machine learning could measure behavior within a single-case design. If the measure accurately captured changes (and the absence of changes) across conditions, such research would provide further evidence for the criterion validity of using machine learning for observation.
Technology that relies on machine learning to detect behavior may transform how researchers and practitioners interact with their participants and clients. As models improve, fully automated systems that measure behavior in real time could become more common in clinics. We are already seeing this automation emerge in research on nonhuman animals (e.g., Isik & Unal, 2023; León et al., 2021; Nath et al., 2019). With clinical populations, such technology could free professionals, caregivers, and teachers to focus more on intervention and less on measurement, potentially improving treatment integrity, social validity, and quality of life. Other areas of research and practice that may be affected by machine learning include treatment selection, assessment, and progress monitoring. One risk related to the development of this technology is that it may only become available to those who can afford it. Consequently, researchers should strive for open access and permissive licensing when developing machine learning models so as to maximize accessibility. In the end, decreasing the amount of effort and resources required for routine procedures will not only benefit practitioners and researchers, but also those who are supported by behavior science.
Ethical Considerations
The project was approved by the Research Ethics Board in Education and Psychology at the Université de Montréal.
Consent to Participate
The legal guardian of each participant provided informed consent.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research project was supported in part by a grant from the Canadian Institutes of Health Research (# 136895).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.