Introduction
A thorough approach to the delivery of behavioral services involves repeated observation and measurement of behavior before, during, and following treatment (Kazdin, 2019). Although this approach to assessment ensures that the beneficiaries of behavioral services receive effective treatment, the repeated measurement of behavior using human observers remains a process that can be challenging in practice and in research. For example, parents, teachers, and even technicians may struggle to measure behavior consistently while simultaneously implementing interventions with high integrity (Bottini et al., 2021). Similarly, tasks such as tending to other children may interfere with data collection, especially for parents and teachers. For complex behavior, a more rigorous method involves recording the behavior on video for subsequent scoring or having an observer dedicated exclusively to data collection. However, using human observers to measure behavior from video recordings remains a costly endeavor as some additional time must be reserved for data collection (Dufour et al., 2020). Moreover, researchers must hire a second observer to make the results more believable by monitoring interobserver agreement, regardless of who is collecting the data (Hausman et al., 2022). To make data collection easier to implement and less costly, practitioners and researchers may use discontinuous measures, but these methods may produce less precise results (Falligant & Vetter, 2020; Leblanc et al., 2020). Thus, collecting data presents many challenges to those who must carry it out on a daily basis.
One potential solution to improve the feasibility of measuring behavior involves the use of artificial intelligence, or more specifically, machine learning. A subfield of artificial intelligence, machine learning trains computer algorithms to identify and use patterns from different sources of data, such as video recordings (Hu et al., 2023). In supervised machine learning, an experimenter provides a series of samples to an algorithm (i.e., a set of computerized mathematical instructions). The samples represent exemplars (and nonexemplars) of what the computer should “learn” to discriminate. Each sample is comprised of two components: the features and the class labels. Features are the input provided to the algorithm whereas the class labels are the responses, or output, that the algorithm should produce based on those features.
For example, let’s assume that we want to train an algorithm to identify a target behavior on video recordings. In this case, each second of the video could be used as a sample: a 5-min video would contain 300 samples. In each sample, the features could be the video frames (images) at each second, and the class label would indicate whether the target behavior was present in that specific second. The algorithm would then train a model that uses these features (video frames) to identify, or predict, the class labels (absence or presence of the target behavior). The last step involves testing the model with new samples on which it was never trained, which is akin to the process of measuring generalization. In our example, testing would involve a different video than the one used for training.
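To make the structure of these samples concrete, a minimal Python sketch follows; the class and function names are purely illustrative and do not come from any particular study or software package.

from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    features: np.ndarray  # the video frame (image) for a given second
    label: int            # 1 = target behavior present, 0 = absent

def build_dataset(frames, labels):
    """Pair each per-second frame with its observer-scored label."""
    return [Sample(frame, label) for frame, label in zip(frames, labels)]

# A 5-min video scored second by second would yield 300 such samples; a
# different video, never used for training, would serve as the test
# (generalization) set.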
One of the costliest challenges in machine learning involves creating high-quality labeled datasets, which are required to train new models. One advantage of using machine learning with behavior science is that most practitioners and researchers typically collect rigorous data about behavior (i.e., high-quality labeled datasets). Put differently, the data required to develop novel machine learning models are already collected as part of our standard assessment and treatment procedures. Another commonality between behavior science and machine learning includes their emphasis on generalization. In machine learning, models that do not generalize to novel datasets are typically discarded. These similarities should facilitate and spur further collaboration between behavioral scientists and machine learning experts.
Due to its repetitive nature, motor stereotypy in autistic children was one of the first targets for applying machine learning to measurement in behavior science (Alcañiz Raya et al., 2020; Fasching et al., 2015; Goodwin et al., 2011; Rad et al., 2016; Shabaz et al., 2021). Researchers have defined stereotypy as repetitive and rhythmic patterns of behavior that typically persist in the absence of social consequences (Rapp & Vollmer, 2005). Examples of motor stereotypy include repetitive movements and gestures such as mouthing, hand flapping, and body rocking (Graber et al., 2023; Thakore et al., 2024; Tse et al., 2018). Several studies have already examined the measurement of motor stereotypy using machine learning (Fasching et al., 2015; Rad et al., 2016; Shabaz et al., 2021), but the current study focuses on another form of the behavior: vocal stereotypy.
Vocal stereotypy is often observed in autistic individuals and refers “to any repetitive sounds or words produced by an individual’s vocal apparatus that are maintained by nonsocial reinforcement” (Lanovaz & Sladeczek, 2012, p. 148). Contrary to other types of vocal responses, the behavior is not used for communication (i.e., social reinforcement). The main hypothesis to explain the maintenance of vocal stereotypy is that engaging in the behavior produces some type of sensory stimulation that results in automatic reinforcement. Examples of vocal stereotypy include behaviors such as unclear and repetitive sounds, acontextual grunting or screaming, repeating script lines from TV shows, and laughter that does not align with the social context (Ahrens et al., 2011; Lanovaz & Sladeczek, 2012; Martinez et al., 2016). Although the behavior is not physically harmful to the person or others, practitioners and researchers may target vocal stereotypy for reduction when its intensity or frequency interferes with learning, social interactions, academic engagement, and language development (Wang et al., 2020).
To our knowledge, only a handful of researchers have used machine learning to detect vocal stereotypy (Dufour et al., 2020; Khan et al., 2023; Min & Fetzner, 2018, 2019). For example, Min and Fetzner (2018) employed a machine learning algorithm to analyze audio files recorded from four children with autism to detect the presence of vocal stereotypy in 2- to 20-s clips. Their initial analyses produced a sensitivity ranging from 73% to 93%, but the study lacked important information to draw clear conclusions. Notably, the researchers did not describe how they tested their models on untrained data (i.e., generalization), which brings into question the external validity of their results. In the following year, Min and Fetzner (2019) trained a novel model on publicly available data. Their model detected 86% of occurrences of vocal stereotypy (i.e., number of true positives detected by the model divided by the total number of true positives) on a generalization set consisting of two children with autism. The study did not report other metrics for the generalization set (e.g., false negatives, true negatives), limiting inferences that we can draw from the models. Furthermore, both prior studies have defined vocal stereotypy as an expression of emotional frustration that resembled screaming. Vocal stereotypy is not typically defined as an emotional response (Lanovaz & Sladeczek, 2012), which may restrict the extent to which the models can detect behaviors other than screaming.
More recently, Khan et al. (2023) proposed machine learning-based software to identify vocal responses in children with autism. Using 2,575 samples from 76 different children, their objective was to classify the vocal responses into seven categories: speech, clapping, echolalia, non-speech, repetitive speech, unusual noises, and pronoun reversal. For their analyses, the researchers combined repetitive speech, echolalia, speech, and pronoun reversal into the same category (i.e., speech). Their best model to differentiate speech, non-speech, unusual sounds, and claps achieved an accuracy of 77%. Again, the study did not report other metrics such as balanced accuracy or the kappa statistic, which matters because the dataset was imbalanced (i.e., contained different numbers of each behavior). Additionally, including repetitive speech and echolalia in the same category as speech seems conceptually inconsistent, as the former are typically considered forms of vocal stereotypy. Therefore, two common limitations of the previous studies were that (a) they only explored whether vocal stereotypy was present or not in short clips and (b) their definitions of vocal stereotypy departed significantly from what is typically used in behavior science. The implementation of single-case designs to monitor behavior typically involves measuring the duration of vocal stereotypy across longer sessions. As such, the prior algorithms have limited utility for practice and research in behavior science.
To address this issue, Dufour et al. (2020) used an artificial neural network architecture (i.e., a type of machine learning algorithm) to measure the duration of vocal stereotypy in children with autism. To train their models, the study included more than 27 hr of video recordings collected from eight children with autism who engaged in different forms of vocal stereotypy. Although the session-by-session correlations between the percentages reported by human observers and those reported by the machine learning models showed promising results, the study had several limitations that need to be addressed. First, many of their models produced error rates higher than those typically tolerated by behavioral scientists when using other methods of data collection (e.g., discontinuous methods; Leblanc et al., 2020). The main cause of this error remains unknown as it was not reported by the researchers. Second, the study limited its analysis to a simple model architecture that may not have been optimal for detecting vocal stereotypy. Leveraging novel architectures, such as image recognition with vision transformers, which had not been described when Dufour et al. (2020) conducted their original study, may produce more accurate models (Dosovitskiy et al., 2021; Khan et al., 2022). Finally, their analyses relied exclusively on the Mel frequency cepstral coefficients (MFCC; see “Method” section for more details) of the audio recordings to extract the features from the sound.
Incorporating additional audio features, such as Mel spectrograms, could potentially yield models with different performance characteristics (Meghanani et al., 2021; Rawat et al., 2023; Turab et al., 2022). Building upon the work of Dufour et al. (2020), the present study aimed to address these limitations by applying enhanced methods to their original datasets. Specifically, we incorporated both MFCC and Mel spectrogram features. While MFCCs effectively summarize key information relevant to human hearing, Mel spectrograms retain crucial temporal information, allowing for the assessment of changes over time. Furthermore, we employed a novel model architecture, the Cross-Covariance Image Transformer (XCiT; Ali et al., 2021), which was not available when Dufour et al. published their original study. As XCiTs were originally designed for natural image processing, significant adaptation was required to enable them to process MFCC and Mel spectrogram features derived from audio. Thus, the purpose of our study was to extend Dufour et al. (2020) by training and testing a novel model architecture on their original dataset.
Method
Dataset
We used the same dataset as Dufour et al. (2020). The dataset included eight children with autism who engaged in vocal stereotypy and participated in a larger study on the use of mobile technology to reduce engagement in stereotypy (see Préfontaine et al., 2019; Trudel et al., 2021). Table 1 presents the age, gender, number and duration of sessions as well as the duration and topography of vocal stereotypy for each participant. Each child participated in 6 to 38 sessions for a total time ranging from 4,015 s (1.1 hr) to 27,448 s (7.6 hr). On average, each session lasted 10 to 12 min. Some sessions involved parents interacting with their child, background music, and the implementation of differential reinforcement (including verbal praise).
Table 1. Participant Characteristics.
In their study, Dufour et al. (2020) “defined vocal stereotypy as acontextual or unintelligible sounds or words produced by the vocal apparatus of the child” (p. 372). The researchers extracted the sound recordings from video files collected using a standard-definition camcorder, placed on a tripod, that filmed baseline and intervention sessions. The original authors measured the second-by-second duration of stereotypy using the audio recordings. Each second was scored using a binary approach: vocal stereotypy present (1) or no vocal stereotypy (0). Vocal stereotypy was recorded as being present as soon as it occurred during a second (as in partial interval recording with 1-s intervals). For each participant, a second observer measured interobserver agreement for at least 25% of sessions.
Feature Extraction
The first step in machine learning involves extracting features, which the algorithm uses to detect vocal stereotypy. The current study used both the Mel spectrogram and the MFCC of the audio recordings. Simply put, the Mel spectrogram represents the amplitude, or loudness, of the sound over different frequency bands (based on human hearing) across time. The MFCC involves a transformation of the Mel spectrogram using a function called the discrete cosine transform to extract a series of coefficients. Specifically, computing the MFCC involves taking the Fourier transform of a windowed excerpt of the signal, mapping the powers of the spectrum onto the Mel scale, taking the log of the power at each Mel frequency, and applying the discrete cosine transform.
Dufour et al. (2020) had used a simpler procedure wherein the values of the MFCC were kept in a numerical format. That said, the features extracted using the Mel spectrogram and the MFCC may also be treated as images of the audio recordings. Given the availability of highly accurate image classification algorithms, the current study used these images of the sounds to improve the models. To this end, we divided each recording into smaller audio files of one second each and extracted the Mel spectrogram as well as the MFCC in an image-based format. The sample rate of the audio files was 22,050 Hz, which means each second included 22,050 samples. To extract features from these audio files using the Mel spectrogram and the MFCC, we set the window size to 1,024 and the hop length to 256 because this combination produced the best accuracy. As such, each second was divided into 87 timesteps of approximately 11.6 ms each, and we extracted 128 features per timestep for the MFCC and for the Mel spectrogram. The 128 features represent the number of filters (i.e., frequency bands or bins) extracted at each timestep by the transformations. This manipulation generated two images containing 11,136 pixels each, which were combined (i.e., concatenated) into an array of shape 2 × 128 × 87. Figures 1 and 2 present samples of the extracted features from a recording using the Mel spectrogram and the MFCC, respectively.
Figure 1. Sample of a Mel spectrogram.
Figure 2. Sample of Mel frequency cepstral coefficients.
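To make this step concrete, the following Python sketch reproduces the parameters reported above using the librosa library. The library and function names are shown for illustration only and may differ from our released code (available in the repository listed under Outcomes); the sample rate, window size, hop length, and number of filters match those reported in the text.

import numpy as np
import librosa

SR = 22050      # sample rate (samples per second)
N_FFT = 1024    # window size
HOP = 256       # hop length (~11.6 ms per timestep, 87 timesteps per second)
N_MELS = 128    # number of Mel filters / coefficients

def extract_features(one_second_clip: np.ndarray) -> np.ndarray:
    """Return a 2 x 128 x 87 array (Mel spectrogram and MFCC) for a 1-s clip."""
    mel = librosa.feature.melspectrogram(
        y=one_second_clip, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    mel_db = librosa.power_to_db(mel)                            # Mel spectrogram in decibels
    mfcc = librosa.feature.mfcc(S=mel_db, sr=SR, n_mfcc=N_MELS)  # discrete cosine transform of the Mel spectrogram
    return np.stack([mel_db[:, :87], mfcc[:, :87]])              # concatenated channels: 2 x 128 x 87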
Algorithm
In the current study, we applied an XCiT transformer to train our models (Ali et al., 2021). Vision transformers can extract the global context of data, which helps them learn complex relations between features (Dosovitskiy et al., 2021). A vision transformer (ViT) divides images into square patches. The patches are then linearly transformed into vectors using a learnable linear projection. These vectorized patches are called tokens. Because ViT models do not know which token belongs to which part of the image, a positional embedding is added to the tokens. Positional embedding is a technique that provides the model with information about the relative order of the patches of the input (Jiang et al., 2022).
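The PyTorch sketch below illustrates this generic tokenization step. It is not the exact XCiT implementation we used; the patch size, embedding dimension, and layer names are illustrative values only.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Cut the input 'image' into square patches and project each into a token."""
    def __init__(self, in_channels=2, patch_size=8, embed_dim=192):
        super().__init__()
        # A strided convolution performs the patch split and the learnable
        # linear projection in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width) -> tokens: (batch, n_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

tokens = PatchEmbedding()(torch.randn(1, 2, 128, 87))              # 16 x 10 = 160 tokens
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], tokens.shape[2]))
tokens = tokens + pos_embed   # positional embedding: informs the model of patch order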
The central component of ViTs is the encoding layer, which is where the model learns from the input tokens. It includes multi-head self-attention and feed-forward networks. Self-attention extracts the relationships between different input tokens (Voita et al., 2019). More specifically, the mechanism examines interactions between the tokens (regardless of their distance) to capture any dependencies between them. This process can be performed several times in parallel, and each parallel instance is called a head. Each head focuses on different types of information using different learned linear transformations. Finally, the results obtained from the heads are provided to a feed-forward neural network as extracted features to classify the output (Geva et al., 2020). A feed-forward network is a simple network that connects the nodes of each layer to the next layer. The information flows from the input nodes, through any hidden layers, to the output nodes.
The problem with ViT models is that they have quadratic complexity (i.e., the time required to run the algorithm increases rapidly as the number of tokens increases), which is caused by the self-attention mechanism (Zhang et al., 2024). To address this concern, researchers have replaced self-attention with cross-covariance attention (XCA; Ali et al., 2021). The difference between self-attention and XCA is that the latter operates across feature channels rather than tokens. In regular ViT models, each token must attend to all other tokens, which makes the computation expensive. In XCA, each feature channel attends to the other feature channels, with their interactions computed over all tokens. This transformation reduces the model’s complexity without compromising its accuracy.
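The condensed PyTorch sketch below captures the core of the XCA idea: the attention map has shape dim × dim (channels by channels) rather than tokens by tokens, so the cost no longer grows quadratically with the number of tokens. It simplifies the published module (Ali et al., 2021), omitting details such as multiple heads and local patch interaction layers, and is not the exact code we used.

import torch
import torch.nn as nn
import torch.nn.functional as F

class XCA(nn.Module):
    """Minimal single-head cross-covariance attention."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):                                       # x: (batch, n_tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q.transpose(-2, -1), dim=-1)            # (batch, dim, n_tokens)
        k = F.normalize(k.transpose(-2, -1), dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature     # channel-by-channel map: (batch, dim, dim)
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)    # back to (batch, n_tokens, dim)
        return self.proj(out)

output = XCA(dim=192)(torch.randn(1, 160, 192))                 # 160 tokens, 192 channels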
Pre-training
To speed up convergence and improve the robustness of the models (He et al., 2019; Hendrycks et al., 2019), the current study included a pre-training phase. In pre-training, researchers rely on a separate dataset to help the model learn general features and representations. We first pre-trained the model on an audio dataset called UrbanSound8K, which contains 8,732 labeled audio files from 10 different classes, none of which included vocal stereotypy. Even though the classes differed, pre-training may help the model learn the types of features that can be extracted from the input, reducing training time and improving robustness. During the pre-training procedure, the test set included 20% of the pre-training dataset whereas the remaining pre-training data were used for training. We also tuned the hyperparameters to identify those that produced the highest accuracy. Table 2 presents the values of the best hyperparameters, which were subsequently used for training (see below). The model was implemented with the PyTorch framework (version 2.2; Paszke et al., 2019).
Table 2. Hyperparameter Values of the Algorithm.
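In schematic form, pre-training and the subsequent fine-tuning followed the standard supervised loop sketched below in PyTorch. The data loaders, optimizer, and head-replacement line are illustrative placeholders rather than excerpts from our code; the hyperparameter values actually used appear in Table 2.

import torch
import torch.nn as nn

def run_epoch(model, loader, optimizer, loss_fn, device="cpu"):
    """One training pass over a labeled dataset of (features, labels) batches."""
    model.train()
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()

# 1. Pre-train the XCiT model on UrbanSound8K (10 classes, 80/20 train/test split).
# 2. Replace the classification head with a binary output (stereotypy vs. none)
#    and fine-tune on the stereotypy recordings, e.g. (attribute name illustrative):
# model.head = nn.Linear(model.head.in_features, 2)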
Procedures
Similarly to Dufour et al. (2020), we divided our study into three analyses: between-participant, within-participant, and hybrid. All analyses used the same pre-trained model as a starting point, but varied in the cross-validation methodology used for training, validation, and testing. Each machine learning model was trained to detect stereotypy on a second-by-second basis using a binary approach: vocal stereotypy present (1) or no vocal stereotypy (0). The detailed procedures are explained in the sections below.
Between-Participant Analysis
The between-participant analysis aims to determine whether the model can identify the duration of vocal stereotypy in a participant that the model has never seen before (i.e., generalization). This analysis is the most challenging because participants typically engage in different topographies of vocal stereotypy. We trained the models on seven participants while keeping one participant for testing the model. Hence, the analysis resulted in eight outcomes (one per participant). Our analyses involved a k-fold cross-validation procedure during the training process, with one participant in the validation set and six participants in the training set (the eighth participant was in the testing set). The validation set allowed us to evaluate how well the model performed on data it had not learned from during training. As indicated in Table 2, the maximum number of epochs (iterations) was 25. At each epoch, our analyses checked the kappa value (see Outcomes) on the validation set and kept only the best model (i.e., the one that yielded the highest kappa value during validation). To measure generalization, the models were tested on the participant that had not been included in the training and validation sets. Each participant was included in the test set once. For each test set (i.e., each participant), the model was trained seven times with a different participant in the validation set, so our results present the mean outcomes for each participant.
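The Python sketch below summarizes this protocol. The participant identifiers and the train_model and evaluate callables are placeholders standing in for our training and testing routines, not code from our repository.

from statistics import mean
from typing import Callable, Sequence

def between_participant_cv(participants: Sequence[str],
                           train_model: Callable,
                           evaluate: Callable) -> dict:
    """Each participant serves once as the generalization test set; each remaining
    participant serves once as the validation set, and the resulting test
    outcomes are averaged."""
    results = {}
    for test_p in participants:
        scores = []
        for val_p in participants:
            if val_p == test_p:
                continue
            train_ps = [p for p in participants if p not in (test_p, val_p)]
            model = train_model(train_ps, val_p)    # keep the epoch with the best validation kappa
            scores.append(evaluate(model, test_p))  # test on the untrained participant
        results[test_p] = mean(scores)              # mean over the seven runs
    return results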
Within-Participant Analysis
Our second analysis focused on a single participant at a time. The purpose was to train the model on several sessions from the participant and identify the duration of vocal stereotypy in a new session that the model had never learned from. During training, we held out one session as the test set and used the remaining sessions for the training and validation sets. The analysis involved four folds (i.e., the data were divided into four subsets): one fold served as the validation set and three folds were used for the training set. Because Matt had a small number of sessions compared to the others, his data were split into three folds instead. As before, we only kept the model from the epoch that produced the highest kappa on the validation set. Each session was in the test set once. Our code also shuffled the sessions and repeated the cross-validation twice per test set to ensure the model was not biased toward any of the folds. As the analysis was repeated twice, the results present the mean outcomes for each session.
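A comparable sketch for the within-participant splits follows. The session identifiers and helper callables are again placeholders, the fold rotation shown (each fold serving once as the validation set) is one plausible reading of the procedure, and Matt’s data would use three folds rather than four.

import random
from statistics import mean
from typing import Callable, Sequence

def within_participant_cv(sessions: Sequence[str],
                          train_model: Callable,
                          evaluate: Callable,
                          n_folds: int = 4,
                          repeats: int = 2) -> dict:
    """Hold out each session once for testing; run k-fold cross-validation on the
    remaining sessions, shuffled and repeated twice."""
    results = {}
    for test_session in sessions:
        rest = [s for s in sessions if s != test_session]
        scores = []
        for _ in range(repeats):
            random.shuffle(rest)
            folds = [rest[i::n_folds] for i in range(n_folds)]
            for val_fold in folds:
                train_fold = [s for s in rest if s not in val_fold]
                model = train_model(train_fold, val_fold)    # best-kappa epoch kept
                scores.append(evaluate(model, test_session))
        results[test_session] = mean(scores)                 # mean outcome per session
    return results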
Hybrid Analysis
The third analysis combined the between- and within-participant approaches. As in the within-participant analysis, we held out one session as the test set and used the remaining sessions of that participant for the training and validation sets. The difference is that we also added data from the other seven participants to the training and validation sets. To balance the dataset, our code randomly selected the data from the other participants so that the amounts of between- and within-participant data were the same. Our cross-validation involved four folds with two repetitions. In this analysis, one fold was used for validation and three folds for training. The other procedures remained consistent with the within-participant analysis.
Outcomes
To compare our results to Dufour et al. (2020), our code computed the same outcome measures on the test set: accuracy, the kappa statistic, and the session-by-session Pearson correlation. As the rank order of the sessions is important when analyzing single-case graphs, we also included the Spearman correlation. Our code and models are available at: https://doi.org/10.17605/OSF.IO/J6D3Z.
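For reference, these outcomes can be computed from the second-by-second labels and the per-session percentages as in the sketch below. The scikit-learn and SciPy functions are shown for illustration and are assumptions about tooling, not necessarily the implementations in our repository.

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

def second_level_metrics(human_labels, model_labels):
    """Accuracy and kappa on the second-by-second binary labels of a test set."""
    return (accuracy_score(human_labels, model_labels),
            cohen_kappa_score(human_labels, model_labels))

def session_level_correlations(human_pct, model_pct):
    """Pearson and Spearman correlations between per-session percentages of stereotypy."""
    return pearsonr(human_pct, model_pct)[0], spearmanr(human_pct, model_pct)[0]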
Results
Table 3 compares the mean outcomes (i.e., accuracy, kappa statistic, correlation) of the between-participant analyses for our models and those produced by Dufour et al. (2020). For the kappa statistic, the current study outperformed the previous study for all participants, and the values were near or above 0.50 for six of eight participants. Additionally, all models produced correlations near or above 0.90, with the exception of Dave. In contrast, only three of eight participants were near or above 0.90 in Dufour et al. (2020). Figure 3 shows that the percentages detected by each of our models closely matched those recorded by the human observer. Notably, the participant with the lowest kappa value, Nate, still had a very high Pearson correlation. Conversely, Dave had the lowest Pearson correlation despite having a kappa value higher than average. These results suggest that the models may differ in their identification of molecular and molar patterns of behavior across participants.
Table 3. Between-Participant Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 3. Between-participant analysis: correlation between the percentages measured by machine learning and those measured by the human observer across all sessions for each participant.
Table 4 shows the same outcomes for the within-participant analyses. Our models outperformed those reported by Dufour et al. (2020) on all metrics for each participant. The models produced kappa values near or above 0.70 for four of eight participants, and the Pearson correlation values were equal to or greater than 0.95 for all participants. Three of the four participants with the lowest kappa values (i.e., Billy Peter, Dan, and Nate) were the same as in the between-participant analysis. In contrast, Alia’s kappa value improved considerably from 0.49 to 0.80, suggesting that her form of vocal stereotypy was consistent across sessions but differed from those of other participants. Figure 4 presents the correspondence between the machine and human measures in a graphical format for the within-participant analysis. The points remained close to the linear regression line, indicating that both measures followed similar patterns. A comparison with the between-participant analysis shows that the within-participant analysis produced greater improvements in the detection of molecular patterns (i.e., accuracy and kappa) than of molar patterns (i.e., correlations).
Table 4. Within-Participant Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 4. Within-participant analysis: correlation between the percentages measured by machine learning and those measured by the human observer across sessions for each participant.
For the hybrid analysis, Table 5 shows that the new models produced better outcomes than those reported by Dufour et al. (2020). The kappa value was above 0.50 for all participants, and the Pearson correlations were all above 0.90. The participants with the lowest accuracy and kappa values remained the same as in the prior analysis: Billy Peter, Dan, and Nate. Yet again, the molecular outcomes did not systematically match the molar outcomes. For example, Matt had one of the lowest accuracy values but still showed among the highest correlations. Figure 5 shows how the values predicted by machine learning followed patterns consistent with those produced by a human observer. A comparison with the prior analyses indicates that the hybrid analysis generally produced improvements over the between-participant analysis, especially for accuracy and kappa values. Conversely, correlations were generally higher for the within-participant analysis, but the results were mixed for kappa and accuracy. Taken together, these results suggest that the within-participant and hybrid analyses produce more accurate outcomes than the between-participant analysis.
Table 5. Hybrid Analysis: Mean Accuracy, Mean Kappa, and Correlation Comparison Between the Models of the Current Study and Those Developed by Dufour et al. (2020).
Figure 5. Hybrid analysis: correlation between the percentages measured by machine learning and those measured by the human observer across sessions for each participant.
Discussion
Despite using the same data for training, our new models outperformed those produced by Dufour et al. (2020) on all metrics. Moreover, the kappa values were lowest for the between-participant models, which was consistent with expectations and prior research. Hybrid and within-participant models typically produced better results, but they require more effort and expertise to train in practice and research. The accuracy of our models also compared favorably to those of studies detecting vocal stereotypy in short time intervals (Khan et al., 2023; Min & Fetzner, 2018, 2019). However, our models allow for a more comprehensive measurement of vocal stereotypy as they can detect duration on a second-by-second basis.
There are several potential explanations for why the current models produced more accurate results than those previously reported by Dufour et al. (2020). First, the inclusion of more features may have provided a richer data representation and allowed the algorithm to model the data more comprehensively. Second, the XCiT model may learn patterns and complex relationships across features more readily than the simple network used in the prior study. Third, we tuned the number of epochs and pre-trained the models in the current study, which may have further improved the outcomes.
Interestingly, nearly all our models produced correlations similar to those between continuous and discontinuous methods of measurements (Leblanc et al., 2020). That is, Leblanc et al. (2020) reported correlation coefficients between continuous and discontinuous measurements within the range of those observed in the current study. Thus, the errors of measurement produced by our models are similar to those accepted by practitioners and researchers who use discontinuous methods. Another relevant observation is that kappa values in our study were all highly correlated with those reported in Dufour et al. (2020). Put differently, the participants with the highest kappa values in Dufour et al. were also those that had the highest kappa values in our study (see Emile). Conversely, our models typically performed worst on the same participants (see Dan).
The prior correlations suggest that factors beyond algorithm selection and training may make the measurement of vocal stereotypy more challenging. For example, the models produced poor agreement for Alia in the between-participant analysis compared to the within-participant and hybrid analyses. The characteristics of her voice could potentially explain these results, as she was the only female in our sample. Hence, only males were used to train her model during the between-participant analysis (as there was no other female). With the exception of Matt (who had strong outcomes despite few sessions), the best kappa values were typically obtained for the participants with the most sessions (i.e., Dave, Emile, and Owen). These three participants also had the lowest proportion of background music during their sessions, which may have facilitated detection. Other potential variables that may have affected detection include the presence of other background noises, the topography of vocal stereotypy, and the relative percentage of the behavior during sessions.
A future direction for research is to test the current model on novel datasets to examine the generalizability of our findings. As each analytical approach has its advantages and disadvantages, more research is needed to identify which would be most beneficial to research and practice. For example, the main advantage of between-participant models is that they do not need to be re-trained for each new individual. A sufficiently well-trained model may apply to any individual exhibiting similar behavior. The problem is that training generalizable models may require data from tens, if not hundreds, of individuals who engage in vocal stereotypy. The current study only included eight participants, making the application of the model premature in practice. As an alternative, the within-participant training method has the advantage of being independently tailored to each individual, requiring only one participant. As single-case methodology is often at the core of behavior analysis, this approach has the added benefit of remaining consistent with the characteristics of the science.
Nevertheless, an issue with the within-participant approach is that the models need to be retrained for each participant, which requires some additional work compared to the between-participant approach. Furthermore, many questions remain unanswered: How many sessions of observation does it take to create an individual-level model? How does the accuracy of a model for an individual improve as more data are added? How many sessions of data collection are needed before accuracy reaches an acceptable threshold? If within-participant models were to be adopted in the future, researchers should examine these questions. The hybrid model may also prove useful as it allows adding data to within-participant models, potentially reducing the number of sessions that need to be scored at the individual level. That said, more research must be conducted to confirm this hypothesis. Regardless of the approach used by researchers, examining the generalizability of our models is crucial prior to their adoption in practice.
While our results are promising, the study has some limitations. All the data originated from the same research team using a specific protocol (see Dufour et al., 2020), which may limit generalization and may have introduced bias. Furthermore, the participants engaged in a limited number of forms of vocal stereotypy. Future research could address this issue by using our models as starting points (i.e., as pre-trained models) to train novel models with more participants and different forms of vocal stereotypy. Another limitation is that the kappa values were not as high as the correlations, which suggests the models may not be well suited to analyzing molecular (i.e., second-by-second) patterns of behavior.
In the current study, a human observer determined whether vocal stereotypy was absent or present during each second. We then compared the measure of the human observer with the one produced by the machine to calculate agreement. In other words, our study examined the concurrent validity of our new measure: to what extent does this new measure (i.e., machine learning) correlate with current practices (i.e., human observation)? To our knowledge, researchers have no way to label the data other than relying on a human observer. Nonetheless, future research could examine other forms of validity to establish the utility of machine learning. For example, machine learning could measure behavior within a single-case design. If the measure accurately captured changes (and the absence of changes) across conditions, such research would provide further evidence for the criterion validity of using machine learning for observation.
Technology that relies on machine learning to detect behavior may transform how researchers and practitioners interact with their participants and clients. As models improve, fully automated systems that measure behavior in real time could become more common in clinics. We are already seeing this automation emerge in research on nonhuman animals (e.g., Isik & Unal, 2023; León et al., 2021; Nath et al., 2019). With clinical populations, such technology could free professionals, caregivers, and teachers to focus more on intervention and less on measurement, potentially improving treatment integrity, social validity, and quality of life. Other areas of research and practice that may be affected by machine learning include treatment selection, assessment, and progress monitoring. One risk related to the development of this technology is that it may only become available to those who can afford it. Consequently, researchers should strive for open access and permissive licensing when developing machine learning models so as to maximize accessibility. In the end, decreasing the amount of effort and resources required for routine procedures will not only benefit practitioners and researchers, but also those who are supported by behavior science.
Ethical Considerations
The project was approved by the Research Ethics Board in Education and Psychology at the Université de Montréal.
Consent to Participate
The legal guardian of each participant provided informed consent.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research project was supported in part by a grant from the Canadian Institutes of Health Research (# 136895).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.