Introduction
Mindfulness-based interventions (MBIs) are increasingly being offered in healthcare, education, and community settings. The seminal program, Mindfulness-Based Stress Reduction (MBSR), developed by Jon Kabat-Zinn at the University of Massachusetts Medical School, has trained almost 1000 MBSR teachers across the United States and in more than 30 countries. There is a growing scientific literature supporting efficacy for conditions such as pain,1,2 stress, and anxiety.3 Related interventions, such as Mindfulness-Based Cognitive Therapy (MBCT), have been shown to be effective for depression.4,5
In contrast to pharmacologic treatments that can readily be manufactured to ensure consistent active ingredients, MBIs are complex, multi-dimensional interventions, making them challenging to implement to agreed standards for practice.6,7 A dearth of ways to evaluate the fidelity of such complex interventions, especially teacher skill, has been an important limitation in the field.6
The MBI:Teaching Assessment Criteria (MBI:TAC) was developed to measure teaching competency and is now used in both MBI research and training settings.8-11 An important logistical challenge with the use of the MBI:TAC in research is that it was developed and validated using video (with audio) recordings, because the developers of the tool considered videos more informative and, therefore, preferable to audio-only samples.8 This preference was based on the premise that a core teaching methodology in MBIs is communication of mindfulness through the teacher's embodied practice, much of which course participants sense through the teacher's body language. Video recordings, however, create important logistical challenges: more complex recording requirements (video camera or smartphone with tripod vs smartphone or small audio recorder), greater intrusiveness in the teaching setting due to the more visible equipment, and greater loss of privacy for participants if their faces appear in the video. In addition, video files for 2-hour classes are large, adding complexity to storing and transferring files when needed for rating purposes. The greater visibility of video recording equipment may also increase Hawthorne effects (the alteration of behavior by the subjects of a study due to their awareness of being observed), and video may introduce implicit biases based on visual impressions that could influence ratings. As MBI delivery is increasingly conducted online, some of the drawbacks of video recordings have been reduced; for example, video-conference platforms make recording sessions easy. Even for programs delivered on a video-conference platform, however, some of the limitations of video recordings remain, including privacy concerns for participants and the resulting large files, which are more difficult to store and transfer securely when shared with evaluators.
Audio recordings may be an important alternative to video recordings for assessing teacher skill in some settings, but the MBI:TAC has not been validated using audio-only recordings. We sought to evaluate the reliability of the MBI:TAC when audio-only recordings were used. Using a mixed-methods approach, we investigated whether the recording format of the MBI sessions influenced the inter-rater reliability of the MBI:TAC and explored MBI:TAC evaluators’ perceptions of rating using audio-only recordings. We hypothesized that inter-rater reliability, as measured by intraclass correlation (ICC) coefficients, would be lower with audio recordings than video recordings, though potentially still adequate for research settings.
Methods
We developed an audio-ratings sub-study within the Predictors of Outcomes in MBSR Participants from Teacher Factors (PrOMPT) trial. This study was reviewed and approved by the Institutional Review Board of University of California, San Francisco. For the PrOMPT-F study, we conducted an 8-week course of 2-hour weekly sessions to train 31 experienced MBI teachers in using the MBI:TAC. The MBI evaluators who conducted MBI:TAC ratings for research purposes had at least 3 years of MBI teaching experience. Trainees were asked to complete weekly homework ratings during the training as well as rate a set of selected video clips at the end of the training. From this pool of newly trained evaluators, we assembled a group of 19 who had both high reliability of ratings compared to benchmark ratings and time available to complete further ratings of video recordings of MBSR teachers. These same evaluators were subsequently invited to participate in the sub-study of MBI:TAC rating using audio recordings. Twelve evaluators agreed to perform ratings for the audio rating study and completed MBI:TAC ratings for at least 1 audio recording.
MBSR Course Recordings
For the main PrOMPT study, 21 teachers recruited from 5 different sites video-recorded themselves teaching MBSR. When using the term “video,” we are designating a video recording that includes an audio track. We used a random number generator to select 2 recordings from each teacher for rating, 1 session from the first 4 weeks of the course and a second random selection of a session from the second 4 weeks. There were 40 MBSR session recordings (2 of each teacher, except for 2 teachers who each had just 1 recorded session). For this sub-study, we used only the audio portion of the video recordings that had previously been rated in the main study.
MBI:TAC Measure and Ratings
The MBI:TAC is used to assess the competence and adherence of MBI teaching practice. Evaluators score each domain on a scale from 1 (incompetent) to 6 (advanced).10 The 6 domains are: (1) coverage, pacing, and organization of session curriculum; (2) relational skills; (3) embodiment of mindfulness; (4) guiding mindfulness practices; (5) conveying course themes through interactive inquiry and didactic teaching; and (6) holding the group learning environment.
Each MBSR audio-recorded session was rated by 3 different evaluators. Evaluators were assigned to audio recordings of teachers who were unknown to them and whom they had not already rated using video recordings.
Quantitative Analysis
For the primary analysis, we calculated intraclass correlation (ICC) coefficients, based on absolute agreement from 2-way random-effects models, to assess inter-rater reliability of the audio ratings. In this context, the ICC measures agreement between ratings made by multiple evaluators assessing the same MBSR teacher, where 0 indicates no agreement and 1.0 indicates perfect agreement, and the evaluators are treated as a random sample from a pool of possible evaluators. We calculated ICCs for the audio ratings in 2 ways for each of the 6 MBI:TAC domains. First, we calculated individual-rater ICC coefficients, which generalize to the case of using a single rater to evaluate a teacher. Second, from the same models, we calculated ICCs for the average rating of the 3 evaluators; these generalize to the case of using a panel of evaluators (eg, a panel of 3 evaluators) and averaging their ratings to derive a final rating. In additional analyses, we calculated ICCs comparing the inter-rater reliability of audio ratings to that of video ratings. We used paired t-tests to assess whether ratings of audio and video recordings of the same teacher differed significantly. We also used a Bland-Altman plot to evaluate the degree of agreement between ratings of audio and video recordings and whether ratings were generally higher or lower with audio than with video.12 Lastly, we evaluated whether experienced MBSR teachers were easier to rate using audio alone than less experienced teachers, using a linear mixed model with the teacher's years of formal practice or of MBSR teaching as the predictor and evaluators' ratings of how hard it was to assess an MBSR teacher with audio alone as the outcome.
The difficulty rating scale ranged from 1 to 5 (higher numbers = harder to rate with audio, lower numbers = easier to rate with audio), with crossed random effects of teacher and rater (a crossed design in which every level of 1 factor co-occurred with every level of the other factor). For this analysis, data from 11 evaluators were used because 1 rater was unable to complete the survey assessing the difficulty or ease of MBI:TAC rating using video vs audio-recorded sessions.
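As an illustration of the ICC computations described above, the following sketch derives the single-rater and average-rater absolute-agreement ICCs (Shrout-Fleiss ICC(2,1) and ICC(2,k)) from a teachers-by-raters matrix of scores. The ratings shown are hypothetical, not study data.

```python
import numpy as np

def icc_absolute_agreement(X):
    """Two-way random-effects, absolute-agreement ICCs.

    X: n_targets x k_raters matrix of ratings.
    Returns (ICC(2,1), ICC(2,k)): single-rater and average-rater ICCs.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)  # per-teacher means
    col_means = X.mean(axis=0)  # per-rater means

    # Two-way ANOVA mean squares
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    msr = ss_rows / (n - 1)                               # targets (teachers)
    msc = ss_cols / (k - 1)                               # raters
    sse = np.sum((X - grand) ** 2) - ss_rows - ss_cols
    mse = sse / ((n - 1) * (k - 1))                       # residual

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Hypothetical ratings: 3 teachers, 2 raters (rater 2 scores 1 point higher)
single, average = icc_absolute_agreement([[1, 2], [2, 3], [4, 5]])
print(round(single, 3), round(average, 3))
```

As the example shows, the average-rater ICC exceeds the single-rater ICC, reflecting the rationale for averaging the ratings of a panel of evaluators.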
Qualitative Analysis
We individually interviewed 8 MBI:TAC evaluators to assess their experience rating sessions using both recording formats. The evaluators we interviewed were a convenience sample based on who was available and willing to be interviewed; they represented two-thirds of the evaluators. Evaluators were not otherwise paid for participating in the interview but received a $30 gift card in appreciation of the time spent being interviewed, as well as monetary compensation for each MBI:TAC audio rating assignment they completed. We used a semi-structured interview guide that included questions on evaluators' overall opinions of the MBI:TAC, what they found to be the easiest and most difficult aspects of rating MBSR sessions using the audio and video recording formats, and how the experience of rating influenced assessors' training and teaching. We conducted the 30-minute interviews in English through a recorded videoconference. Interviews were transcribed verbatim and uploaded to Dedoose (v8.2.14, 2019) for analysis. We conducted qualitative thematic analysis of the interviews using an inductive approach. Two team members (RR and EF) independently coded transcripts and jointly reconciled coding differences.13 The full team met regularly during the coding and analysis process to review coding and to identify and reach consensus on the development of key themes.
Results
Characteristics of MBI:TAC Evaluators and MBSR Teachers.
Quantitative Analysis
Intraclass Correlation Coefficients (ICC) for MBI:TAC Audio Ratings by Domain.
ICCs represent the average of rating 2 MBSR sessions per teacher. Individual ICC refers to ICC if ratings are done by a single evaluator. Average represents the ICC if ratings from 3 evaluators are averaged.
MBI:TAC Audio Ratings Compared to Video Benchmark Ratings.

Bland-Altman Plots of Agreement Between MBI:TAC Ratings Using Audio Recordings and Video Recordings.
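A Bland-Altman comparison of paired ratings can be computed as in the sketch below, which derives the mean difference (bias) and 95% limits of agreement. The paired scores are hypothetical, not study data.

```python
import numpy as np

def bland_altman(audio, video):
    """Return the mean difference (bias) and the 95% limits of agreement
    between paired audio and video ratings of the same teachers."""
    diffs = np.asarray(audio, dtype=float) - np.asarray(video, dtype=float)
    bias = diffs.mean()            # negative: audio rated lower than video
    sd = diffs.std(ddof=1)         # sample SD of the differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical mean MBI:TAC scores per teacher under each format
audio_scores = [3.0, 3.5, 4.0, 4.5]
video_scores = [3.2, 3.8, 4.1, 4.9]
bias, (lower, upper) = bland_altman(audio_scores, video_scores)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A negative bias would indicate that audio ratings tend to run lower than video ratings of the same sessions.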
We next looked at whether it was more challenging to rate less experienced teachers using audio alone, compared with experienced teachers. In a linear mixed model with teacher years of formal practice as the predictor and the difficulty of rating with audio-only recordings as the dependent variable, we found an association with years of formal meditation experience: for each additional 10 years of experience, rated audio difficulty decreased by −.16 on the scale (95% CI: −.33 to −.002).
[Figure panels: Difficulty of rating with audio recording alone by (a) years of meditation practice of the teacher being rated, (b) years of mindfulness-based intervention teaching experience of the teacher being rated, (c) years of meditation practice of the evaluator making the rating, and (d) years of mindfulness-based intervention teaching experience of the evaluator making the rating.]



Qualitative Analysis
In analyzing interviews with 8 evaluators to explore the experience of using audio-only recordings for rating using the MBI:TAC, we identified 3 themes: (1) video recordings were particularly helpful when rating less skillful teachers, (2) video recordings tended to provide a more complete picture for rating, and (3) audio rating had some positive features.
Video Recordings Were Particularly Helpful When Rating Less Skillful Teachers
Many evaluators felt that rating less competent teachers using audio alone was more difficult than using video: The second teacher [review] that I did, with audio alone, really was challenging and I didn’t experience that teacher as an experienced teacher. I would have really liked to have seen them in action, because I feel like there’s a lot of information available in the body that I didn’t have access to. And just their language, it didn’t sit well with me…. It was a challenging rating experience. (Interview 2, with female living in the United States with 11 years of mindfulness teaching experience)
Most evaluators felt that the visual component was less important for reviewing more advanced teachers, because they could measure the teachers’ embodiment of mindfulness through the sound of their voices and get a sense of the teachers’ “presence” through the audio recording. Likewise, interpersonal dynamics between the group and teacher could be noted via audio recordings, while visual information was less necessary to develop a clear sense of the interaction with advanced teachers.
Video Recordings Tended to Provide a More Complete Picture for Rating
While evaluators had varying opinions regarding how significant visual data were during the MBI:TAC rating process, all 8 interviewees acknowledged that video added more sensory information than audio-only. Six out of 8 noted that completing the MBI:TAC ratings using the audio format was more difficult than the video due to the lack of visual information. Some interviewees (3 out of 8) mentioned that to get the most accurate rating, video recordings should be used, since “everything is helpful” when optimizing accuracy (Interview 4, with female living in Spain with 7 years of mindfulness teaching experience). Another compared the visual information to additional pieces in a jigsaw puzzle: “I think it offers a complete picture, if you like. It’s like a jigsaw with many different parts, and then to get an overall sense, needing to see the detail of the pieces.” (Interview 6, with female living in the United Kingdom with 10 years of mindfulness teaching experience)
This same interviewee thought that the lack of video often left her questioning her final score: I think there’s something in the fullness of being able to see and hear that helps to bring clarity as to which side of the line they may be on. With just the audio it was quite hard, because I felt like there was quite a lot of borderline.… It was like I needed more information to feel really sure [of] where I was placing people. (Interview 6)
As noted in the quantitative analysis, average audio ratings tended to be lower than video ratings. Without being aware of these data, this possibility was mentioned by some evaluators who hypothesized that they scored teachers lower when they lacked visual information: “I might have graded higher if I could have seen the person and saw embodiment, for example, rather than just felt it.” (Interview 1, with female living in the United States with 22 years of mindfulness teaching experience)
Some evaluators noted that the lack of visual information made the interpersonal relationships seem flat. Most evaluators described how the visual component created a more complete understanding of interpersonal relationships, class organization, visual displays, and the group mindfulness practices. “There’s so much of communication that’s physical, not words, and you miss that whole piece. So, was that teacher leaning forward? Were they leaning back? Did their face look like they were interested? Did the laughter look like it was uncomfortable laughter or like it was natural?” (Interview 1).
Audio Rating Had Some Positive Features
A few of the interviewees said that visual information was distracting in some cases or could bias or unnecessarily influence the rater: “How old somebody is, or their clothing, or whatever…. I think the video is more likely, for myself, to produce more snap judgments” (Interview 8, with male living in the United States with 12 years of mindfulness teaching experience).
A couple of evaluators noted that greater MBI:TAC rating experience with a particular recording format was likely to be a more important factor in increasing accuracy than the specific recording format used, although they still acknowledged that using video was easier in some cases.
A few interviewees noted other positive aspects of using the audio recordings instead of videos. For example, 1 explained that the audio recording may actually force the rater to be more present and really listen to what is being said.
Additional Qualitative Data Findings
The 4 MBI:TAC domains most frequently mentioned as more difficult to rate via audio than video were relational skills (domain 2), embodiment of mindfulness (domain 3), interactive inquiry (domain 5), and holding the group learning environment (domain 6). Evaluators who assessed MBI sessions in their second language felt that the video format provided additional information for language comprehension, though they did not see this as an important barrier to using audio-only recordings for ratings. One of these evaluators reported that although she was initially worried about the quality of audio ratings given the language difference, she found that rating with audio was not as difficult as she had expected.
Discussion
We found evidence that using a single evaluator with audio recordings to perform an MBI:TAC rating generally resulted in low ICCs. However, when a panel of 3 evaluators was used and their ratings were averaged, ICCs were above .5, indicating relatively good inter-rater reliability. An analogy for the difference between individual and panel ratings is the way ice-skating performances are scored by trained judges: a single judge's score would have low inter-rater reliability, so a panel of judges is used instead and their ratings are combined, providing better inter-rater reliability for the score of the performance. Our findings suggest that use of the MBI:TAC with audio recordings is feasible, but averaging more than 1 rating is desirable for good inter-rater reliability. Although we did not directly assess the use of 2 evaluators per teacher, ICCs would be expected to fall between the single-rater and 3-rater results.
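The expectation that a 2-evaluator panel would fall between the single-rater and 3-rater results can be illustrated with the Spearman-Brown prophecy formula, which projects the reliability of an average of k raters from a single-rater reliability. This is only an approximation in the present context (the formula strictly applies to consistency-type reliability), and the single-rater ICC used below is illustrative, not a value from this study.

```python
def spearman_brown(icc_single, k):
    """Projected reliability of the mean of k raters, given a
    single-rater ICC (Spearman-Brown prophecy formula)."""
    return k * icc_single / (1 + (k - 1) * icc_single)

icc1 = 0.30                       # illustrative single-rater ICC
panel2 = spearman_brown(icc1, 2)  # projected 2-rater panel reliability
panel3 = spearman_brown(icc1, 3)  # projected 3-rater panel reliability
print(panel2, panel3)
```

With these illustrative numbers, averaging 2 raters lifts reliability to about .46 and averaging 3 raters to about .56, so the 2-rater value sits between the single-rater and 3-rater figures, as the text anticipates.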
Overall, ICCs of audio recordings were lower than ratings of video recordings. Ratings of the same teachers using video recordings had ICCs in the .6-.8 range using an average of multiple evaluators. While the differences in average scores on the MBI:TAC between ratings of video and audio recordings were modest, we found a fairly consistent trend toward lower ratings with audio recordings. This was consistent with the views expressed by some evaluators in interviews, several of whom had concerns that they might be scoring teachers lower without the additional information from the video recordings.
Bland-Altman plots provided evidence that ratings converged more closely for teachers who received higher scores on the MBI:TAC. We also found that teachers’ years of mindfulness practice was correlated with increased ease in rating their audio-recorded sessions. These quantitative findings were consistent with qualitative data from interviews, in which several evaluators reported that they felt the video information was particularly important when rating less experienced teachers. Taken together, these findings suggest that use of audio-only recordings for MBI:TAC ratings may be most appropriate when rating experienced teachers, for example, in the context of research studies. On the other hand, using audio recordings may be more problematic when rating teachers-in-training. The overall tendency for ratings from audio recordings to be slightly lower might be best considered in the context of how ratings from audio recordings might be used. For example, in teacher training this might mean adjusting feedback for what might be expected to be slightly lower scores when using audio recordings. When comparing ratings for teachers between research studies that used different recording media (video or audio), our findings provide some guide to adjustments that might be made to assess whether MBI:TAC scores were similar.
While the embodiment domain had the poorest ICC in this audio sub-study, it also had the lowest level of inter-rater agreement during the initial development of the MBI:TAC, when evaluators compared rating MBI sessions using video recordings to live observation.9 Crane et al9 found that embodiment was the most challenging domain to articulate and the most open to interpretation. Our findings further support this original observation, as embodiment of mindfulness was the most difficult domain to rate reliably using the audio-only recording format. However, while the interviewees identified the domains that had the lowest ICCs, such as embodiment of mindfulness, their order of difficulty was not identical to the ICC findings. For example, the ICC associated with relational skills was much higher than the evaluators hypothesized in the qualitative interviews. This observation highlights the possibility that there may have been sufficient information in the audio recordings to evaluate interpersonal abilities, even if the evaluators found the process more difficult.
In other research assessing the optimal means of recording medical group sessions for evaluation, findings have been variable. Some studies have found that the process of rating such sessions differs for certain scales when sessions are recorded using the audio vs video format, while others have not.14-19 Most studies exploring this topic found a non-significant difference in clinical ratings between audio-recorded and video-recorded clinical encounters.16-18 One study even favored audio-recorded sessions over video, noting that the visual information increased rating time and complexity when assessing communication between oncology patients and their physicians using the Cancode interaction system,15 and that intra-rater reliability scores were similar between recording formats.15 However, among these studies, a few aspects of the patient-provider relationship and communication, namely confrontation among empathic communication16 and patronizing tone,18 were rated differently depending on the recording format.
There were several limitations of this study. The MBSR teachers who were evaluated were predominantly rated within the upper 50% competency level; our data are thus less informative about MBI:TAC assessments of MBI teachers with limited experience. In addition, the number of teachers we evaluated was not large. Additional research may help to further define ICC values when using the MBI:TAC with audio recordings.
In summary, results from this pilot project suggest that audio recordings are adequate for assessing MBI teacher competency for research purposes. Video recording appears to be optimal, when feasible, particularly when using the MBI:TAC for teacher training purposes.
