Abstract
Those with a watch know the time. Those with two are never sure.
Introduction
Music is a complex art form, involving interactions between rhythm, melody, harmony, texture, timbre, and dynamics. Although analyses of basic musical building blocks (e.g., isolated notes and chords) have objective answers (i.e., are easily verifiable), the sophisticated interplay of these elements and interpretive flexibility mean that many analyses lack verifiable answers, even when addressing fundamental concepts such as chord function:
The first chord of Wagner’s Tristan Prelude—F B D♯ G♯—is notoriously resistant to analysis, or at least seemingly impervious to consensus among analysts . . . More than a hundred years of debate have done little to diminish its capacity to fascinate, and to vex, music theorists. (Martin, 2008, p. 6)
Although extreme, this is not an isolated example. Prominent works by J.S. Bach (Larson, 1997), Schoenberg (1983), and Beethoven (Byros, 2012) have elicited heated debates over the form and function of compositional elements. Traditionally, such arguments are treated as differences of opinion, which are not uncommon in pursuits lacking verifiable answers. However, research into the process of judgment itself suggests this might not tell the full story.
In their book
The importance of reducing noise in judgment
Noise is particularly problematic in high-stakes judgments such as judicial sentencing. For example, disparities between judges can be “so pronounced that a defendant sentenced to three years by one judge would have been sentenced to twenty years had he been assigned to another” (C. S. Yang, 2014, p. 1). Such observations have long been of concern to judges themselves (Frankel, 1973), sparking significant interest in noise reduction within the criminal justice system (Clancy et al., 1981). Judicial sentencing is one of the prime examples referenced by Kahneman et al. (2021) in their book-length review of noise in judgment across many domains (a full summary of which is beyond the scope of this article). Consequently, we focus here only upon the details most pertinent to music analysis, which we believe are of great relevance given that “wherever there is judgment there is noise, and more of it than you think” (Kahneman et al., 2021, p. 12).
It is crucial to acknowledge one important distinction between judgments in music analysis and those in fields more commonly discussed in noise-reduction research. Variability in sentences issued by judges (e.g., lenient vs harsh) is always undesirable, as punishment should reflect the severity of the crime rather than the severity of the judge. However, variability between music analysts is not, as it is a natural consequence of differences in perspective. Yet not all variability is desirable in music analysis. For example, the musical properties of a piece should change neither as a function of the time of day of evaluation, nor the order of evaluation, nor the paper upon which it is printed (all factors shown to erroneously affect judgments in previous research). Therefore, in the context of music analysis, only undesirable variability meets the traditional definition of noise.
Kahneman et al. (2021) extensively detail different types of noise and the sophisticated statistical approaches that are useful for detecting them. In the interest of space, here we simply differentiate variability that is undesirable (errors/oversights) from variability that is not. We will refer to undesirable variability in judgments of music as Type B disagreement, referring to disagreement resolved after considering other perspectives and correcting self-identified errors. We contrast this with Type A disagreement, defined simply as disagreement remaining after resolving Type B disagreement.
Although the implications of this framework will be discussed later, we introduce the terms themselves here as they play an important role in guiding our approach to applying principles of
Reducing noise in judgment: Approaches to decision hygiene
The average of many evaluations is typically more accurate than any individual evaluation (Surowiecki, 2005), a phenomenon long referred to as
The wisdom-of-crowds concept relies upon independent assessments, reducing the risk of groupthink—the suppression of dissent and of alternative views—which can be a major problem in decision making (Janis, 1972). For example, an esteemed professor’s analysis of a piece may encourage a group of graduate students to seek confirmatory evidence, causing the group to overlook alternative approaches and reach artificial consensus. However, full independence of evaluators removes the opportunity to recognize overlooked information and/or gain new insight from other evaluators. This presents a paradox: group discussions can be both helpful and harmful. For this reason, harnessing the benefits of collective decision making requires careful consideration of how information is shared.
The Delphi technique (Dalkey, 1969) is a widely used procedure balancing the costs and benefits of information sharing. In this approach, participants submit independent estimates to a moderator who collates the responses and shares them anonymously. Each participant reviews and discusses this information before (optionally) re-visiting their estimates. The anonymity of responses and re-evaluations helps to mitigate the harmful effects of groupthink (Kahneman et al., 1998). Although we base key aspects of our approach on the Delphi method, we recognize two challenges associated with applying it directly, as decision hygiene in music analysis presents a special case of noise reduction.
The first challenge of using a traditional Delphi approach to music analysis is that simply averaging across ratings assumes all variability is noise (i.e., undesirable). Although true in the case of judicial sentencing, this is not the case for music analysis, where decision hygiene should remove Type B disagreement while preserving Type A disagreement. The second challenge is that simply averaging independent ratings offers no insight into the different perspectives of raters. In principle, deviations in ratings based upon a shared and clearly articulated rubric could be considered noise. However, any process of decision hygiene presuming a single perspective would make assumptions not shared by music theorists. Therefore, in addition to removing Type B disagreement, we felt it important for our approach to clarify perspectives. This goal requires more than simply averaging independent ratings. Keeping these two points in mind, we developed a variant on the traditional Delphi technique, which we call
Previous interest in multiple perspectives in music analysis
In the domain of music theory, interest in understanding different perspectives is certainly not unprecedented. For example, the
Although the analysis symposia and book by Bergé et al. (2009) illustrate an interest in the role of multiple perspectives in music analysis, these particular approaches have not been widely embraced. According to the tables of contents in JMT, the symposia took place from 1966 to 1974 yet have now “all but ceased” (Goldenberg, 2006, p. 50). One reason for this lack of interest can be found in the music theorist David Damschroder’s observation that although Bergé et al.’s collection offers some insight into multiple perspectives, “their authors proceed seemingly unaware of the book’s remaining contents . . . Several of the essays offer critiques of analyses already in print, but little attempt was made to draw connections among those included in the collected work” (Damschroder, 2010, p. 1).
Clarifying our goals
Here we explore the application of techniques for noise reduction profitably employed in fields other than music analysis, which balance independent assessments with collaborative discussions (in contrast with the siloed approach used in the JMT analysis symposia). To the best of our knowledge, this is the first such application of decision hygiene to music analysis, complementing recent applications to music on issues ranging from performance evaluation (Passarotto et al., 2023), to the development of tools for assessing musicians’ health literacy (Baadjou et al., 2019) and fatigue management (McCrary & Altenmüller, 2020).
We see our approach to collective music analysis as paradoxically both a natural outgrowth of, and an abrupt departure from, traditional analytical approaches. On the one hand, it builds on, and addresses some of the shortcomings of, previous efforts such as the JMT analysis symposia and the essays published by Bergé et al. (2009). On the other hand, it also represents a departure from normative approaches in a field prioritizing single-author analyses, often of single pieces. As such, it is important to acknowledge previous critiques of such attempts.
Some have previously argued that using scientific approaches to human questions risks “ . . . vaporizing [participants’] . . . sense of their personhood by treating them as instances of an impersonal rule” (Klausner, 1970, p. 101). This long-standing concern should not be taken lightly. We do not think analyses would be improved if theorists always agreed (i.e., achieving a Type A disagreement of 0). In fact, reducing variability by simply ignoring the perspectives of different analysts would be antithetical to the entire purpose of music analysis, which is to gain a deeper understanding of musical structure—often requiring perspectives that diverge from traditional thinking.
Differentiating between desirable variability and undesirable variability is crucial, as noise can profoundly affect the outcomes of a study. For example, the addition of noisy data to an analysis can affect conclusions about statistical significance, even if the average effect changes only minimally (Andrade, 2013). To overcome this challenge, we developed AMP with the goal of retaining variability arising from true differences in perspective, while minimizing variability arising from noise (i.e., analytical oversights). We believe approaches to decision hygiene such as the Delphi method can reduce Type B disagreement, which ultimately enhances the clarity of individual perspectives (i.e., Type A disagreement). To the best of our knowledge, the distinction between Type A and Type B disagreement has been discussed neither in the context of music analysis, nor in the broader noise reduction literature. 2 However, we believe it is crucial for applying decision-hygiene research to contexts where understanding different perspectives can be as important as the conclusions following from them.
Applying decision hygiene to music analysis
As assessing unverifiable judgments is challenging, their improvement requires focusing not on the outcome of a judgment, but rather on the process by which it is made (Kahneman et al., 2019). Common recommendations include: (1) breaking complex decisions into smaller parts and avoiding discussing or finalizing them until all the parts have been considered; (2) managing the flow of information such that each evaluator makes independent judgments before taking part in group discussion (minimizing groupthink); (3) aggregating across multiple perspectives; and (4) basing all evaluations on the same, objectively defined scale or rubric, or using forced ranking to ensure all participants use the same criteria.
Although the first two recommendations above can be applied directly, the latter two are more challenging for any endeavor involving the assessment of artistic works. The challenge posed by recommendation (3) can be resolved by simply retaining individual ratings alongside the aggregate; recommendation (4) is more problematic, as different analysts might reasonably focus on different aspects of the same passage, making it undesirable and/or counterproductive to require a common scale. Although we believe mitigating Type B disagreement is always beneficial, preserving disagreement related to the use of different approaches (Type A disagreement) is crucial in the context of music analysis.
Analysis from multiple perspectives (AMP): Proof of concept
Inspired by research demonstrating how noise reduction techniques improve judgments of unverifiable properties, we developed AMP in the hope that it could be applied to many topics in music analysis (e.g., locating modulations, analyzing phrase structure, and/or identifying thematic material). Therefore, we see this article as a proof-of-concept regarding AMP, rather than a study of a specific musical property per se. Nonetheless, as we wanted our topic to be of broad relevance, we chose to focus on
Music theorists have previously observed that we hear the diatonic modes “as alterations of these more familiar [major and minor] scales” (Clendinning & Marvin, 2016), given that the Lydian and Mixolydian modes are “major-like,” with raised fourth and lowered seventh scale degrees, respectively, and that the Dorian and Phrygian modes are “minor-like” (Schoenberg & Stein, 1969). Furthermore, participants evaluating melodies in an experiment reported, unsurprisingly, that those in Ionian (major) scales are perceived as happier than those in Aeolian (minor), yet those in Phrygian sound “even sadder” than those in Aeolian (Temperley & Tan, 2012). This work illustrates that treating modality as a continuum has precedent in the fields of music theory and music cognition.
In addition to theoretical interest from music analysts, data on relative mode have practical implications for other fields. For example, music information retrieval software such as MIRtoolbox offers an estimation through
When making judgments of relative mode, it is important to acknowledge that although classifying isolated scales or chords as major or minor is straightforward, identifying mode in complete passages is more challenging, and requires a degree of judgment. Differences of opinion on the importance of leading tones, cadences, melodies, and harmonies inevitably lead to divergent perspectives. Nonetheless, evaluation of relative mode is not merely a matter of taste, as some judgments are simply incorrect (e.g., identifying a passage with only major chords as strongly minor). As such, judgments of relative mode reflect the concept of
Although in principle a common rubric can reduce disagreement, here participants crafted their own rubrics throughout the procedure, documenting the musical events and/or devices informing their mode ratings. We asked participants to craft rubrics such that they sorted excerpts into three categories of major and minor ratings (six categories in total), with roughly equal numbers in each category. We also asked them to identify one category of ambiguous excerpts (described in the Method section).
We chose to use personal rubrics as they (1) bolster consistency in individual participants’ ratings, (2) allow participants to refine their perspectives individually (mitigating risks of groupthink) over the course of the project, and (3) offer useful information on how/why participants might systematically disagree, providing unique insights articulated clearly for future reference. We acknowledge that this breaks markedly from traditional noise-reduction approaches, where common rubrics are recommended. However, in those situations individual perspectives are generally undesirable (e.g., disparities in criminal sentencing), whereas in music analysis, individual differences in perspective are not problematic, but in many cases desirable.
Method
Participants
We recruited five (two male, three female) graduate music students from Western University as participants, each of whom received an honorarium of $800. All five had advanced musical training, holding degrees in subjects ranging from music theory to music performance and musicology. Two of the five participants had advanced training in piano, with one completing a Doctor of Musical Arts in piano performance during the study (see Supplemental Appendix E for participants’ biographies). Consistent with past practice in the Don Wright Faculty of Music, we viewed these raters as research assistants rather than participants, as they were recruited through a job posting advertised to graduate music students already employed in the Faculty and compensated accordingly. However, as they participated in the project as raters, we will henceforth refer to them as participants.
Materials
We gathered scores and recordings of 16 sets of pieces (henceforth
Score and recording details.
Recordings
As most sets are widely performed and recorded, we followed a previously developed procedure for selecting a prominent performer. This involved identifying the five performances of each complete set that appear most often in the Naxos Music Library (NAXOS; Heymann, 2019) and Classical Music Archives (CMA; Schwob, 2019), then selecting the highest-ranking performer appearing in both lists (for further details see Kelly et al., 2021). This identified the album of choice for 13 of 16 sets, but not for those by Hummel, Guillet, or Fischer, as these three had no identifiable commercial piano recordings.
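This selection rule can be sketched in a few lines. The performer names below are hypothetical placeholders (not the actual NAXOS/CMA data), and we assume that “highest-ranking” refers to rank in the NAXOS list:

```python
def select_performer(naxos_top5, cma_top5):
    """Return the highest-ranking performer appearing in both
    top-five lists (rank taken from the NAXOS ordering), or
    None if the lists share no performer."""
    cma = set(cma_top5)
    for performer in naxos_top5:  # ordered most- to least-frequent
        if performer in cma:
            return performer
    return None

# Hypothetical example:
naxos = ["Performer A", "Performer B", "Performer C", "Performer D", "Performer E"]
cma = ["Performer B", "Performer F", "Performer A", "Performer G", "Performer H"]
print(select_performer(naxos, cma))  # Performer A
```

When the two lists share no performer, as happened for the Hummel, Guillet, and Fischer sets, the rule returns no answer and an alternative source must be found.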
The set by Hummel (Op. 67) can be found on YouTube (Tao, 2012), which we captured using Audacity® (The Audacity Team, 2007) at a 44.1 kHz sample rate. According to the video creator, the performance, which features dynamics and tempo changes, was sequenced in SONAR4 using the EastWest Boesendorfer 290 Virtual Studio Technology (VST) plug-in. To the best of our knowledge there is no recording available of the set by Guillet. We therefore encoded each of the preludes as MIDI files, using the Addictive Keys Studio Grand VST piano and exporting them as .wav files sampled at 44.1 kHz, matching their tempi and dynamics to a MIDI encoding realized by recorder (Aeolian Consort, 2019). Rather than using a commercial recording of the set by Fischer (composed for organ, not piano), we encoded each of the preludes as we had done with those of Guillet, using a piano timbre and aligning the average root mean squared energy and tempo to that of the organist’s recording of the set. We undertook these steps for the sets by Guillet and Fischer to avoid presenting participants with sets performed with timbres clearly different from the piano used in 14 of the 16 sets.
To avoid an abrupt cutoff after the 8-measure excerpts, and following Battcock and Schutz (2019), we faded out the recording of each prelude after eight measures. The fadeout began at the onset of the ninth measure and lasted 2 seconds. We assigned each audio file a 3-digit ID number corresponding to the respective notated excerpt (as discussed below).
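A minimal sketch of such a fadeout, assuming a linear ramp (the fade curve is not specified above) and using a tiny illustrative signal rather than real audio:

```python
def fade_out(samples, fade_start_s, fade_dur_s=2.0, sr=44100):
    """Scale samples by a linear ramp from 1 to 0 beginning at
    fade_start_s (the onset of the ninth measure) and lasting
    fade_dur_s seconds; everything after the ramp is silenced."""
    out = list(samples)
    start = int(fade_start_s * sr)
    n = max(1, int(fade_dur_s * sr))  # ramp length in samples
    for i in range(start, len(out)):
        k = i - start
        gain = max(0.0, 1.0 - k / n)  # 1 at fade start, 0 after fade_dur_s
        out[i] *= gain
    return out

# Illustrative: 2 seconds at sr=4 samples/s, fading from t=0.5 s over 0.5 s
signal = [1.0] * 8
faded = fade_out(signal, fade_start_s=0.5, fade_dur_s=0.5, sr=4)
print(faded)  # [1.0, 1.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
```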
Scores
We produced 9-measure excerpts of the score of each prelude using digital scans of the score and Microsoft PowerPoint, visually fading the ninth measure so that nothing was visible from the 10th measure onward. We removed any information from the score excerpts that could identify the composer. Fifteen of Hummel’s Op. 67 preludes consist of fewer than nine measures, so we included these preludes in full. One of Kalkbrenner’s Op. 20 preludes opens with a long cadenza-like improvisatory section, so we produced an excerpt comprising the next nine measures of the prelude. We assigned each score excerpt an ID number corresponding to its audio file.
Prior to the main analysis, participants analyzed a practice set using materials prepared in the manner described above. This practice set consisted of 24 preludes: 10 by Busoni (1881/1927) (five major and five minor), four by Clementi (1811/1896) (two major and two minor), and 10 by Kabalevsky (1934/1947) (five major and five minor).
Procedure
For each participant we prepared a package consisting of (1) the 381 score excerpts (order randomized uniquely for each participant); (2) instructions (Supplemental Appendix A); (3) links to a shared Google Drive folder with audio files; and (4) links to individual rating sheets on Google Drive. In the instructions we asked participants to rate the mode of the score excerpts (major/minor) in the order presented, referring to the corresponding audio file as necessary. We explicitly requested participants rate the relative mode of the music, rather than how the performance sounds.
For each excerpt, we asked participants to indicate (1) mode on a 7-point Likert-type scale from 1 (
Before distributing materials, we held a group meeting in which we described the goals of the study, provided a timeline, discussed compensation, and explained that the focus was on internal consistency rather than group agreement. We asked participants to aim to distribute their ratings evenly across the three categories of minor (most, middle, and least, corresponding to 1, 2, and 3 on the Likert-type rating scale) and the three categories of major (least, middle, and most, corresponding to 5, 6, and 7). During the meeting we also pointed out that the category of 4 need not necessarily be used as much as the ratings of 1–3 (minor) and 5–7 (major).

We included a column on participants’ response spreadsheets to flag whether an excerpt was atonal (yes/no), so that a rating of 4 could be distinguished as reflecting either (1) an equal balance of major and minor or (2) atonality. We also explained that the ninth measure of excerpts (where applicable) should only be considered when it clarified the preceding measures (e.g., a modulation clarifying a previously ambiguous passage).

Beyond following these general guidelines, we asked participants to use their own musical intuition and discretion, telling them that they should feel free to modify previous ratings as the procedure unfolded and their understanding increased. We also asked them to generate a personal rubric to guide and explain their ratings (see Table 2 for an example and Supplemental Appendix E), containing information sufficient for future analysts to reconstruct their thinking and arrive at similar ratings should they strictly follow the rubric. Participants updated their rubrics throughout the course of the AMP procedure.
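A participant could check the balance of their ratings with a simple tally across the seven scale points (1–3 minor, 4 ambiguous/balanced, 5–7 major); the ratings below are hypothetical:

```python
from collections import Counter

def category_balance(ratings):
    """Tally how many excerpts received each of the seven
    scale points (1-3 minor, 4 ambiguous, 5-7 major)."""
    return Counter(ratings)

# Hypothetical ratings of twelve excerpts by one participant:
ratings = [1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7]
print(sorted(category_balance(ratings).items()))
```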
Sample rating rubric.
In the
In the
At the beginning of each meeting, we reminded participants that the goal was not agreement, but to consider all perspectives. After each meeting we invited participants to update their ratings as necessary, encouraging them to add explanatory notes. Over the course of the 12 meetings, we reviewed the 192 (50.4%) of the 381 excerpts with the most disagreement. Participants changed 251 ratings (13.2% of the 5 × 381 = 1,905 total ratings) during this phase (see Figure 1), reflecting new insights and/or a better understanding of 155 (40.7%) of the 381 excerpts.
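Counts of this kind follow directly from comparing the rating matrices before and after a phase. The small matrices below are hypothetical stand-ins for the actual 381 × 5 data:

```python
def summarize_changes(before, after):
    """Count changed ratings and the number of excerpts with at
    least one change, given two {excerpt_id: [ratings]} matrices."""
    changed_ratings = 0
    changed_excerpts = 0
    total = 0
    for excerpt, old in before.items():
        new = after[excerpt]
        diffs = sum(1 for a, b in zip(old, new) if a != b)
        changed_ratings += diffs
        changed_excerpts += 1 if diffs else 0
        total += len(old)
    return changed_ratings, changed_excerpts, 100 * changed_ratings / total

# Hypothetical ratings from five participants for three excerpts:
t1 = {"001": [1, 2, 2, 1, 3], "002": [6, 6, 7, 5, 6], "003": [4, 4, 4, 4, 4]}
t2 = {"001": [2, 2, 2, 1, 3], "002": [6, 6, 7, 5, 6], "003": [4, 3, 4, 4, 4]}
print(summarize_changes(t1, t2))  # (2, 2, 13.33...)
```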

Stages of study.
In the
After completing the final review phase, we asked participants to attend three further weekly meetings in which they wrote brief vignettes describing 12 excerpts and noting specific structural elements of each. We selected seven of these 12 as category exemplars (Supplemental Appendix B; Figure 2), defined as pieces with unanimous ratings (e.g., all participants assigned Shostakovich’s

The incipit of each of the category exemplars shown in Supplemental Appendix B.
Score incipits of category exemplars
Before the first of the three vignette-writing meetings, we gave participants an example by author JDS (Supplemental Appendix D). After writing a practice vignette, they wrote vignettes individually for an excerpt unanimously rated 4 (Debussy’s
After the third meeting, we asked participants to write five additional vignettes for excerpts receiving a range of ratings (Supplemental Appendix C). These would provide insight into how their ratings had been influenced by different structural elements and thus why they had disagreed. We chose these excerpts by sampling from five levels of disagreement (i.e., the top 1.8%, 5.5%, 8.1%, 14.2%, and 25.5%) while aiming for diversity in composer, historical era, and the nominal mode. Taken together, the vignettes in Supplemental Appendices B and C, and the participants’ biographies in Supplemental Appendix E, suggest how their musical training may have influenced decision making. For example, one participant pursuing a PhD in music theory (P4) emphasized cadences and functional harmonic movement because of their interest in chord-function theory; another with a strong background in jazz (P5) gave more weight to the bass line and less to chromaticism.
Results
The quantitative data (i.e., mode ratings) offer two distinct types of information. The first comprises the means of the five participants’ ratings for all 381 pieces at T3, which can be considered a reference standard (i.e., information assumed to be a reliable best estimate) of relative mode. The second comprises the variability across the five participants’ ratings for all 381 pieces. We see the averages as useful outcomes of this proof-of-concept project, and the analysis of variability as helpful in understanding the effect of AMP itself. Consequently, we address each in turn before considering how review phases affected Type A and Type B disagreement.
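Both types of information can be computed in a few lines. The matrix below is a hypothetical stand-in for the 381 × 5 ratings at T3, and the sample standard deviation is an assumption (the denominator is not specified above):

```python
from statistics import mean, stdev

def reference_and_variability(ratings):
    """For each excerpt, return (mean rating, standard deviation)
    across the five participants' ratings."""
    return {eid: (mean(r), stdev(r)) for eid, r in ratings.items()}

# Hypothetical T3 ratings (five participants per excerpt):
t3 = {"001": [1, 2, 2, 1, 2], "002": [6, 6, 7, 5, 6]}
for eid, (m, s) in reference_and_variability(t3).items():
    print(eid, round(m, 2), round(s, 2))
```

The per-excerpt means serve as the reference standard; the per-excerpt standard deviations track how much the five raters diverge.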
Exploring mean ratings across different timepoints
To assess changes in ratings across timepoints, we conducted a series of Wilcoxon signed-rank tests, calculating test statistics and accompanying
Figure 3 depicts changes in the average ratings of the 381 excerpts between T1 and T3. Average mode ratings increased for 96 excerpts (25.20% of excerpts; T1:

Changes in average ratings of excerpts between T1 and T3.
To evaluate how the AMP procedure affected the average ratings of the 381 excerpts between initial (
The effect of AMP on rating variability at each timepoint
To gain insight into the degree of consensus among participants, we examined the consistency and variability of ratings across timepoints. To assess inter-rater reliability, we computed alpha coefficients for ratings of each excerpt at T1, T2, and T3 using the
We measured the variability in ratings of each excerpt using standard deviations (

Standard deviations of ratings by number of excerpts reviewed.
Analysis of final review stage
Figure 5 summarizes the effect of the final review phase, showing all 175 changes made by participants working to ensure consistency between their rubrics and evaluations. As shown in the summary panel (bottom right), most of these changes entailed adjustments to adjacent categories (e.g., changing a rating of 3 to 2), although we observed a few instances of changes spanning two categories (e.g., changing a 6 to a 4). In total, the five participants changed 30, 30, 58, 24, and 33 ratings, respectively.

Changes in ratings during final review.
We next assessed whether the inclusion of excerpts in the group discussion phase affected changes in the final review phase using a Mann-Whitney test. The test revealed the magnitude of change in the final review phase for the 70 excerpts discussed (
Analyzing types of disagreement
To explore Type A and Type B disagreement using statistical methods, we assumed that participants’ ratings at T3 reflect their best assessments and compared ratings in each phase to the mean rating at T3. Although this is an admittedly imperfect approach, computing deviations from the mean is normative when assessing unverifiable judgments, such as disparities in judicial sentencing or setting insurance premiums (Kahneman et al., 2021). Using this estimate, we quantified disagreement at each of the timepoints depicted in Figure 1 (T1–3) with RMSEs:
RMSE_it = √[(1/5) Σ_j (r_ijt − m_i)²],

where r_ijt is participant j’s rating of excerpt i at timepoint t (j = 1, …, 5), and m_i is the mean rating of excerpt i at T3.
Presumably, T1 ratings contain both Type A disagreement and Type B disagreement (the latter resolved through subsequent discussion and/or reflection). Therefore, assuming the RMSE of ratings at T3 reflects Type A disagreement (which remains after the full AMP procedure), we inferred Type B disagreement by subtracting Type A from total disagreement at a given timepoint. Doing this at each timepoint (T1, T2, T3) offers insight into the amount of Type B disagreement across the procedure. We carried out Wilcoxon signed-rank tests, finding RMSEs of the 381 excerpts decreased significantly between T1 (RMSE = 0.84) and T2 (RMSE = 0.60),
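A minimal sketch of this decomposition, using hypothetical ratings for a single excerpt and the T3 per-excerpt means as the reference standard:

```python
from math import sqrt

def rmse_per_excerpt(ratings_t, reference_means):
    """RMSE of the five participants' ratings of each excerpt
    against that excerpt's mean rating at T3."""
    return {
        eid: sqrt(sum((r - reference_means[eid]) ** 2 for r in rs) / len(rs))
        for eid, rs in ratings_t.items()
    }

# Hypothetical ratings for one excerpt at two timepoints:
t1 = {"001": [1, 3, 2, 4, 2]}
t3 = {"001": [2, 2, 2, 3, 2]}
ref = {eid: sum(rs) / len(rs) for eid, rs in t3.items()}  # T3 means

total_t1 = rmse_per_excerpt(t1, ref)["001"]  # total disagreement at T1
type_a = rmse_per_excerpt(t3, ref)["001"]    # disagreement remaining at T3
type_b = total_t1 - type_a                   # inferred Type B at T1
```

Because the T3 RMSE is taken to represent Type A disagreement, the T1 and T2 RMSEs minus this value estimate how much Type B disagreement each phase of the procedure removed.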
Discussion
Analysis of the average ratings of each of the 381 pieces illustrates a small but statistically significant change in ratings between T1 (following initial review) and T3 (following final review). This presumably represents a shift closer to the true individual perspectives of participants. However, as the accuracy of unverifiable judgments is ultimately unknowable, and spurious variability contributes to noise (Andrade, 2013), the effect of AMP is best seen through changes in the variability of ratings for each piece.
We note a significant reduction in RMSE (from 0.84 to 0.58) and increased consistency (α = .82 to α = .90) from T1 to T3. The reduction of Type B disagreement (disagreement resolved by correcting errors and considering other perspectives) is one of the strengths of AMP: in addition to improving piece-wise averages, it helps clarify individual perspectives (Type A disagreement). Consequently, we now turn our discussion to the specific ways in which this approach to decision hygiene improved participants’ ratings.
Observations on weekly review: Outcomes of group discussions
In reviewing the qualitative comments and reflecting upon deliberations during the group discussion phase, we note four distinct ways in which participants reduced Type B disagreement between T1 and T2. First, they corrected errors (e.g., P1 changed their rating of one excerpt from 4 to 2 having realized they had misread a clef). Second, they had new personal insights (e.g., P5 changed their rating of the excerpt from Bach’s Prelude in F sharp minor from 2 to 3 following discussion). Third, they gained insight from the ratings and explanatory notes of other participants (e.g., P3 changed their rating of the excerpt from Guillet’s Fantasy no. 15 from 4 to 3, writing “several people noted how it sounds a bit more minor than major, despite having a strong tonal center. I came to agree with that”). Fourth, they updated their personal rubrics (e.g., P4 changed their rating of the excerpt from Kapustin’s Prelude in F minor from 1 to 2, saying “[This was] one of the first ones I [rated] so it doesn’t fit my rubric [anymore. I’m] changing [my rating] to a 2 [since] it has a raised sixth, but no raised seventh”).
Many Type B disagreements reflect non-trivial differences. For example, P1 changed their rating of the excerpt from Rachmaninoff’s Prelude in B major from 5 to 6 “after considering a comment about minor chords fitting more easily into major [keys] than major chords fitting into minor [keys].” P3 changed their rating of the excerpt from Kalkbrenner’s Prelude in D-flat major from 5 to 6, noting “valid points about conventional harmonic movement and cadences—something that, according to my own rubric, makes this piece fit into a higher major third [of ratings].” Some Type B disagreement reflects participants learning from one another in the weekly reviews. P5 changed their rating of the excerpt from Guillet’s Fantasy no. 16 from 6 to 4 after being
convinced in our discussions that these modal chorales should be analyzed more in context of the individual lines than in timing of harmonies. Based on that understanding, [it] seems far less major than initially, but much more confusing and ambiguous.
As some degree of consensus may have arisen from discussion (as with the Guillet excerpt above), it is important to consider the role of groupthink. To that end, we aimed to mitigate its harmful effects by (1) asking participants to provide initial ratings prior to any discussion, and (2) having participants make any changes to their ratings privately, after the group discussion. We took these steps based on findings from Kahneman et al. (1998) demonstrating that gathering independent ratings from jury members and aggregating them before group deliberation affords the benefits of considering different perspectives while mitigating the problems of groupthink. In addition, we opened each weekly review meeting by reminding participants that the goal of the discussions was not to find agreement, but rather to consider all perspectives. Finally, we included a third step for reducing groupthink unique to our application of decision hygiene: after the weekly discussion phase, participants completed a final review phase independently, with the goal of ensuring their ratings aligned with their own (personalized) rubrics. Although this is not part of standard decision-hygiene approaches, we believe it is an invaluable part of AMP, which applies scientifically oriented principles of decision hygiene to artistic contexts such as music analysis.
Observations on the final review phase: Internal consistency between rubrics and ratings
To promote internal consistency, we asked participants to complete a final review to align their rubrics and ratings as closely as possible. This made use of participants’ greater clarity (i.e., less noisy perspectives) about their respective classifications at T2, following the weekly discussions and corrections of self-identified errors, and resulted in changes to 175 ratings in total. We observed an equivalent amount of change in ratings during the final review regardless of whether excerpts had been discussed previously. Therefore, we doubt that doubling the length of the weekly reviews (which would have been needed to cover all excerpts) would have meaningfully improved our outcomes. We note this here both to address any theoretical concerns over the number of excerpts discussed and to offer practical insight for colleagues interested in optimally applying AMP to other topics.
We designed the final review to achieve two goals: (a) allow participants to review previous responses along with a clarified understanding of their personal rubrics, and (b) offer a chance for ranked comparisons between adjacent categories, which are known to be more accurate than absolute judgments. Based upon our understanding of decision hygiene, we suspect this leads to better alignment between rubrics and ratings. Although it is not possible to assess this quantitatively, the qualitative data afford useful insights.
The comments and rating changes made during the final review phase suggest that it helped participants align their ratings with their fully considered perspectives. For example, all five participants initially gave a rating of 1 to the excerpt from Chopin’s Prelude in D minor, which consists only of the tonic triad D F A. However, P4 later changed their rating to a 3 after reflecting upon the importance of cadences and leading tones in their finalized rubric. P3 changed their ratings of all excerpts they believed were “modal polyphonic works” to 4 to be consistent with their rubric. P4, whose rubric stressed the importance of modulating to the relative major in minor-mode pieces, gave Kalkbrenner’s Prelude in F sharp minor, op. 20, a rating of 7, despite the other participants rating it 3 or 4.
It is interesting to note that the changes discussed in the previous paragraph moved participants’ ratings for those three excerpts farther from the group mean, thereby increasing their RMSEs. This illustrates one of the idiosyncrasies of applying decision hygiene to music analysis. In some cases, greater clarity of individual perspectives, achieved through the development and use of personal rubrics, can actually increase Type A disagreement. This serves as a useful reminder that although statistical measures such as RMSE are undoubtedly useful, they tell only part of the story.
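To make this idiosyncrasy concrete, the following sketch shows how a single principled rating change can raise a rater’s RMSE relative to the group mean. The ratings and participant labels here are hypothetical, and we assume the conventional definition of RMSE; the article does not specify its exact computation.

```python
import math

def rmse(ratings, reference):
    """Root-mean-square error between one rater's scores and a reference."""
    return math.sqrt(sum((r, m) in () or (r - m) ** 2 for r, m in zip(ratings, reference)) / len(ratings)) if False else \
           math.sqrt(sum((r - m) ** 2 for r, m in zip(ratings, reference)) / len(ratings))

# Hypothetical 1-7 "majorness" ratings of three excerpts by five participants.
group = [
    [3, 3, 4],  # P1
    [4, 3, 3],  # P2
    [3, 4, 4],  # P3
    [3, 3, 3],  # P4 (before the final review)
    [4, 4, 4],  # P5
]
mean = [sum(col) / len(col) for col in zip(*group)]
before = rmse(group[3], mean)

# In the final review, P4 moves excerpt 1 from 3 to 7 to match their personal
# rubric -- away from the group consensus -- so their RMSE increases.
group[3][0] = 7
mean = [sum(col) / len(col) for col in zip(*group)]
after = rmse(group[3], mean)

assert after > before  # greater rubric fidelity, greater Type A disagreement
```

The point of the sketch is simply that RMSE measures distance from consensus, not correctness, so a rating change that better reflects an analyst’s considered perspective can register statistically as increased “error.”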
Extrapolating from controlled studies to real-world settings
To what extent do our findings regarding Type B disagreement here apply to analysis of music under more typical conditions? When considering this important issue, we note that similar questions arose when a landmark study of 208 federal judges revealed stark disparities between their independent assessments of 16 theoretical cases (Clancy et al., 1981). In contemplating how such findings translate to real-world situations, Kahneman et al. (2021) posit that
these studies, which involve tightly controlled experiments, almost certainly understate the magnitude of noise in the real world of criminal justice. Real-life judges are exposed to far more information than what the study participants received in the carefully specified vignettes of these experiments. Some of this additional information is relevant, of course, but there is also ample evidence that irrelevant information, in the form of small and seemingly random factors, can produce major differences in outcomes. (p. 16)
We find this context useful when contemplating the implications of our study. Here, each participant rated excerpts of solo piano music while studying scores and recordings presented sequentially in relatively quick succession. In contrast to normative approaches to music analysis, this represents near-optimal conditions for minimizing Type B disagreement. Therefore, the controlled conditions of this study almost certainly understate its probable magnitude in real-world conditions. In other words, the challenges of judgment that led to Type B disagreement in this study are likely present in analyses undertaken by individual scholars; however, like noise in other unverifiable judgments, they are difficult to detect in those analyses.
Outcomes of this procedure: Data, rubrics, vignettes, and changes in perspective
This proof of concept of AMP yields several useful quantitative and qualitative outcomes. For example, the quantitative data (specifically the mean final ratings of all 381 pieces) offer a useful complement to quantifications of relative mode generated using the
Beyond the relative mode data, this AMP proof-of-concept project yielded complementary qualitative data. The first are the five rubrics crafted by experienced analysts differentiating seven levels of relative mode in the same corpus of 381 excerpts. The rubrics provide insight into the five sets of ratings based upon them. In future studies, this could lead to novel ways of exploring how different perspectives on music analysis play out in a particular corpus (e.g., asking five groups of raters to evaluate sets of pieces using the five rubrics resulting from this study).
In addition to the rubrics, the exemplars offer useful anchor points for degrees of majorness/minorness, and the vignettes describing them clarify how specific features affected participants’ ratings. This is particularly noteworthy, as algorithmic predictions of relative mode (e.g. MIRtoolbox’s
Limitations and future directions
We note two limitations regarding the resultant dataset. First, although representing mode on a single spectrum from major to minor captures some of its complexity (Persichetti, 1961), our desire to include composers such as Debussy, who often eschewed tonal conventions, meant including some pieces that could not easily be accommodated. We originally considered using different rating scales for major, minor, and ambiguous-mode excerpts, but eventually decided on a single rating scale with the option of indicating atonality, for several reasons including time limitations. However, as participants flagged only around 3% of excerpts as atonal (see Supplemental Appendix F), the 4 rating appears mostly to have been assigned to excerpts combining both major and minor qualities.
A second limitation of the dataset concerns our stimuli. Although sets of preludes form a useful corpus, they were all composed for keyboard and thus fail to represent some aspects of composition, such as texture in works for instrumental or vocal ensemble or orchestra. Also, most pieces had their keys named explicitly in their titles. Although we removed this identifying information from the scores presented to participants, these works may not reflect the composers’ use of mode in other contexts. Therefore, other corpora could lead to somewhat different rubrics.
As this article focuses on evaluating the AMP procedure rather than testing hypotheses about the resultant mode ratings, we separately note two limitations of the procedure. First, AMP requires a significant amount of preparation and time for execution, as well as resources for acquiring materials and compensating participants. Second, it requires a dedicated group of analysts committed to regular meetings. Had any participants ceased their involvement midway through the project, much of its value would have been lost. To that end, scholars interested solely in average ratings might consider using only the first phase of AMP, which would capture many of the benefits of collective analysis (albeit lacking the nuanced information on individual perspectives). We recommend that colleagues interested in AMP reflect upon these trade-offs and decide upon the best use of their time and resources.
Conclusion
In this article we describe an approach to collective music analysis, adapted from best practices for noise reduction in domains involving unverifiable judgments, such as judicial sentencing. Rather than adopting those practices literally, we took care to adapt the general principles of a scientifically grounded procedure to a music-analysis context. We see this approach as essential for domains where both noise reduction and the preservation of individual perspectives (Type A disagreement) are valued. We believe that AMP effectively balances these two considerations; in this proof of concept it led not only to a rich dataset but also to rubrics and vignettes aligned with those data, making it suitable for both quantitative and qualitative analysis.
In conclusion, we note that disputes among theorists (e.g., the often-heard remark “I don’t hear it that way”) are generally presumed to reflect disagreements in perspective (what we call Type A disagreement) rather than errors or oversights (Type B disagreement). Although there are precedents for taking a qualitative approach to exploring the existence of disagreement (e.g., Bergé et al., 2009; Forte, 1965), to the best of our knowledge this is the first study of music analysis quantitatively exploring these types of disagreement. Distinguishing between Type A and Type B disagreement is difficult, if not impossible, without meaningful collaboration. Therefore, our study lends credence to the view that collaborative efforts can offer new insights through “projects that may exceed the capacities of a single individual” (Society for Music Theory [SMT], 2018), and we hope others will find this project and the detailed information in the appendices and supplementary materials useful in their own work.
Supplemental Material
sj-docx-1-msx-10.1177_10298649251385727 – Supplemental material for Analysis from multiple perspectives (AMP): Applying decision hygiene to analysis of musical structure, by Max Delle Grazie, Cameron J Anderson, Jonathan De Souza and Michael Schutz, in Musicae Scientiae
sj-docx-2-msx-10.1177_10298649251385727 – Supplemental material for Analysis from multiple perspectives (AMP): Applying decision hygiene to analysis of musical structure
Acknowledgements
The authors would like to thank Jordan McClean and Jamie Ling for their contributions to score and stimuli preparation, and Konrad Sweirczek for helpful insight on disagreement in music analysis. This study was inspired to a great extent by
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was conducted in the Don Wright Faculty of Music through a Visiting Research Chair award from Western University in Fall of 2022 in conjunction with support from a Social Sciences and Humanities Research Council of Canada (SSHRC) Insight Grant to MS. CJA is supported in part by funding from the Social Sciences and Humanities Research Council.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
