Abstract
Those with a watch know the time. Those with two are never sure.
Introduction
Music is a complex art form, involving interactions between rhythm, melody, harmony, texture, timbre, and dynamics. Although analyses of basic musical building blocks (e.g., isolated notes and chords) have objective answers (i.e., are easily verifiable), the sophisticated interplay of these elements and interpretive flexibility mean that many analyses lack verifiable answers, even when addressing fundamental concepts such as chord function:
The first chord of Wagner’s Tristan Prelude—F B D♯ G♯—is notoriously resistant to analysis, or at least seemingly impervious to consensus among analysts . . . More than a hundred years of debate have done little to diminish its capacity to fascinate, and to vex, music theorists. (Martin, 2008, p. 6)
Although extreme, this is not an isolated example. Prominent works by J.S. Bach (Larson, 1997), Schoenberg (1983), and Beethoven (Byros, 2012) have elicited heated debates over the form and function of compositional elements. Traditionally, such arguments are treated as differences of opinion, which are not uncommon in pursuits lacking verifiable answers. However, research into the process of judgment itself suggests this might not tell the full story.
In their book
The importance of reducing noise in judgment
Noise is particularly problematic in high-stakes judgments such as judicial sentencing. For example, disparities between judges can be “so pronounced that a defendant sentenced to three years by one judge would have been sentenced to twenty years had he been assigned to another” (C. S. Yang, 2014, p. 1). Such observations have long been of concern to judges themselves (Frankel, 1973), sparking significant interest in noise reduction within the criminal justice system (Clancy et al., 1981). Judicial sentencing is one of the prime examples referenced by Kahneman et al. (2021) in their book-length review of noise in judgment across many domains (a full summary of which is beyond the scope of this article). Consequently, we focus here only upon the details most pertinent to music analysis, which we believe are of great relevance given that “wherever there is judgment there is noise, and more of it than you think” (Kahneman et al., 2021, p. 12).
It is crucial to acknowledge one important distinction between judgments in music analysis and those in fields more commonly discussed in noise-reduction research. Variability in sentences issued by judges (e.g., lenient vs harsh) is always undesirable, as punishment should reflect the severity of the crime rather than the severity of the judge. However, variability between music analysts is not, as it is a natural consequence of differences in perspective. Yet not all variability is desirable in music analysis. For example, the musical properties of a piece should change neither as a function of the time of day of evaluation, nor the order of evaluation, nor the paper upon which it is printed (all factors shown to erroneously affect judgments in previous research). Therefore, in the context of music analysis, only undesirable variability meets the traditional definition of noise.
Kahneman et al. (2021) extensively detail different types of noise and the sophisticated statistical approaches that are useful for detecting them. In the interest of space, here we simply differentiate variability that is undesirable (errors/oversights) from variability that is not. We will refer to undesirable variability in judgments of music as Type B disagreement, referring to disagreement resolved after considering other perspectives and correcting self-identified errors. We contrast this with Type A disagreement, defined simply as disagreement remaining after resolving Type B disagreement.
Although the implications of this framework will be discussed later, we introduce the terms themselves here as they play an important role in guiding our approach to applying principles of
Reducing noise in judgment: Approaches to decision hygiene
The average of many evaluations is typically more accurate than any individual evaluation (Surowiecki, 2005), a phenomenon long referred to as
The wisdom-of-crowds concept relies upon independent assessments, reducing the risk of groupthink—the suppression of dissent and of alternative views—which can be a major problem in decision making (Janis, 1972). For example, an esteemed professor’s analysis of a piece may encourage a group of graduate students to seek confirmatory evidence, causing the group to overlook alternative approaches and reach artificial consensus. However, full independence of evaluators removes the opportunity to recognize overlooked information and/or gain new insight from other evaluators. This presents a paradox: group discussions can be both helpful and harmful. For this reason, harnessing the benefits of collective decision making requires careful consideration of how information is shared.
The Delphi technique (Dalkey, 1969) is a widely used procedure balancing the costs and benefits of information sharing. In this approach, participants submit independent estimates to a moderator who collates the responses and shares them anonymously. Each participant reviews and discusses this information before (optionally) re-visiting their estimates. The anonymity of responses and re-evaluations helps to mitigate the harmful effects of groupthink (Kahneman et al., 1998). Although we base key aspects of our approach on the Delphi method, we recognize two challenges associated with applying it directly, as decision hygiene in music analysis presents a special case of noise reduction.
The first challenge of using a traditional Delphi approach to music analysis is that simply averaging across ratings assumes all variability is noise (i.e., undesirable). Although true in the case of judicial sentencing, this is not the case for music analysis, where decision hygiene should remove Type B disagreement while preserving Type A disagreement. The second challenge is that simply averaging independent ratings offers no insight into the different perspectives of raters. In principle, deviations in ratings based upon a shared and clearly articulated rubric could be considered noise. However, any process of decision hygiene presuming a single perspective would make assumptions not shared by music theorists. Therefore, in addition to removing Type B disagreement, we felt it important for our approach to clarify perspectives. This goal requires more than simply averaging independent ratings. Keeping these two points in mind, we developed a variant on the traditional Delphi technique, which we call
Previous interest in multiple perspectives in music analysis
In the domain of music theory, interest in understanding different perspectives is certainly not unprecedented. For example, the
Although the analysis symposia and book by Bergé et al. (2009) illustrate an interest in the role of multiple perspectives in music analysis, these particular approaches have not been widely embraced. According to the tables of contents in JMT, the symposia took place from 1966 to 1974 yet have now “all but ceased” (Goldenberg, 2006, p. 50). One reason for this lack of interest can be found in the music theorist David Damschroder’s observation that although Bergé et al.’s collection offers some insight into multiple perspectives, “their authors proceed seemingly unaware of the book’s remaining contents . . . Several of the essays offer critiques of analyses already in print, but little attempt was made to draw connections among those included in the collected work” (Damschroder, 2010, p. 1).
Clarifying our goals
Here we explore the application of techniques for noise reduction profitably employed in fields other than music analysis, which balance independent assessments with collaborative discussions (in contrast with the siloed approach used in the JMT analysis symposia). To the best of our knowledge, this is the first such application of decision hygiene to music analysis, complementing recent applications to music on issues ranging from performance evaluation (Passarotto et al., 2023), to the development of tools for assessing musicians’ health literacy (Baadjou et al., 2019) and fatigue management (McCrary & Altenmüller, 2020).
We see our approach to collective music analysis as paradoxically both a natural outgrowth of, and an abrupt departure from, traditional analytical approaches. On the one hand, it builds on, and addresses some of the shortcomings of, previous efforts such as the JMT analysis symposia and the essays published by Bergé et al. (2009). On the other hand, it also represents a departure from normative approaches in a field prioritizing single-author analyses, often of single pieces. As such, it is important to acknowledge previous critiques of such attempts.
Some have previously argued that using scientific approaches to human questions risks “ . . . vaporizing [participants’] . . . sense of their personhood by treating them as instances of an impersonal rule” (Klausner, 1970, p. 101). This long-standing concern should not be taken lightly. We do not think analyses would be improved if theorists always agreed (i.e., achieving a Type A disagreement of 0). In fact, reducing variability by simply ignoring the perspectives of different analysts would be antithetical to the entire purpose of music analysis, which is to gain a deeper understanding of musical structure—often requiring perspectives that diverge from traditional thinking.
Differentiating between desirable variability and undesirable variability is crucial, as noise can profoundly affect the outcomes of a study. For example, the addition of noisy data to an analysis can affect conclusions about statistical significance, even if the average effect changes only minimally (Andrade, 2013). To overcome this challenge, we developed AMP with the goal of retaining variability arising from true differences in perspective, while minimizing variability arising from noise (i.e., analytical oversights). We believe approaches to decision hygiene such as the Delphi method can reduce Type B disagreement, which ultimately enhances the clarity of individual perspectives (i.e., Type A disagreement). To the best of our knowledge, the distinction between Type A and Type B disagreement has been discussed neither in the context of music analysis, nor in the broader noise reduction literature. 2 However, we believe it is crucial for applying decision-hygiene research to contexts where understanding different perspectives can be as important as the conclusions following from them.
Applying decision hygiene to music analysis
As assessing unverifiable judgments is challenging, their improvement requires focusing not on the outcome of a judgment, but rather on the process by which it is made (Kahneman et al., 2019). Common recommendations include: (1) breaking complex decisions into smaller parts and avoiding discussing or finalizing them until all the parts have been considered; (2) managing the flow of information such that each evaluator makes independent judgments before taking part in group discussion (minimizing groupthink); (3) aggregating across multiple perspectives; and (4) basing all evaluations on the same, objectively defined scale or rubric, or using forced ranking to ensure all participants use the same criteria.
Although the first two recommendations above can be applied directly, the latter two are more challenging for any endeavor involving the assessment of artistic works. The challenge posed by recommendation (3) can be resolved by simply retaining individual ratings alongside the aggregate; recommendation (4) is more problematic, as different analysts might reasonably focus on different aspects of the same passage, making it undesirable and/or counterproductive to require a common scale. Although we believe mitigating Type B disagreement is always beneficial, preserving disagreement related to the use of different approaches (Type A disagreement) is crucial in the context of music analysis.
Analysis from multiple perspectives (AMP): Proof of concept
Inspired by research demonstrating how noise reduction techniques improve judgments of unverifiable properties, we developed AMP in the hope that it could be applied to many topics in music analysis (e.g., locating modulations, analyzing phrase structure, and/or identifying thematic material). Therefore, we see this article as a proof-of-concept regarding AMP, rather than a study of a specific musical property per se. Nonetheless, as we wanted our topic to be of broad relevance, we chose to focus on
Music theorists have previously observed that we hear the diatonic modes “as alterations of these more familiar [major and minor] scales” (Clendinning & Marvin, 2016), given that the Lydian and Mixolydian modes are “major-like,” with raised fourth and lowered seventh scale degrees, respectively, and that the Dorian and Phrygian modes are “minor-like” (Schoenberg & Stein, 1969). Furthermore, participants evaluating melodies in an experiment reported, unsurprisingly, that those in Ionian (major) scales are perceived as happier than those in Aeolian (minor), yet those in Phrygian sound “even sadder” than those in Aeolian (Temperley & Tan, 2012). This work illustrates that treating modality as a continuum has precedent in the fields of music theory and music cognition.
In addition to theoretical interest from music analysts, data on relative mode have practical implications for other fields. For example, music information retrieval software such as MIRtoolbox offers an estimation through
When making judgments of relative mode, it is important to acknowledge that although classifying isolated scales or chords as major or minor is straightforward, identifying mode in complete passages is more challenging, and requires a degree of judgment. Differences of opinion on the importance of leading tones, cadences, melodies, and harmonies inevitably lead to divergent perspectives. Nonetheless, evaluation of relative mode is not merely a matter of taste, as some judgments are simply incorrect (e.g., identifying a passage with only major chords as strongly minor). As such, judgments of relative mode reflect the concept of
Although in principle a common rubric can reduce disagreement, here participants crafted their own rubrics throughout the procedure, documenting the musical events and/or devices informing their mode ratings. We asked participants to craft rubrics such that they sorted excerpts into three categories of major and minor ratings (six categories in total), with roughly equal numbers in each category. We also asked them to identify one category of ambiguous excerpts (described in the Method section).
We chose to use personal rubrics as they (1) bolster consistency in individual participants’ ratings, (2) allow participants to refine their perspectives individually (mitigating risks of groupthink) over the course of the project, and (3) offer useful information on how/why participants might systematically disagree, providing unique insights articulated clearly for future reference. We acknowledge that this breaks markedly from traditional noise-reduction approaches, where common rubrics are recommended. However, in those situations individual perspectives are generally undesirable (e.g., disparities in criminal sentencing), whereas in music analysis, individual differences in perspective are not problematic, but in many cases desirable.
Method
Participants
We recruited five (two male, three female) graduate music students from Western University as participants, each of whom received an honorarium of $800. All five had advanced musical training, holding degrees in subjects ranging from music theory to music performance and musicology. Two of the five participants had advanced training in piano, with one completing a Doctor of Musical Arts in piano performance during the study (see Supplemental Appendix E for participants’ biographies). Consistent with past practice in the Don Wright Faculty of Music, we viewed these raters as research assistants rather than participants, as they were recruited through a job posting advertised to graduate music students already employed in the Faculty and compensated accordingly. However, as they participated in the project as raters, we will henceforth refer to them as participants.
Materials
We gathered scores and recordings of 16 sets of pieces (henceforth
Score and recording details.
Recordings
As most sets are widely performed and recorded, we followed a previously developed procedure for selecting a prominent performer. This involved identifying the five performances of each complete set that appear most often in the Naxos Music Library (NAXOS; Heymann, 2019) and Classical Music Archives (CMA; Schwob, 2019), then selecting the highest-ranking performer appearing in both lists (for further details see Kelly et al., 2021). This identified the album of choice for 13 of 16 sets, but not for those by Hummel, Guillet, or Fischer, as these three had no identifiable commercial piano recordings.
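This selection rule can be sketched in a few lines. The performer names below are hypothetical placeholders (not the actual NAXOS/CMA data), and we assume that “highest-ranking” refers to rank in the NAXOS list:

```python
def select_performer(naxos_top5, cma_top5):
    """Return the highest-ranking performer appearing in both
    top-five lists (rank taken from the NAXOS ordering), or
    None if the lists share no performer."""
    cma = set(cma_top5)
    for performer in naxos_top5:  # ordered most- to least-frequent
        if performer in cma:
            return performer
    return None

# Hypothetical example:
naxos = ["Performer A", "Performer B", "Performer C", "Performer D", "Performer E"]
cma = ["Performer B", "Performer F", "Performer A", "Performer G", "Performer H"]
print(select_performer(naxos, cma))  # Performer A
```

When the two lists share no performer, as happened for the Hummel, Guillet, and Fischer sets, the rule returns no answer and an alternative source must be found.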
The set by Hummel (Op. 67) can be found on YouTube (Tao, 2012), which we captured using Audacity® (The Audacity Team, 2007) at a 44.1 kHz sample rate. According to the video creator, the performance, which features dynamics and tempo changes, was sequenced in SONAR4 using the EastWest Boesendorfer 290 Virtual Studio Technology (VST) plug-in. To the best of our knowledge there is no recording available of the set by Guillet. We therefore encoded each of the preludes as MIDI files, using the Addictive Keys Studio Grand VST piano and exporting them as .wav files sampled at 44.1 kHz, matching their tempi and dynamics to a MIDI encoding realized by recorder (Aeolian Consort, 2019). Rather than using a commercial recording of the set by Fischer (composed for organ, not piano), we encoded each of the preludes as we had done with those of Guillet, using a piano timbre and aligning the average root mean squared energy and tempo to that of the organist’s recording of the set. We undertook these steps for the sets by Guillet and Fischer to avoid presenting participants with sets performed with timbres clearly different from the piano used in 14 of the 16 sets.
To avoid an abrupt cutoff after the 8-measure excerpts, and following Battcock and Schutz (2019), we faded out the recording of each prelude after eight measures. The fadeout began at the onset of the ninth measure and lasted 2 seconds. We assigned each audio file a 3-digit ID number corresponding to the respective notated excerpt (as discussed below).
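A minimal sketch of such a fadeout, assuming a linear ramp (the fade curve is not specified above) and using a tiny illustrative signal rather than real audio:

```python
def fade_out(samples, fade_start_s, fade_dur_s=2.0, sr=44100):
    """Scale samples by a linear ramp from 1 to 0 beginning at
    fade_start_s (the onset of the ninth measure) and lasting
    fade_dur_s seconds; everything after the ramp is silenced."""
    out = list(samples)
    start = int(fade_start_s * sr)
    n = max(1, int(fade_dur_s * sr))  # ramp length in samples
    for i in range(start, len(out)):
        k = i - start
        gain = max(0.0, 1.0 - k / n)  # 1 at fade start, 0 after fade_dur_s
        out[i] *= gain
    return out

# Illustrative: 2 seconds at sr=4 samples/s, fading from t=0.5 s over 0.5 s
signal = [1.0] * 8
faded = fade_out(signal, fade_start_s=0.5, fade_dur_s=0.5, sr=4)
print(faded)  # [1.0, 1.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
```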
Scores
We produced 9-measure excerpts of the score of each prelude using digital scans of the score and Microsoft PowerPoint, visually fading the ninth measure so that nothing was visible from the 10th measure onward. We removed any information from the score excerpts that could identify the composer. Fifteen of Hummel’s Op. 67 preludes consist of fewer than nine measures, so we included these preludes in full. One of Kalkbrenner’s Op. 20 preludes opens with a long cadenza-like improvisatory section, so we produced an excerpt comprising the next nine measures of the prelude. We assigned each score excerpt an ID number corresponding to its audio file.
Prior to the main analysis, participants analyzed a practice set using materials prepared in the manner described above. This practice set consisted of 24 preludes: 10 by Busoni (1881/1927) (five major and five minor), four by Clementi (1811/1896) (two major and two minor), and 10 by Kabalevsky (1934/1947) (five major and five minor).
Procedure
For each participant we prepared a package consisting of (1) the 381 score excerpts (order randomized uniquely for each participant); (2) instructions (Supplemental Appendix A); (3) links to a shared Google Drive folder with audio files; and (4) links to individual rating sheets on Google Drive. In the instructions we asked participants to rate the mode of the score excerpts (major/minor) in the order presented, referring to the corresponding audio file as necessary. We explicitly requested participants rate the relative mode of the music, rather than how the performance sounds.
For each excerpt, we asked participants to indicate (1) mode on a 7-point Likert-type scale from 1 (
Before distributing materials, we held a group meeting in which we described the goals of the study, provided a timeline, discussed compensation, and explained that the focus was on internal consistency rather than group agreement. We asked participants to aim to distribute their ratings evenly across the three categories of minor (most, middle, and least, corresponding to 1, 2, and 3 on the Likert-type rating scale) and the three categories of major (least, middle, and most, corresponding to 5, 6, and 7). During the meeting we also pointed out that the category of 4 need not necessarily be used as much as the ratings of 1–3 (minor) and 5–7 (major).

We included a column on participants’ response spreadsheets to flag whether an excerpt was atonal (yes/no), so that a rating of 4 could be distinguished as reflecting either (1) an equal balance of major and minor or (2) atonality. We also explained that the ninth measure of excerpts (where applicable) should only be considered when it clarified the preceding measures (e.g., a modulation clarifying a previously ambiguous passage).

Beyond following these general guidelines, we asked participants to use their own musical intuition and discretion, telling them that they should feel free to modify previous ratings as the procedure unfolded and their understanding increased. We also asked them to generate a personal rubric to guide and explain their ratings (see Table 2 for an example and Supplemental Appendix E), containing information sufficient for future analysts to reconstruct their thinking and arrive at similar ratings should they strictly follow the rubric. Participants updated their rubrics throughout the course of the AMP procedure.
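A participant could check the balance of their ratings with a simple tally across the seven scale points (1–3 minor, 4 ambiguous/balanced, 5–7 major); the ratings below are hypothetical:

```python
from collections import Counter

def category_balance(ratings):
    """Tally how many excerpts received each of the seven
    scale points (1-3 minor, 4 ambiguous, 5-7 major)."""
    return Counter(ratings)

# Hypothetical ratings of twelve excerpts by one participant:
ratings = [1, 2, 2, 3, 3, 3, 4, 5, 5, 6, 7, 7]
print(sorted(category_balance(ratings).items()))
```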
Sample rating rubric.
In the
In the
At the beginning of each meeting, we reminded participants that the goal was not agreement, but to consider all perspectives. After each meeting we invited participants to update their ratings as necessary, encouraging them to add explanatory notes. Over the course of the 12 meetings, we reviewed the 192 (50.4%) of the 381 excerpts with the most disagreement. Participants changed 251 ratings (13.2% of the 5 × 381 = 1,905 total ratings) during this phase (see Figure 1), reflecting new insights and/or a better understanding of 155 (40.7%) of the 381 excerpts.
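Counts of this kind follow directly from comparing the rating matrices before and after a phase. The small matrices below are hypothetical stand-ins for the actual 381 × 5 data:

```python
def summarize_changes(before, after):
    """Count changed ratings and the number of excerpts with at
    least one change, given two {excerpt_id: [ratings]} matrices."""
    changed_ratings = 0
    changed_excerpts = 0
    total = 0
    for excerpt, old in before.items():
        new = after[excerpt]
        diffs = sum(1 for a, b in zip(old, new) if a != b)
        changed_ratings += diffs
        changed_excerpts += 1 if diffs else 0
        total += len(old)
    return changed_ratings, changed_excerpts, 100 * changed_ratings / total

# Hypothetical ratings from five participants for three excerpts:
t1 = {"001": [1, 2, 2, 1, 3], "002": [6, 6, 7, 5, 6], "003": [4, 4, 4, 4, 4]}
t2 = {"001": [2, 2, 2, 1, 3], "002": [6, 6, 7, 5, 6], "003": [4, 3, 4, 4, 4]}
print(summarize_changes(t1, t2))  # (2, 2, 13.33...)
```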

Stages of study.
In the
After completing the final review phase, we asked participants to attend three further weekly meetings in which they wrote brief vignettes describing 12 excerpts and noting specific structural elements of each. We selected seven of these 12 as category exemplars (Supplemental Appendix B; Figure 2), defined as pieces with unanimous ratings (e.g., all participants assigned Shostakovich’s

The incipit of each of the category exemplars shown in Supplemental Appendix B.
Score incipits of category exemplars
Before the first of the three vignette-writing meetings, we gave participants an example by author JDS (Supplemental Appendix D). After writing a practice vignette, they wrote vignettes individually for an excerpt unanimously rated 4 (Debussy’s
After the third meeting, we asked participants to write five additional vignettes for excerpts receiving a range of ratings (Supplemental Appendix C). These would provide insight into how their ratings had been influenced by different structural elements and thus why they had disagreed. We chose these excerpts by sampling from five levels of disagreement (i.e., the top 1.8%, 5.5%, 8.1%, 14.2%, and 25.5%) while aiming for diversity in composer, historical era, and the nominal mode. Taken together, the vignettes in Supplemental Appendices B and C, and the participants’ biographies in Supplemental Appendix E, suggest how their musical training may have influenced decision making. For example, one participant pursuing a PhD in music theory (P4) emphasized cadences and functional harmonic movement because of their interest in chord-function theory; another with a strong background in jazz (P5) gave more weight to the bass line and less to chromaticism.
Results
The quantitative data (i.e., mode ratings) offer two distinct types of information. The first comprises the means of the five participants’ ratings for all 381 pieces at T3, which can be considered a reference standard (i.e., information assumed to be a reliable best estimate) of relative mode. The second comprises the variability across the five participants’ ratings for all 381 pieces. We see the averages as useful outcomes of this proof-of-concept project, and the analysis of variability as helpful in understanding the effect of AMP itself. Consequently, we address each in turn before considering how review phases affected Type A and Type B disagreement.
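Both types of information can be computed in a few lines. The matrix below is a hypothetical stand-in for the 381 × 5 ratings at T3, and the sample standard deviation is an assumption (the denominator is not specified above):

```python
from statistics import mean, stdev

def reference_and_variability(ratings):
    """For each excerpt, return (mean rating, standard deviation)
    across the five participants' ratings."""
    return {eid: (mean(r), stdev(r)) for eid, r in ratings.items()}

# Hypothetical T3 ratings (five participants per excerpt):
t3 = {"001": [1, 2, 2, 1, 2], "002": [6, 6, 7, 5, 6]}
for eid, (m, s) in reference_and_variability(t3).items():
    print(eid, round(m, 2), round(s, 2))
```

The per-excerpt means serve as the reference standard; the per-excerpt standard deviations track how much the five raters diverge.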
Exploring mean ratings across different timepoints
To assess changes in ratings across timepoints, we conducted a series of Wilcoxon signed-rank tests, calculating test statistics and accompanying
Figure 3 depicts changes in the average ratings of the 381 excerpts between T1 and T3. Average mode ratings increased for 96 excerpts (25.20% of excerpts; T1:

Changes in average ratings of excerpts between T1 and T3.
To evaluate how the AMP procedure affected the average ratings of the 381 excerpts between initial (
The effect of AMP on rating variability at each timepoint
To gain insight into the degree of consensus among participants, we examined the consistency and variability of ratings across timepoints. To assess inter-rater reliability, we computed alpha coefficients for ratings of each excerpt at T1, T2, and T3 using the
We measured the variability in ratings of each excerpt using standard deviations (

Standard deviations of ratings by number of excerpts reviewed.
Analysis of final review stage
Figure 5 summarizes the effect of the final review phase, showing all 175 changes made by participants working to ensure consistency between their rubrics and evaluations. As shown in the summary panel (bottom right), most of these changes entailed adjustments to adjacent categories (e.g., changing a rating of 3 to 2), although we observed a few instances of changes spanning two categories (e.g., changing a 6 to a 4). In total, the five participants changed 30, 30, 58, 24, and 33 ratings, respectively.

Changes in ratings during final review.
We next assessed whether the inclusion of excerpts in the group discussion phase affected changes in the final review phase using a Mann-Whitney test. The test revealed the magnitude of change in the final review phase for the 70 excerpts discussed (
Analyzing types of disagreement
To explore Type A and Type B disagreement using statistical methods, we assumed that participants’ ratings at T3 reflect their best assessments and compared ratings in each phase to the mean rating at T3. Although this is an admittedly imperfect approach, computing deviations from the mean is normative when assessing unverifiable judgments, such as disparities in judicial sentencing or setting insurance premiums (Kahneman et al., 2021). Using this estimate, we quantified disagreement at each of the timepoints depicted in Figure 1 (T1–3) with RMSEs:
RMSE_it = √[(1/5) Σ_j (r_ijt − m_i)²],

where r_ijt is participant j’s rating of excerpt i at timepoint t (j = 1, …, 5), and m_i is the mean rating of excerpt i at T3.
Presumably, T1 ratings contain both Type A disagreement and Type B disagreement (the latter resolved through subsequent discussion and/or reflection). Therefore, assuming the RMSE of ratings at T3 reflects Type A disagreement (which remains after the full AMP procedure), we inferred Type B disagreement by subtracting Type A from total disagreement at a given timepoint. Doing this at each timepoint (T1, T2, T3) offers insight into the amount of Type B disagreement across the procedure. We carried out Wilcoxon signed-rank tests, finding RMSEs of the 381 excerpts decreased significantly between T1 (RMSE = 0.84) and T2 (RMSE = 0.60),
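A minimal sketch of this decomposition, using hypothetical ratings for a single excerpt and the T3 per-excerpt means as the reference standard:

```python
from math import sqrt

def rmse_per_excerpt(ratings_t, reference_means):
    """RMSE of the five participants' ratings of each excerpt
    against that excerpt's mean rating at T3."""
    return {
        eid: sqrt(sum((r - reference_means[eid]) ** 2 for r in rs) / len(rs))
        for eid, rs in ratings_t.items()
    }

# Hypothetical ratings for one excerpt at two timepoints:
t1 = {"001": [1, 3, 2, 4, 2]}
t3 = {"001": [2, 2, 2, 3, 2]}
ref = {eid: sum(rs) / len(rs) for eid, rs in t3.items()}  # T3 means

total_t1 = rmse_per_excerpt(t1, ref)["001"]  # total disagreement at T1
type_a = rmse_per_excerpt(t3, ref)["001"]    # disagreement remaining at T3
type_b = total_t1 - type_a                   # inferred Type B at T1
```

Because the T3 RMSE is taken to represent Type A disagreement, the T1 and T2 RMSEs minus this value estimate how much Type B disagreement each phase of the procedure removed.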
Discussion
Analysis of the average ratings of each of the 381 pieces illustrates a small but statistically significant change in ratings between T1 (following initial review) and T3 (following final review). This presumably represents a shift closer to the true individual perspectives of participants. However, as the accuracy of unverifiable judgments is ultimately unknowable, and spurious variability contributes to noise (Andrade, 2013), the effect of AMP is best seen through changes in the variability of ratings for each piece.
We note a significant reduction in RMSE (from 0.84 to 0.58) and increased consistency (α = .82 to α = .90) from T1 to T3. The reduction of Type B disagreement (disagreement resolved by correcting errors and considering other perspectives) is one of the strengths of AMP: in addition to improving piece-wise averages, it helps clarify individual perspectives (Type A disagreement). Consequently, we now turn our discussion to the specific ways in which this approach to decision hygiene improved participants’ ratings.
Observations on weekly review: Outcomes of group discussions
In reviewing the qualitative comments and reflecting upon deliberations during the group discussion phase, we note four distinct ways in which participants reduced Type B disagreement between T1 and T2. First, they corrected errors (e.g., P1 changed their rating of one excerpt from 4 to 2 having realized they had misread a clef). Second, they had new personal insights (e.g., P5 changed their rating of the excerpt from Bach’s Prelude in F sharp minor from 2 to 3 following discussion). Third, they gained insight from the ratings and explanatory notes of other participants (e.g., P3 changed their rating of the excerpt from Guillet’s Fantasy no. 15 from 4 to 3, writing “several people noted how it sounds a bit more minor than major, despite having a strong tonal center. I came to agree with that”). Fourth, they updated their personal rubrics (e.g., P4 changed their rating of the excerpt from Kapustin’s Prelude in F minor from 1 to 2, saying “[This was] one of the first ones I [rated] so it doesn’t fit my rubric [anymore. I’m] changing [my rating] to a 2 [since] it has a raised sixth, but no raised seventh”).
Many Type B disagreements reflect non-trivial differences. For example, P1 changed their rating of the excerpt from Rachmaninoff’s Prelude in B major from 5 to 6 “after considering a comment about minor chords fitting more easily into major [keys] than major chords fitting into minor [keys].” P3 changed their rating of the excerpt from Kalkbrenner’s Prelude in D-flat major from 5 to 6, noting “valid points about conventional harmonic movement and cadences—something that, according to my own rubric, makes this piece fit into a higher major third [of ratings].” Some Type B disagreement reflects participants learning from one another in the weekly reviews. P5 changed their rating of the excerpt from Guillet’s Fantasy no. 16 from 6 to 4 after being
convinced in our discussions that these modal chorales should be analyzed more in context of the individual lines than in timing of harmonies. Based on that understanding, [it] seems far less major than initially, but much more confusing and ambiguous.
As some degree of consensus may have arisen from discussion (as with the Guillet excerpt above), it is important to consider the role of groupthink. To that end, we aimed to mitigate its harmful effects by (1) asking participants to provide initial ratings prior to any discussion, and (2) having participants make any changes to their ratings privately, after the group discussion. We took these steps based on findings from Kahneman et al. (1998) demonstrating that gathering independent ratings from jury members and aggregating them before group deliberation affords the benefits of considering different perspectives while mitigating the problems of groupthink. In addition, we opened each weekly review meeting by reminding participants that the goal of the discussions was not to find agreement, but rather to consider all perspectives. Finally, we included a third step for reducing groupthink unique to our application of decision hygiene: after the weekly discussion phase, participants completed a final review phase independently, with the goal of ensuring their ratings aligned with their own (personalized) rubrics. Although this is not part of standard decision-hygiene approaches, we believe it is an invaluable part of AMP, which applies scientifically oriented principles of decision hygiene to artistic contexts such as music analysis.
Observations on the final review phase: Internal consistency between rubrics and ratings
To promote internal consistency, we asked participants to complete a final review to align their rubrics and ratings as closely as possible. This made use of participants’ greater clarity (i.e., less noisy perspectives) about their respective classifications at T2, following the weekly discussions and corrections of self-identified errors, and resulted in changes to 175 ratings in total. We observed an equivalent amount of change in ratings during the final review regardless of whether excerpts had been discussed previously. Therefore, we doubt that doubling the length of the weekly reviews (which would have been needed to cover all excerpts) would have meaningfully improved our outcomes. We note this here both to address any theoretical concerns over the number of excerpts discussed and to offer practical insight for colleagues interested in optimally applying AMP to other topics.
We designed the final review to achieve two goals: (a) allow participants to review previous responses along with a clarified understanding of their personal rubrics, and (b) offer a chance for ranked comparisons between adjacent categories, which are known to be more accurate than absolute judgments. Based upon our understanding of decision hygiene, we suspect this leads to better alignment between rubrics and ratings. Although it is not possible to assess this quantitatively, the qualitative data afford useful insights.
The comments and rating changes made during the final review phase suggest that it helped participants align their ratings with their fully considered perspectives. For example, all five participants initially gave a rating of 1 to the excerpt from Chopin’s Prelude in D minor, which consists only of the tonic triad D F A. However, P4 later changed their rating to a 3 after reflecting upon the importance of cadences and leading tones in their finalized rubric. P3 changed their ratings of all excerpts they believed were “modal polyphonic works” to 4 to be consistent with their rubric. P4, whose rubric stressed the importance of modulating to the relative major in minor-mode pieces, gave Kalkbrenner’s Prelude in F sharp minor, op. 20, a rating of 7, despite the other participants rating it 3 or 4.
It is interesting to note that the changes discussed in the previous paragraph moved participants’ ratings for those three excerpts farther from the group mean, thereby increasing their RMSEs. This illustrates one of the idiosyncrasies of applying decision hygiene to music analysis. In some cases, greater clarity of individual perspectives, achieved through the development and use of personal rubrics, can actually increase Type A disagreement. This serves as a useful reminder that although statistical measures such as RMSE are undoubtedly useful, they tell only part of the story.
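To make this idiosyncrasy concrete, the following sketch shows how a single principled rating change can raise a rater’s RMSE relative to the group mean. The ratings and participant labels here are hypothetical, and we assume the conventional definition of RMSE; the article does not specify its exact computation.

```python
import math

def rmse(ratings, reference):
    """Root-mean-square error between one rater's scores and a reference."""
    return math.sqrt(sum((r, m) in () or (r - m) ** 2 for r, m in zip(ratings, reference)) / len(ratings)) if False else \
           math.sqrt(sum((r - m) ** 2 for r, m in zip(ratings, reference)) / len(ratings))

# Hypothetical 1-7 "majorness" ratings of three excerpts by five participants.
group = [
    [3, 3, 4],  # P1
    [4, 3, 3],  # P2
    [3, 4, 4],  # P3
    [3, 3, 3],  # P4 (before the final review)
    [4, 4, 4],  # P5
]
mean = [sum(col) / len(col) for col in zip(*group)]
before = rmse(group[3], mean)

# In the final review, P4 moves excerpt 1 from 3 to 7 to match their personal
# rubric -- away from the group consensus -- so their RMSE increases.
group[3][0] = 7
mean = [sum(col) / len(col) for col in zip(*group)]
after = rmse(group[3], mean)

assert after > before  # greater rubric fidelity, greater Type A disagreement
```

The point of the sketch is simply that RMSE measures distance from consensus, not correctness, so a rating change that better reflects an analyst’s considered perspective can register statistically as increased “error.”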
Extrapolating from controlled studies to real-world settings
To what extent do our findings regarding Type B disagreement here apply to analysis of music under more typical conditions? When considering this important issue, we note that similar questions arose when a landmark study of 208 federal judges revealed stark disparities between their independent assessments of 16 theoretical cases (Clancy et al., 1981). In contemplating how such findings translate to real-world situations, Kahneman et al. (2021) posit that
these studies, which involve tightly controlled experiments, almost certainly understate the magnitude of noise in the real world of criminal justice. Real-life judges are exposed to far more information than what the study participants received in the carefully specified vignettes of these experiments. Some of this additional information is relevant, of course, but there is also ample evidence that irrelevant information, in the form of small and seemingly random factors, can produce major differences in outcomes. (p. 16)
We find this context useful when contemplating the implications of our study. Here, each participant rated excerpts of solo piano music while studying scores and recordings presented sequentially in relatively quick succession. In contrast to normative approaches to music analysis, this represents near-optimal conditions for minimizing Type B disagreement. Therefore, the controlled conditions of this study almost certainly understate its probable magnitude in real-world conditions. In other words, the challenges of judgment that led to Type B disagreement in this study are likely present in analyses undertaken by individual scholars; however, like noise in other unverifiable judgments, they are difficult to detect in those analyses.
Outcomes of this procedure: Data, rubrics, vignettes, and changes in perspective
This proof of concept of AMP yields several useful quantitative and qualitative outcomes. For example, the quantitative data (specifically the mean final ratings of all 381 pieces) offer a useful complement to quantifications of relative mode generated using the
Beyond the relative mode data, this AMP proof-of-concept project yielded complementary qualitative data. The first are the five rubrics crafted by experienced analysts differentiating seven levels of relative mode in the same corpus of 381 excerpts. The rubrics provide insight into the five sets of ratings based upon them. In future studies, this could lead to novel ways of exploring how different perspectives on music analysis play out in a particular corpus (e.g., asking five groups of raters to evaluate sets of pieces using the five rubrics resulting from this study).
In addition to the rubrics, the exemplars offer useful anchor points for degrees of majorness/minorness, and the vignettes describing them clarify how specific features affected participants’ ratings. This is particularly noteworthy, as algorithmic predictions of relative mode (e.g. MIRtoolbox’s
Limitations and future directions
We note two limitations regarding the resultant dataset. First, although representing mode on a single spectrum from major to minor captures some of its complexity (Persichetti, 1961), our desire to include composers such as Debussy, who often eschewed tonal conventions, meant including some pieces that could not easily be accommodated. We originally considered using different rating scales for major, minor, and ambiguous-mode excerpts, but eventually decided on a single rating scale with the option of indicating atonality, for several reasons including time limitations. However, as participants flagged only around 3% of excerpts as atonal (see Supplemental Appendix F), the 4 rating appears mostly to have been assigned to excerpts combining both major and minor qualities.
A second limitation of the dataset concerns our stimuli. Although sets of preludes form a useful corpus, they were all composed for keyboard and thus fail to represent some aspects of composition, such as texture in works for instrumental or vocal ensemble or orchestra. Also, most pieces had their keys named explicitly in their titles. Although we removed this identifying information from the scores presented to participants, these works may not reflect the composers’ use of mode in other contexts. Therefore, other corpora could lead to somewhat different rubrics.
As this article focuses on evaluating the AMP procedure rather than testing hypotheses about the resultant mode ratings, we separately note two limitations of the procedure. First, AMP requires a significant amount of preparation and time for execution, as well as resources for acquiring materials and compensating participants. Second, it requires a dedicated group of analysts committed to regular meetings. Had any participants ceased their involvement midway through the project, much of its value would have been lost. To that end, scholars interested solely in average ratings might consider using only the first phase of AMP, which would capture many of the benefits of collective analysis (albeit lacking the nuanced information on individual perspectives). We recommend that colleagues interested in AMP reflect upon these trade-offs and decide upon the best use of their time and resources.
Conclusion
In this article we describe an approach to collective music analysis, adapted from best practices for noise reduction in domains involving unverifiable judgments, such as judicial sentencing. Rather than adopting those practices literally, we took care to adapt the general principles of a scientifically grounded procedure to a music-analysis context. We see this approach as essential for domains where both noise reduction and the preservation of individual perspectives (Type A disagreement) are valued. We believe that AMP effectively balances these two considerations; in this proof of concept it led not only to a rich dataset but also to rubrics and vignettes aligned with those data, making it suitable for both quantitative and qualitative analysis.
In conclusion, we note that disputes among theorists (e.g., the often-heard remark “I don’t hear it that way”) are generally presumed to reflect disagreements in perspective (what we call Type A disagreement) rather than errors or oversights (Type B disagreement). Although there are precedents for taking a qualitative approach to exploring the existence of disagreement (e.g., Bergé et al., 2009; Forte, 1965), to the best of our knowledge this is the first study of music analysis quantitatively exploring these types of disagreement. Distinguishing between Type A and Type B disagreement is difficult, if not impossible, without meaningful collaboration. Therefore, our study lends credence to the view that collaborative efforts can offer new insights through “projects that may exceed the capacities of a single individual” (Society for Music Theory [SMT], 2018), and we hope others will find this project and the detailed information in the appendices and supplementary materials useful in their own work.
Supplemental Material
sj-docx-1-msx-10.1177_10298649251385727 – Supplemental material for Analysis from multiple perspectives (AMP): Applying decision hygiene to analysis of musical structure, by Max Delle Grazie, Cameron J Anderson, Jonathan De Souza and Michael Schutz, in Musicae Scientiae
sj-docx-2-msx-10.1177_10298649251385727 – Supplemental material for Analysis from multiple perspectives (AMP): Applying decision hygiene to analysis of musical structure
Acknowledgements
The authors would like to thank Jordan McClean and Jamie Ling for their contributions to score and stimuli preparation, and Konrad Sweirczek for helpful insight on disagreement in music analysis. This study was inspired to a great extent by
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was conducted in the Don Wright Faculty of Music through a Visiting Research Chair award from Western University in Fall of 2022 in conjunction with support from a Social Sciences and Humanities Research Council of Canada (SSHRC) Insight Grant to MS. CJA is supported in part by funding from the Social Sciences and Humanities Research Council.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
