Abstract
A persistent concern in music performance assessments is the quality of the ratings that judges assign. Differences in rater judgment are most concerning when student performances are scored by different raters (i.e., disconnected rating designs), which brings into question the comparability of scores across raters and students. We used data from a formal solo music performance assessment to demonstrate and explore how different data collection designs and a statistical adjustment procedure affect the estimates and rank-ordering of student performances. Our results indicated notable discrepancies in conclusions about individual student performances across conditions in which all raters scored all students, designs with no common performances between raters, designs that included overlapping performances between raters, and a post hoc adjustment procedure applied to disconnected designs. We discuss the implications of our results for the design and interpretation of music performance assessments.
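To make the comparability problem concrete, the following is a minimal simulation sketch, not the authors' procedure or data. It assumes a simple additive rating model (score = student ability minus rater severity plus noise); all quantities, sample sizes, and distributions are hypothetical. It contrasts a fully crossed design, where rater severity averages out, with a disconnected design, where each rater scores a distinct block of students and severity is confounded with those students' scores.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(42)
n_students, n_raters = 60, 6
ability = rng.normal(0, 1, n_students)    # true performance quality (hypothetical)
severity = rng.normal(0, 0.8, n_raters)   # rater harshness varies across judges

def score(s, r):
    # Observed rating under an assumed additive model with random error.
    return ability[s] - severity[r] + rng.normal(0, 0.3)

# Fully crossed design: every rater scores every student,
# so averaging over raters cancels out differences in severity.
crossed = np.array([[score(s, r) for r in range(n_raters)]
                    for s in range(n_students)])
crossed_means = crossed.mean(axis=1)

# Disconnected design: each rater scores a distinct block of students,
# so a rater's severity is absorbed into that block's scores.
blocks = np.array_split(np.arange(n_students), n_raters)
disconnected = np.empty(n_students)
for r, block in enumerate(blocks):
    for s in block:
        disconnected[s] = score(s, r)

rho_c, _ = spearmanr(ability, crossed_means)
rho_d, _ = spearmanr(ability, disconnected)
print("Rank agreement with true ability (Spearman rho):")
print(f"  fully crossed: {rho_c:.3f}")
print(f"  disconnected:  {rho_d:.3f}")

In this toy setup, students assigned to a harsher rater in the disconnected condition are systematically ranked lower regardless of their actual performance, which is the confound that overlapping (linking) performances and post hoc statistical adjustments are intended to address.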
