
Abstract

Machine learning and artificial intelligence (AI) are increasingly used within organizational research and practice to generate scores representing constructs (e.g., social effectiveness) or behaviors/events (e.g., turnover probability). Ensuring the reliability of AI scores is critical in these contexts, and yet reliability estimates are reported in inconsistent ways, if at all. The current article critically examines reliability estimation for AI scores. We describe different uses of AI scores and how this informs the data and model needed for estimating reliability. Additionally, we distinguish between reliability and validity evidence within this context. We also highlight how the parallel test assumption is required when relying on correlations between AI scores and established measures as an index of reliability, and yet this assumption is frequently violated. We then provide methods that are appropriate for reliability estimation for AI scores that are sensitive to the generalizations one aims to make. In conclusion, we assert that AI reliability estimation is a challenging task that requires a thorough understanding of the issues presented, but a task that is essential to responsible AI work in organizational contexts.
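The parallel test point can be illustrated with a small simulation. This is a minimal sketch under classical test theory assumptions that are not taken from the article: a standard-normal true score, independent normal errors, and hypothetical error standard deviations (0.5 for the AI score, 1.0 for the established measure). The names `pearson` and `simulate` are illustrative only.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=20000, ai_error_sd=0.5, measure_error_sd=1.0, seed=0):
    """Compare a parallel-forms correlation with an AI-vs-measure correlation.

    All parameter values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0, 1) for _ in range(n)]

    # Two parallel AI scorings: same true score, equal error variance.
    ai_a = [t + rng.gauss(0, ai_error_sd) for t in true_scores]
    ai_b = [t + rng.gauss(0, ai_error_sd) for t in true_scores]

    # Established measure: same true score, but a different (larger)
    # error variance, so it is NOT parallel to the AI score.
    measure = [t + rng.gauss(0, measure_error_sd) for t in true_scores]

    return {
        # Reliability = true variance / observed variance, with var(T) = 1.
        "true_ai_reliability": 1 / (1 + ai_error_sd ** 2),
        "parallel_corr": pearson(ai_a, ai_b),
        "ai_vs_measure_corr": pearson(ai_a, measure),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed values the AI score's reliability is 0.80. The correlation between the two parallel AI scorings should land near that value, whereas the correlation with the noisier established measure should land near 0.63 (the square root of the product of the two reliabilities), so treating it as a reliability estimate would understate the AI score's reliability.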

Keywords

machine learning, artificial intelligence, natural language processing, reliability, psychometrics

Reliability Evidence for AI-Based Scores in Organizational Contexts: Applying Lessons Learned From Psychometrics
