Abstract
Sound measurement is at the foundation of psychological science. In the past century, development of the construct-validation process has propelled the field to meaningful theory testing (Clark & Watson, 2019; Strauss & Smith, 2009). Yet most of what is known about psychological constructs relies on self-report measurement, which has weaknesses such as socially desirable responding, overreporting and underreporting, cultural and retrospective biases, and limitations in self-insight (e.g., Paulhus & Vazire, 2007). Psychologists strive to incorporate multimethod assessment into research and practice (APA Task Force on Psychological Assessment and Evaluation Guidelines, 2020) because it increases the validity of assessment (Hopwood & Bornstein, 2014; Meyer et al., 2001). However, actual use of multimethod assessment is rare, in large part because it can be burdensome and time-consuming. Advances in artificial intelligence (AI), in particular, large language models (LLMs), provide opportunities to incorporate, improve, and facilitate multimethod psychological assessment.
Language as an assessment tool has several strengths. It is behavioral, providing a more objective approach to assessment, and it can be natural, providing ecological validity to assessment, thereby avoiding some of the inherent limitations of self-report questionnaires. It is also rich, allowing individuals to express themselves in ways that break free of traditional rating scales (Kjell et al., 2024). Using language to study psychological constructs has already greatly expanded understanding of them (Pennebaker et al., 2003). With extraordinary recent advances in technology, language will likely continue to expand knowledge of psychological characteristics at a rapid pace.
There are also more practical advantages to using language as an assessment tool. First, language as an assessment tool is scalable. Validated LLM tools could be more easily implemented into routine research and clinical activities that involve speech, supplementing self-report assessments and saving time and resources for both participants/patients and researchers/clinicians. LLM psychological-assessment tools could also greatly enhance assessment coverage. For example, well-developed LLM-based tools may assess a wide variety of psychological constructs from a single language sample, whereas comparable coverage through questionnaires may take many hours of completion time. In emergencies or particularly low-resource situations, validated LLMs might provide assessments from language when no other assessment would be available.
The goal of this overview is to provide an accessible guide for psychologists to use LLMs to assess psychological constructs through language. We first present the history, significance, and development of the transformer-based LLM; explore the experimental-design process; and consider important issues related to LLM ethics, implementation, and future directions. We also present helpful techniques, tools, and code. Included on the accompanying GitHub page are a coding-based tutorial on using LLMs for psychological assessment and files containing specific code examples for applying the techniques we describe in our second section. Although we strive for an introductory level of description, we use many machine-learning terms that are essential for understanding and working with LLMs. For that reason, we also include a glossary (Table 1). Table 1 includes definitions and useful software packages in which certain procedures can be performed.
Glossary
Note: LLM = large language model; NLP = natural language processing.
Development of Transformer-Based LLMs
Language is central to human identity. Psychologists have long been interested in the relevance of human-language expression for understanding a person (Sanford, 1942). The ability to use language to assess psychological constructs was significantly bolstered by the development of word-counting programs (Pennebaker & King, 1999). “Dictionaries,” the backbone of this technique, use scoring rules derived from expert ratings of words to score psychological constructs from text. This method is also known as a “bag of words” approach. The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2003) uses a bag-of-words approach to count word use in text documents and score psychological constructs. It began as a simple text-analysis program and has been refined continually, with new versions released since its inception (Boyd et al., 2022). LIWC provides scoring of various emotion- and cognitive-process categories in addition to grammatical and language-use categories from text. It has become the most influential text-analysis program in psychology, demonstrating the ability to shed light on attention, emotion, social, thinking, and personality processes from language (Tausczik & Pennebaker, 2010).
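To make the dictionary approach concrete, the core scoring logic can be sketched in a few lines of Python. The word lists below are toy examples for illustration, not actual LIWC categories, which contain hundreds of expert-rated words.

```python
import re

# Toy dictionaries; real LIWC categories contain hundreds of expert-rated words.
DICTIONARIES = {
    "positive_emotion": {"happy", "love", "great", "joy"},
    "negative_emotion": {"sad", "hate", "awful", "angry"},
}

def bag_of_words_scores(text):
    """Score each category as the percentage of total words it matches."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {
        category: 100 * sum(w in lexicon for w in words) / total
        for category, lexicon in DICTIONARIES.items()
    }

scores = bag_of_words_scores("I love my family but I hate being sad")
```

Note that word order and context play no role here: each word is scored in isolation, which is exactly the limitation that later approaches address.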
However, early statistical language-processing efforts struggled with language tasks because human language can be ambiguous, with rule exceptions and meaning changes across contexts (Johri et al., 2021; Khurana et al., 2023). Initial models had a finite set of rules and inflexible decision-making algorithms and were unable to understand linguistic nuances. Furthermore, it was impossible to write rules and meanings for every scenario.
“Word embeddings” became an important solution (Almeida & Xexéo, 2019). Word embeddings are lists of numeric values (i.e., word vectors) that represent the meaning of words across multiple dimensions, capturing semantic and syntactic connections between words. Early models used two main strategies to generate word embeddings: (a) Prediction-based models (e.g., Word2vec; Mikolov et al., 2013) generate word embeddings by predicting a target word from context words (i.e., words immediately surrounding it) or by predicting context words from a target word. (b) Count-based models (e.g., GloVe; Pennington et al., 2014) generate word embeddings through counting global word co-occurrence in a text body. These early embedding models drastically improved the ability of computer programs to understand language, but the embeddings were static—that is, each word had only one embedding (Almeida & Xexéo, 2019). This was a problem because words with changing, context-dependent meanings would have the same word embeddings regardless of how the word was used in a particular instance.
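The counting step behind count-based models can be illustrated with a minimal sketch. GloVe then factorizes such global co-occurrence statistics into dense, low-dimensional vectors; that factorization step is omitted here.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
```

Words that appear in similar contexts accumulate similar count profiles, which is the statistical signal that embedding models compress into vectors.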
Recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) are deep-learning neural-network architectures for natural language processing (NLP). Although they process static word embeddings, they update with word context, appropriately mapping words to different possible meanings based on surrounding words (Johri et al., 2021; Khurana et al., 2023). RNNs and LSTMs improved model performance because they are better at maintaining accuracy across changing contexts. These models no longer followed predetermined rules and instead developed dynamic algorithms for decision-making that could update with greater exposure to language samples (Johri et al., 2021). Although these updated models outperformed previous methods, they required exposure to large amounts of data to learn words in different contexts. RNN and LSTM models are also limited in efficiency because they process language one word at a time, leading to long training times. These models require significant computational resources and still struggle to maintain understanding of word context over text that is longer than one sentence (Min et al., 2023; Vaswani et al., 2017).
The transformer-model architecture, which was the foundation for the development of LLMs, can provide a context-specific, quantitative representation of language (Vaswani et al., 2017). The transformer was a significant advance largely because of its unique “self-attention” mechanism. Self-attention allows the model to process all words in relation to all other words in a text sample simultaneously, as opposed to older methods that used sequential attention (Fig. 1). Sequential attention could lead to information buildup and forgetting of information that came earlier in a text sample. In the transformer, because all words communicate with each other directly, relations between words can be more accurately captured and retained across longer lengths of text.

(a) Method used by previous natural-language-processing models to process text. Each word is processed individually; the model would initially perceive that the pronouns refer to “dinner” before processing is complete. (b) Method used by transformers’ self-attention mechanism to process text. Pronoun references are clearly understood.
A transformer model is a deep-learning model that generally consists of “encoders” and “decoders” (Vaswani et al., 2017). But transformer models can vary in their composition of encoders or decoders. Both encoders and decoders consist of self-attention “layers” that help transformers generate contextualized representations of input text (i.e., how the tokens in the text relate to one another). Encoders consist of a self-attention layer and a feed-forward neural network. Input first goes through the self-attention layer, in which relationships between each token and every other token in the sentence are learned. Multiple layers of encoders with similar architecture can be “stacked,” meaning input is processed through multiple encoder layers sequentially, which allows the model to capture more complex patterns. A decoder processes the output from the encoder and also has attention and feed-forward neural-network layers. The decoder’s attention layer is referred to as an “encoder-decoder attention” layer, and it helps the decoder focus on relevant parts of the input sequence from the encoder. Because the decoder is used to generate text, its self-attention layer uses masking, in which the tokens on the right side of the sequence are masked so that the decoder cannot see future words of the sentence it is learning to generate. This prevents the model from knowing future tokens and constrains it to focusing only on preceding tokens to generate new text. Similar to encoders, decoders can also be stacked.
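A toy sketch of scaled dot-product self-attention, including the decoder-style causal mask described above, may help clarify the mechanism. This is a simplification: real transformers use learned query, key, and value projection matrices and operate on high-dimensional tensors, whereas here queries, keys, and values all equal the raw input vectors.

```python
import math

def softmax(xs):
    """Convert raw scores into attention weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Scaled dot-product self-attention over a list of token vectors.
    With causal=True, each token can attend only to itself and to
    preceding tokens (decoder-style masking of future tokens)."""
    d = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # mask: future token is invisible
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        weights = softmax(scores)
        # Each output is a weighted average of all (visible) token vectors.
        outputs.append([sum(w * v[t] for w, v in zip(weights, vectors))
                        for t in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, causal=True)
```

Because every token's score against every other token is computed directly, no information has to survive a long sequential chain, which is why transformers retain context over longer texts.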
The advances in contextual understanding and speed provided by the transformer architecture enabled the creation of LLMs. The transformer allows the processing of massive amounts of data for training, often from online repositories. Initial transformer models were developed with a variety of data sets that were large for the time (Hadi et al., 2024). For example, Google’s Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) was pretrained on English Wikipedia and BooksCorpus (11,038 free books from the web). This training process allowed the development of a model with millions of parameters for identifying words and thousands of embeddings, which gives the model a general understanding of language.
In subsequent years, advances in computing resources have enabled the size of language models to grow (Hadi et al., 2024). Whereas initial transformer models were trained with millions of parameters (totaling less than 200 GB of storage), models are now being trained on hundreds of billions of parameters (requiring more than 7 TB of storage), resulting in more powerful and versatile language models. Although the term “LLM” is formally used to describe these newer, larger models, in this article, we use “LLM” to include the initial transformer models as well.
Broadly, transformer-based models can be divided into three types: encoder-only, decoder-only, and encoder-decoder. Tasks in which input text needs to be understood to generate output text require encoder-decoder architecture, for example, language translation (translating text from one language to another), summarization (distilling texts to only the main points), reformatting language (e.g., speech to text), and question answering. Here, an encoder changes the input text into a numerical representation while considering the context of the text, and a decoder uses that numerical representation to generate the output text one token at a time (this is also called “autoregressive text generation”). LLMs such as T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformer) are encoder-decoder models.
Encoder-only models are used in scenarios focused on understanding input text to perform tasks such as text classification (sorting language into categories), named entity recognition, sentiment analysis, or retrieval tasks. Models such as BERT, RoBERTa (Y. Liu et al., 2019), and DeBERTa (He et al., 2020) fall under this category. Decoder-only architectures are popular and used for generative tasks in which responses are predicted one token at a time. This architecture is used for large-scale generative models such as GPT (Generative Pretrained Transformer). Decoder-only models pretrained on large text corpora can perform generative tasks such as summarization, question answering, and sentence completion. Indeed, most transformer models can be used for more than one language task.
Experimental-Design Process
LLMs show immense promise for psychological research and measurement, yet using these models remains complex and is often made more difficult by a lack of documentation for specific uses. In this section, we outline the experimental-design process from start to finish and identify relevant considerations at each step. This overview emphasizes details specific to NLP and the use of LLMs, but an understanding of general machine learning (ML) is also required for carrying out such analyses. Although brief definitions of relevant terms are included in Table 1, we also recommend helpful articles to explore these topics in more detail. Figure 2 presents a road map of the process. In each section, we discuss relevant concepts and decision-making considerations and provide examples from different areas of psychology and a continuous working example from research assessing Big Five/Five-Factor Model (FFM) personality traits from interview language (Oltmanns et al., 2025).

Overview of experimental-design process. (a) Data collection, (b) language conversion, (c) text preprocessing, (d) LLM technique, (e) LLM selection, (f) model evaluation, (g) model training considerations, and (h) model visualization.
In our working example, a representative community sample of 1,409 older adults was recruited from the St. Louis, Missouri, area. The mean age of the sample was 59.5 years; 54.5% identified as female, 65% identified as White, 32.7% identified as Black/African American, 2.3% identified as other, and 1.7% reported Hispanic/Latino descent. Participants completed life-narrative interviews in which they were asked to divide their adult life into three or four chapters, title their chapters, and then briefly describe those chapters. Next, participants were asked about high and low points, best and worst characters, and a turning point in their life story. Interviews lasted about 20 min, on average. Participants then completed the self-report NEO-Personality Inventory–Revised (NEO-PI-R; Costa & McCrae, 1992), from which five broad personality trait domains were scored (neuroticism, extraversion, openness, agreeableness, and conscientiousness). The NEO-PI-R scores were used to train language models of personality from the life-narrative interviews.
Throughout this section, we include “pseudocode” to show how Python code may be used to complete certain steps. Each pseudocode block is numbered with sequential steps for a given task. The names of snippets containing actual Python code that accompany the pseudocode are included in parentheses next to the pseudocode titles; the snippets are located in the accompanying GitHub repository (https://github.com/mehak25/Intro-to-LLM). This repository also includes a hands-on coding tutorial on using LLMs.
The first decision in the process is what the researcher would like to ultimately predict or classify. This should influence data collection (Fig. 2a). In our working example, we collected life-narrative interviews to examine whether individuals’ patterns of language use in storytelling could reliably predict personality traits. At each stage of Figure 2, there are important decisions to be made.
Data collection
We focus on several forms of natural language that show promise for psychological assessment with NLP (Fig. 2a). Each type of language data has its own unique strengths and limitations in the data-analysis pipeline. Ideally, multiple forms may be used in tandem to provide a more robust estimate and understanding of a psychological construct. Model-prediction accuracy is heavily influenced by the quality and quantity of the data, making data collection and data preprocessing one of the most important considerations before the analysis process (Demszky et al., 2023). Note that models trained on one language-sample type may not apply well to other language-sample types, and this will be a critical area of investigation in the future (cf. Chekroud et al., 2024).
First, language may be collected through prompts, for example, recording verbal responses to prompted questions or written tasks. The process for collecting prompted language can be self-administered, allowing participants to complete tasks without researchers present and potentially in more comfortable locations. Collecting a sufficient amount of language from prompts can be difficult. Strategies such as carefully planned open-ended or multipart questions, explicit follow-up prompts, and timers can help encourage continued speech.
Second, interviews capture an authentic and targeted language exchange between individuals. This may include job interviews, clinical interviews, group interviews, or life-narrative interviews. It can be helpful to consider what kind of language would be most useful for future modeling purposes and how to encourage it in the interview. One potentially important downstream consideration for analysis is separating the interviewer and interviewee in audio files. It can be beneficial to record with multiple microphones, which makes it easier to separate the speakers downstream.
Third, social media posts are the most common form of natural written language used for NLP (Chancellor & De Choudhury, 2020). Social media data often provide large sample sizes of short texts, typically containing multiple status updates per user. Both Facebook and Twitter/X provide application programming interfaces (APIs) to download large amounts of text.1 APIs are tools that provide access to complex software programs or systems. Social media status language is unique—there may be topics an individual is more or less likely to post publicly about, and many participants and patients do not use social media, which will affect the validity of the assessment.
Fourth, ambulatory methods are used to collect more naturalistic and ecologically valid language data in everyday life (Mehl, 2017; Trull & Ebner-Priemer, 2013). Ambulatory recordings have several advantages: They can (a) have high ecological validity, (b) be collected multiple times over the course of a day, and (c) capture emotions and behaviors in real time (Lazarević et al., 2020). Ambulatory recordings are often implemented using smartphones, smartwatches, or other wearable recording devices. The Electronically Activated Recorder (Mehl, 2017) is available as a smartphone application that passively records speech in a naturalistic environment. If ambulatory data collection is active, it can be burdensome and uniquely difficult to collect.
Fifth, electronic health records (EHRs) are secure digital copies of patient charts including clinical notes from different settings, test results, and diagnoses. Extracting language data from EHRs can provide information on clinical treatment, professional opinion, and testing history that may reveal a significant amount about psychological functioning. For example, Y. Liu et al. (2023) used language models to find stigmatizing language in clinical notes to understand physician bias in patient assessment. LLMs can classify social determinants of health and behavioral-health data from clinical notes in EHRs (Englhardt et al., 2024; Milligan et al., 2024).
Sixth, LLMs can assess language that was written by psychologists for professional purposes, for example, questionnaire items, vignette text, language from formal psychological-testing measures, clinical diagnostic criteria and symptom descriptions, intervention scripts, and research articles. This language can inform the test-development process and support examination of coherence between clinical and assessment materials and human responses to these materials. Although clinical and assessment language is not natural language, use of LLMs to improve these materials and our understanding of them is promising.
Language conversion
In this section, we discuss several important considerations in processing audio or image files into text files for downstream NLP (Fig. 2b).
Audio processing and transcription
After data collection, raw language samples need to be converted into formats better suited for analysis. This commonly includes transcribing speech samples from audio files to text but could also be reformatting digital language or transferring handwritten language to digital formats (Subramani et al., 2020). Conversion can be completed manually or through automated processes. Automatic speech recognition requires much less time and fewer financial resources but is more likely to contain errors. Options include tools such as OpenAI’s Whisper, Google’s Speech-to-Text, and Microsoft’s Azure. The accuracy of automatic-transcription tools has improved dramatically in the past few years (Spiller et al., 2023). These tools can be used on premises (i.e., implemented on a secure server at the researcher’s institution), which is essential if the sample contains confidential information.
Speaker diarization
When speakers on the same audio track need to be separated, this is called “speaker diarization.” Open-source diarization tools include SpeechBrain (Ravanelli et al., 2021), pyannote (Bredin et al., 2020), and WhisperX (Bain et al., 2023). These tools extract speech features from the audio signal and then use deep-learning models to differentiate between speakers based on unique voice characteristics (e.g., variations in pitch, volume, vocal-cord vibration). Diarization is still a difficult task to automate and often has errors, so manual review may be necessary.
Text preprocessing
Successful transcription produces text files. However, further text preprocessing may be needed, for example, to isolate language of interest or match the expected text formatting of an LLM (Fig. 2c).
Text isolation
Researchers might be interested in isolating language from one person. In interview transcriptions, speaker labels (e.g., “interviewer:”; “interviewee:”) are helpful. An example of speaker isolation is included in the GitHub repository. Other creative strategies can be helpful: For example, if the interviewer and interviewee need to be separated, the speaker who asks (or answers) more questions may be used as a proxy to identify them. Furthermore, speakers of interest may be identified by their use of specific words, phrases, or topics that they may be more likely to use.
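As a minimal illustration of speaker isolation, turns in a labeled transcript can be collected with a few lines of Python. This sketch assumes a "speaker: utterance" line format; the actual example in the GitHub repository may differ.

```python
import re

def isolate_speaker(transcript, speaker="interviewee"):
    """Collect all turns attributed to the given speaker label.
    Assumes each line is formatted as 'speaker: utterance'."""
    turns = []
    for line in transcript.splitlines():
        match = re.match(r"\s*(\w+)\s*:\s*(.*)", line)
        if match and match.group(1).lower() == speaker:
            turns.append(match.group(2))
    return " ".join(turns)

transcript = """interviewer: Tell me about your first chapter.
interviewee: It begins when I moved to St. Louis.
interviewer: What happened next?
interviewee: I started my first job."""

text = isolate_speaker(transcript)
```

Real transcripts are messier (inconsistent labels, crosstalk, diarization errors), so manual spot-checking of the isolated text is advisable.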
De-identification
Language data may contain confidential information that should either be de-identified or analyzed on a secure local server (Hoory et al., 2021). Named entity recognition (NER) is an NLP technique that can de-identify text samples by locating predetermined categories of words or phrases (e.g., names, locations, dates) in text; several open-source NER packages can perform this task.
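As a simple complement to NER-based tools, clearly structured identifiers can be redacted with regular expressions. The patterns below are illustrative only and would miss many identifier formats; validated de-identification pipelines should be preferred for real data.

```python
import re

# Illustrative patterns only; production de-identification should use a
# validated NER pipeline rather than hand-written regular expressions.
PATTERNS = {
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each matched identifier with a category placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

clean = redact("Seen on 3/14/2023; callback 314-555-0199.")
```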
Stop words
Stop words are commonly used words (e.g., “a,” “the,” “is,” “in”) that have traditionally been removed from language samples during preprocessing because their widespread use provided little unique information. LLMs capture contextual information from language, so they tend to work best when stop words are preserved (Shekhar et al., 2024), including contractions and all word forms and tenses.
Tokenization
Tokenization is the process of breaking down raw text into smaller units called “tokens,” which serve as input into the LLM (W. X. Zhao et al., 2023). Tokenizers split text into words and meaningful subword units. For example, the word “wind” remains [wind] after tokenization. However, “windsurf” becomes [wind] and [##surf], and “windsurfer” becomes [wind], [##surf], and [##er]. Tokenization strategy varies by LLM, but most NLP packages make it easy to do. A brief code example of tokenizing a text with the Hugging Face transformers library is shared in the GitHub repository (as “tokenizer.py”).
Pseudocode (GitHub file: tokenizer.py):
Initialize pretrained tokenizer.
Loop over each word in your text to encode it into a token.
Add special tokens such as [CLS] to mark the beginning of the text and [SEP] to mark the separation between sentences.
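The subword behavior described above (e.g., “windsurfer” becoming [wind], [##surf], [##er]) can be mimicked with a greedy longest-match-first sketch over a toy vocabulary. Real WordPiece tokenizers learn vocabularies of roughly 30,000 subwords from large corpora; the three-entry vocabulary here is purely illustrative.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).
    Continuation pieces are prefixed with '##', as in BERT's tokenizer."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:  # take the longest known piece at this position
                tokens.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known subword covers this span
    return tokens

vocab = {"wind", "##surf", "##er"}
pieces = wordpiece_tokenize("windsurfer", vocab)
```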
LLM techniques for psychological assessment
Feature extraction, fine-tuning, and prompt engineering are three primary ways to use LLMs for psychological assessment (Fig. 2d). Each is explained in detail below, along with example applications in different areas of psychology. The flowchart in Figure 3 may help guide the decision of which technique to use.

Flowchart of techniques for using large language models for psychological assessment. Striped line = optional.
Feature extraction
One straightforward application of LLMs is to obtain contextualized embeddings from an input text. Unlike static embeddings, contextualized embeddings vary depending on how words appear in a sentence, thus capturing nuanced meaning specific to a given context. These embeddings can then be used in downstream analyses (Hussain et al., 2023).
For example, Wulff and Mata (2023) used an LLM to extract contextualized-embedding features from the item language of multiple personality questionnaires. Results indicated that feature extraction can be useful for examining construct validity: Some questionnaires may claim to measure the same construct even though the embedding features show discrepancies (e.g., “jingle fallacy”), and other questionnaires may claim to measure different constructs even though the embedding representations are similar (e.g., “jangle fallacy”). In addition, feature extraction has been used to support the validity of personality structure: LLM word embeddings related to personality show similar factor structure to that from previous research with human ratings (Cutler & Condon, 2023). Correlations were even stronger for the LLM embeddings than the previous ratings data, indicating LLMs may be an effective way to explore personality. Abdurahman et al. (2024) used contextualized embeddings from a pretrained LLM to represent the semantic meaning of items from self-report personality questionnaires. They then used these embeddings to predict individuals’ scores on previously unseen personality items based on linguistic similarity.
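Such embedding-based comparisons typically rely on cosine similarity between item vectors. The following is a minimal sketch with made-up three-dimensional vectors; real contextualized embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical item embeddings: two items claimed to measure the same
# construct (a jingle check) should show high similarity; a low value
# despite a shared label would suggest a jingle fallacy.
item_a = [0.8, 0.1, 0.3]
item_b = [0.7, 0.2, 0.4]
similarity = cosine_similarity(item_a, item_b)
```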
The following pseudocode demonstrates how to obtain and save embeddings after feeding text data through a pretrained LLM. Once generated, the embeddings can be used as conventional predictor variables in traditional regression models (e.g., linear regression).
Pseudocode (GitHub file: save_embeddings.py):
For each text window, tokenize the text and then pass it through the model.
Retrieve the output of the last layer of the model. The output of the language model is three-dimensional (number of samples, number of tokens, embedding size).
Obtain document-level embeddings by either taking the mean of embeddings across all tokens or extracting the embedding associated with the first token (typically the [CLS] token).
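Step 3 of the pseudocode, mean pooling, amounts to averaging each embedding dimension across tokens, as in this toy sketch (3 tokens and 4 dimensions here; real LLM embeddings have hundreds of dimensions per token).

```python
def mean_pool(token_embeddings):
    """Collapse a (tokens x dims) matrix into one document-level embedding
    by averaging each dimension across all tokens."""
    n_tokens = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens
            for d in range(dims)]

# Toy model output for one document: 3 tokens, 4 embedding dimensions.
tokens = [[1.0, 2.0, 0.0, 4.0],
          [3.0, 2.0, 2.0, 0.0],
          [2.0, 2.0, 1.0, 2.0]]
doc_embedding = mean_pool(tokens)
```

The resulting document-level vector can then be entered as a set of predictor variables in a conventional regression model.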
Fine-tuning
The process of further training pretrained LLMs with more specific data is called “fine-tuning.” During fine-tuning, model weights are updated to reflect domain-specific language (e.g., language of interest to the psychologist) and adapt model decisions to best fit a specific task (e.g., predicting or scoring a personality trait, as in our working example). During fine-tuning, model weights can be updated either for the whole model or a partial model, depending on how much computational power is available for training (“training cost”). Fine-tuning is helpful for creating specialized models without the burden of needing very large, labeled data sets (Chae & Davidson, 2023; Demszky et al., 2023). To reduce the training cost of fine-tuning, a few samples can be used to update model weights, which can be called “few-shot fine-tuning.” Few-shot fine-tuning is powerful because performance can be improved with far less data than would be required to train a model from scratch. There are some challenges with fine-tuning. First, it is especially important to have high-quality labeled data: The construct validity of the label measurements will be critical because the model will only be as good as the labels (i.e., the quality of the assessment of the dependent variables; Chancellor & De Choudhury, 2020). In addition, fine-tuning is computationally expensive, requiring LLMs to be hosted on large servers to run the training cycle.
Fine-tuning is particularly well suited for assessing psychological constructs from language data. During fine-tuning, a pretrained language model’s parameters are updated to reflect how language relates to the construct of interest, including subtle or nuanced patterns that might be imperceptible to human raters (Luxton, 2014). The resulting fine-tuned model can then accurately assess the construct from new, previously unseen language samples. Simchon et al. (2023) fine-tuned a model predicting personality traits from social media posts. The model was able to identify language patterns indicative of FFM personality traits and use that information to predict the personalities of new users. Our working example also uses fine-tuning to predict personality traits from interview language that does not explicitly ask about personality.
Fine-tuning is also being studied in clinical and social psychology. Ohse et al. (2024) used fine-tuning for depression assessment. The researchers fine-tuned BERT and GPT 3.5 using language responses to a depression interview with 12 labeled examples (i.e., interview transcripts with the corresponding depression scores). They evaluated the models using the F1 score, which is the harmonic mean of model precision and recall. Fine-tuned GPT 3.5 outperformed fine-tuned BERT in the prediction of depression from language in interviews (F1 scores = .82 vs. .62). In social psychology, fine-tuning was used to assess political beliefs from social media posts (Gül et al., 2024). GPT 3.5, Llama 2, and Mistral LLMs were fine-tuned to predict user alignment with political figures and stances (e.g., climate change, feminism). Fine-tuned GPT performed the best; F1 scores were all over .80 and at times exceeded .90. In cognitive psychology, LLMs have been used to explore the connection between cognitive abilities and behavior (Hardy et al., 2023). Fine-tuned LLMs may eventually be useful tools for the study of cognitive processes, such as memory, attention, perception, reasoning, and learning.
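For readers unfamiliar with the F1 metric, it can be computed from a classifier's confusion counts in a few lines. The counts below are hypothetical and are not drawn from the cited studies.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for a binary depression classifier:
# 8 cases correctly flagged, 2 false alarms, 2 missed cases.
f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because F1 balances precision against recall, it is more informative than raw accuracy when the classes are imbalanced, as clinical labels often are.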
As model size continues to grow, the cost of traditional fine-tuning continues to increase beyond available resources. This has been addressed through parameter-efficient fine-tuning (PEFT; Lialin et al., 2024). “PEFT” is a broad term referring to any strategy that updates only a small set of model weights (e.g., a subset of existing model weights). PEFT strategies continue to be developed, and many have demonstrated success compared with traditional fine-tuning. For example, Lin et al. (2024) trained a model using PEFT to generate positive alternatives to cognitive distortions, and their PEFT model outperformed other models.
In our working example—fine-tuning a model to predict FFM personality ratings from interview language—embeddings are updated to reflect nuances of the language used by the interviewee, and associations between the language and levels of personality ratings are learned. After training, the model can be used to predict personality ratings from unseen interview language (future research will have to assess how well such models generalize to other types of language samples). See the example fine-tuning code on GitHub and the upcoming Model-Training Considerations subsection.
Prompt engineering
Prompt engineering involves carefully designing input text (“prompts”) to guide the output of LLMs, enabling improved performance without updating or retraining the model weights. It can be used to perform tasks or generate new text. Text prompts can be manually provided by the researchers to pretrained LLMs to either classify input text or perform a specific task. These prompts are considered “hard prompts” because they are specific directives given to the model. Hard prompts are used when output text needs to strictly adhere to certain criteria (e.g., provide a specific assessment score, summarize text, or provide another response). Instructional tokens can also be used in hard prompts by adding them at the beginning of the input sequence. For example, if you want the LLM to provide a factual answer to your question, you can prepend your question with the instructional token “[ANSWER_QUESTION]” (e.g., “[ANSWER_QUESTION] What is the capital of the United States?”). Interacting with LLMs through hard prompting can be simple, intuitive, and less computationally intensive.
Prompting strategies differ based on the amount of available labeled data. Zero-shot prompting provides a model with instructions for a task but no example data or example answers for the model to learn from. One-shot prompting provides the model with instructions and one labeled example before the model completes the task, and few-shot prompting includes multiple labeled examples in the prompt before the model provides its response. The key difference from fine-tuning is that the labeled examples included in prompts do not modify any model parameters. Because model weights are not updated, there is no computational-training cost.
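The only difference among these strategies is how many labeled examples are placed in the prompt text itself. A minimal sketch (the function name, instructions, and example texts are our own hypothetical illustrations):

```python
def build_prompt(instructions, examples, new_text):
    """Assemble a zero-, one-, or few-shot prompt.

    `examples` is a list of (text, label) pairs: an empty list yields a
    zero-shot prompt, one pair a one-shot prompt, several pairs a
    few-shot prompt. The examples live only in the prompt text; no
    model weights are updated.
    """
    parts = [instructions]
    for text, label in examples:
        parts.append(f"Language: {text}\nRating: {label}")
    parts.append(f"Language: {new_text}\nRating:")
    return "\n\n".join(parts)


# Zero-shot: instructions only.
zero_shot = build_prompt("Rate extraversion from 1 to 5.", [], "I love big parties.")

# Few-shot: two labeled examples precede the new case.
few_shot = build_prompt(
    "Rate extraversion from 1 to 5.",
    [("I stayed home all weekend.", 1), ("I talked to everyone at the event.", 5)],
    "I love big parties.",
)
```

The assembled string would then be sent to the model as ordinary input text.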
Another strategy is “soft prompting,” which is most often used in a supervised-learning context. Soft prompting requires model training and therefore may be referred to as prompt “tuning.” The soft prompt is a set of trainable embeddings that are added to the input text. The embeddings from the soft prompt are trained with labeled examples. These embeddings then act like a filter, cuing the model as to what language is associated with the task. Soft prompting is less computationally intensive than fine-tuning because only the added prompt embeddings need to be updated. Peng et al. (2024) compared hard prompting and soft prompting when identifying adverse events and social determinants of health from clinical narratives. Soft prompting performed better than hard prompting, indicating that LLMs can learn better from trainable soft-prompt embeddings than from human-generated hard prompts. Soft prompting reduced computing costs by 97% compared with fine-tuning. However, large models with several billion parameters were required for soft-prompt models to show these benefits.
In prompt-engineering studies, the prompt can vary for each case in the data set to improve results or better study individual differences. For example, K. Yang et al. (2024) used LLMs to assess social attitudes and the propensity to be influenced by social contexts based on demographics (e.g., age, race, location, income, education level). The model performed poorly in zero-shot prompting. However, few-shot prompting that included labeled examples customized to match certain profile features for each individual improved performance.
Overall, prompt engineering allows for model customizations without the same data and resource requirements as fine-tuning, making it quicker (Chae & Davidson, 2023). The most significant concern about prompt engineering is that, in contrast to fine-tuning, model parameters are not updated; psychological applications typically require generalizable, nuanced knowledge about a topic (Demszky et al., 2023). However, as the barriers to fine-tuning continue to grow for the newer, more advanced LLMs (e.g., model size, closed-source), prompt engineering has become an exceedingly popular and effective strategy (Hua et al., 2024).
Prompt engineering has been applied across a variety of psychology domains. In cognitive psychology, GPT-4 predictions were compared with human-memory performance (Huff & Ulakçı, 2024). GPT-4 was prompted to rate the relatedness of pairs of (a) context and (b) garden-path sentences and the memorability of the garden-path sentences. GPT-4 ratings of memorability significantly corresponded with human-memory performance. This indicates LLMs may have utility as cognitive-assessment tools in the future. In personality psychology, zero-shot prompting was employed to assess personality traits from social media posts (Peters & Matz, 2023). The LLM was hard prompted to attend to how personalities were reflected in language from online posts and to provide a numerical rating for each of the FFM personality traits. Results demonstrated moderate effect sizes for predicting personality.
Zero-shot prompting of GPT-3.5 has been used to assess attitudes in social psychology (Simons et al., 2024). Hard prompts were used to obtain GPT ratings on individuals’ attitude certainty, importance, and moral conviction from social media posts. The GPT ratings replicated prior factor-analytic structure and internal-consistency reliability of human-attitude ratings. This study was notable for its adherence to a psychometric construct-validation approach for evaluating LLM-generated ratings based on language.
In clinical psychology, Tu et al. (2024) used zero-shot and few-shot prompting for posttraumatic-stress-disorder (PTSD) assessment from language in clinical interviews. GPT-4 performed best with few-shot prompting, and zero-shot prompting performed best with Llama-2. Predicting several different variable types from several different interview types, GPT-4 was, on average, 10% more accurate than Llama-2, reaching an accuracy of 68%. GPT-4 showed close similarity to human ratings for PTSD-related scale variables and more conservative predictions, whereas Llama-2 consistently overpredicted. Jeon et al. (2024) used a two-step prompting strategy to identify suicide risk from social media posts. In the first step, MentaLlama (Llama, fine-tuned on social media data related to mental health) was assigned an expert identity, provided a dictionary with suicide-related terms, and asked to extract key phrases from the posts. Jeon et al. found that few-shot prompting in Step 1 performed better than zero-shot, so a few labeled examples were added to the prompt. In the second step, a more generic LLM was prompted to summarize key phrases, and multiple summaries were evaluated for consistency. Recall of suicide-related posts was consistently high. Different expert-identity assignments were found to influence the extracted phrases, indicating that prompting LLMs to have different roles may produce different results.
Some research has used both fine-tuning and prompt engineering for psychological assessment. Galatzer-Levy et al. (2023) conducted zero-shot prompting with an LLM that had previously been fine-tuned on sources of medical language. The fine-tuned model was prompted to assess psychiatric functioning from clinical interviews and performed particularly well for depression detection but displayed difficulties with co-occurring diagnoses. Lin et al. (2024) combined and compared tuning and engineering strategies for two tasks in a Mandarin Chinese data set: (a) detecting cognitive distortions (i.e., problem-thinking styles related to depression) and (b) generating positively framed alternatives. Comparison of fine-tuning a pretrained language model versus transfer learning found fine-tuning was more accurate in detecting cognitive distortions. The researchers then compared fine-tuning, prompt tuning (P-tuning, Version 2), and prompt engineering for generating positive alternatives to cognitive distortions. The prompt-tuned model (ChatGLM-6B with soft embeddings) outperformed both the fine-tuned model and prompt engineering at generating positively reframed sentences. These findings suggest that prompt tuning a smaller model can be more efficient than fine-tuning or prompt engineering for generating psychologically meaningful text.
Hard prompts can be provided to the LLM with or without examples (e.g., text and variable score pairs). Below is an example of a hard prompt, which can be enhanced with additional instructions, such as specifying a perspective or task: Language: [include the text here] Based on the text, please rate the level of [construct of interest here] by providing a numerical score [insert scale here].
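For illustration, the template above can be filled programmatically (a hypothetical sketch; the function name and scale wording are placeholders of our own):

```python
def hard_prompt(text, construct, scale="from 1 (very low) to 5 (very high)"):
    """Fill the hard-prompt template with a language sample and a construct."""
    return (
        f"Language: {text}\n"
        f"Based on the text, please rate the level of {construct} "
        f"by providing a numerical score {scale}."
    )


prompt = hard_prompt(
    "I double-check every detail before submitting my work.",
    "conscientiousness",
)
```

The same template can be reused across constructs and language samples by changing the arguments.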
Soft prompts, in contrast, involve prepending trainable embeddings to the model input. The following pseudocode demonstrates how to prepend soft prompts to language input embeddings:
Pseudocode (GitHub file: soft_prompt.py):
Create a soft prompt of the given prompt length and model-embedding size (a set of trainable embeddings).
Prepend the soft prompt to the input of the model.
Train the model using updated input.
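The pseudocode above can be sketched without a deep learning framework to show the mechanics; in practice the soft prompt would be a trainable tensor (e.g., a PyTorch `nn.Parameter`) updated by backpropagation while the base model's weights stay frozen. Function names and dimensions here are hypothetical:

```python
import random


def make_soft_prompt(prompt_length, embed_dim, seed=0):
    """Create the soft prompt: `prompt_length` randomly initialized vectors
    of size `embed_dim` (the embeddings a training loop would update while
    the base model's weights stay frozen)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(prompt_length)]


def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Prepend the soft-prompt vectors to one input's token embeddings."""
    return soft_prompt + input_embeddings


soft = make_soft_prompt(prompt_length=4, embed_dim=8)
token_embeddings = [[0.0] * 8 for _ in range(10)]  # stand-in for 10 token vectors
model_input = prepend_soft_prompt(soft, token_embeddings)
```

The model then processes the concatenated sequence as if the soft-prompt vectors were ordinary token embeddings.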
Processing labels
There are several important considerations for processing psychological-variable labels (i.e., dependent variables) that may be predicted using LLMs for psychological assessment. For details on these considerations, including merging text data with psychological-variable data, scaling of variables, splitting the data set for training and testing, and avoiding data leakage, see the Supplemental Material available online.
LLM selection
Key LLM-selection decision points include their training data, text limits, size (measured in number of parameters and memory required to store the model), usage limits, and model transparency (Fields et al., 2024; Fig. 2e). It is becoming more common for models to have “model cards” that provide this information in an organized fashion (Mitchell et al., 2019). Other important considerations in LLM selection include characteristics of the assessment data, task specifics (e.g., what you want the model to do), and computing resources. Table 2 describes common LLMs, including their training data, model size, and text limits. Parameters are the building blocks of LLMs and include weights, biases, word embeddings, neural-network layers, self-attention mechanisms, and feed-forward neural networks. LLMs are classified as small if they contain fewer than 1 billion parameters, medium if they contain 1 to 10 billion parameters, large if they contain 10 to 100 billion parameters, and very large if they contain more than 100 billion parameters (Minaee et al., 2024).
Potentially Useful Large Language Models for Psychological Assessment
Note: K = thousand; M = million; B = billion; T = trillion; API = application programming interface; BERT = Bidirectional Encoder Representations from Transformers; GPT = Generative Pretrained Transformer.
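The size categories from Minaee et al. (2024) amount to a simple rule, sketched here as a hypothetical helper (counts are in raw parameters):

```python
def size_category(n_parameters):
    """Classify an LLM by raw parameter count (Minaee et al., 2024)."""
    if n_parameters < 1e9:
        return "small"
    if n_parameters < 10e9:
        return "medium"
    if n_parameters < 100e9:
        return "large"
    return "very large"
```

For example, a 110-million-parameter BERT-base counts as small, whereas a 175-billion-parameter GPT-3 counts as very large.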
Google’s BERT (Devlin et al., 2019) is one of the earliest and most frequently used LLMs. BERT is a small, encoder-only model best suited for tasks requiring understanding of full-text sequences, such as text classification or named-entity recognition (NER). Additional BERT-based models continue to be developed, such as RoBERTa (an optimized version of BERT using more training data and a longer training time, among other training improvements; Y. Liu et al., 2019), DistilBERT (a slimmer, faster version of BERT; Sanh et al., 2019), and XLNet (which uses a generalized autoregressive pretraining objective; Z. Yang et al., 2019). Although the term “LLM” generally does not include the initial transformer models mentioned previously, they remain a great option because of modest computing requirements and optimization for text classification.
GPTs (Achiam et al., 2023) are a family of decoder-only models by OpenAI that marked the transition to formal LLMs. These are very large models, containing more than 175 billion parameters, that are behind ChatGPT. Although prior GPT models have been publicly released, the most advanced models may be unavailable to the public. However, some can be fine-tuned through APIs. Another family of LLMs is the Llama family by Meta (Touvron et al., 2023). Llama models range in size from medium to large and are open-source, meaning the model weights are available to the research community (Minaee et al., 2024). For more information about the structure of specific models, performance comparisons, and training considerations, see Minaee et al. (2024), Naveed et al. (2024), and W. X. Zhao et al. (2023).
LLMs are becoming increasingly accessible. Hugging Face is an open-source community that provides tool access (Hussain et al., 2023). Hugging Face has two main components: first, an online repository that stores trained language models, information regarding model performance, publicly available data sets, and detailed tutorials and, second, a series of Python libraries that provide simplified code to access transformer models, tokenizers, and optimization tools. In addition, Hugging Face stores domain- and task-specific models previously created by others that are open to the public, for example, BERT-based classification models trained on social media posts to identify sentences discussing anxiety or depression, Llama-based chatbots trained to provide empathic support and resources about mental-health treatment, and RoBERTa-based models fine-tuned on PubMed articles.
Maximum sequence length
LLMs have varying maximum sequence lengths—also called “context windows”—which limit the number of tokens that can be input into the model at one time. If the token limit is exceeded, the text input will be truncated at the token limit, potentially cutting off important information. Some earlier models, such as BERT, have relatively short limits (e.g., 512 tokens, which is around 400 words), whereas models such as GPT and Llama support context windows of several thousand tokens. Recently, some models have pushed these limits upward of 200,000 tokens (e.g., Claude; Anthropic, 2024). Although larger context windows may improve performance on long texts, they also significantly increase computational cost and memory requirements, leading to less common use in applied research to date (Y. Ding et al., 2024).
Currently, there are multiple other strategies to process longer texts (see Fig. 4). (a) Truncate the text (i.e., discard all text that is beyond the token limit). This is the default strategy, so if long texts are not managed in other strategic ways, models will automatically truncate texts. (b) Trim the text (i.e., select portions of the original text to stay under the token limit). Research has shown that performance is better when tokens are selected from throughout the document rather than simply truncating (Tuteja & Juclà, 2023). (c) Chunk the text. “Chunking” splits the text into blocks that are each within the token limit. For example, if the token limit is 512 and there are 1,536 tokens total, chunking would split the original long text into three chunks. The chunks can then be input to the model separately, and the results are averaged across them. (d) Use a “sliding window” approach. In a sliding-window approach, the original text is split into blocks that are below the token limit, but the blocks contain overlapping text that is referred to as the “stride.” This overlap helps preserve the context across chunks but will increase training time.

Strategies for handling long text. (1) Truncate text that is longer than the token limit. (2) Trim selected text so it is shorter than the token limit. (3) Chunk the long text into segments the same length as the token limit. (4) Split long text into segments shorter than the token limit, with each segment overlapping.
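Strategy (c), chunking with averaged results, can be sketched as follows (`predict_chunk` is a hypothetical stand-in for a model call):

```python
def chunk_tokens(tokens, limit):
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]


def predict_long_text(tokens, limit, predict_chunk):
    """Score each chunk separately, then average the chunk-level predictions."""
    chunks = chunk_tokens(tokens, limit)
    scores = [predict_chunk(chunk) for chunk in chunks]
    return sum(scores) / len(scores)


tokens = [f"tok{i}" for i in range(1536)]
chunks = chunk_tokens(tokens, 512)  # 1,536 tokens -> 3 chunks of 512
```

In practice, `predict_chunk` would tokenize and score each block with the fine-tuned model.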
Other techniques may involve using one batch per document or hierarchical modeling. LLMs process data in batches, updating model parameters after each batch. Creating one batch for each long document enables the model to process one full document at a time. Hierarchical-modeling techniques may also organize long texts into manageable chunks and ensure adequate aggregation of units into participant-level representations (Dai et al., 2022; M. Ding et al., 2020; Wu et al., 2021). This may address the concern of text-participant attribution and can help with equal weighting of text samples when some participants’ texts are longer than others.
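Equal weighting of participants, regardless of how many chunks their texts produce, can be achieved by averaging chunk-level embeddings into one participant-level vector. A minimal sketch with plain lists (function name is our own):

```python
def participant_embedding(chunk_embeddings):
    """Average chunk-level embedding vectors into one participant-level
    vector, so participants whose texts produce more chunks are not
    overweighted relative to participants with shorter texts."""
    n, dim = len(chunk_embeddings), len(chunk_embeddings[0])
    return [sum(vec[d] for vec in chunk_embeddings) / n for d in range(dim)]


participant_vector = participant_embedding([[1.0, 2.0], [3.0, 4.0]])
```

Each participant contributes exactly one vector to downstream analyses, however long their text.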
The following pseudocode example demonstrates how to implement the sliding-window approach:
Pseudocode (GitHub file: sliding_window.py):
Loop over the text and divide text into subtexts of length of window.
Use overlap variable to decide how much overlap to keep between subtexts.
Tokenize each subtext using a new or pretrained tokenizer available on Hugging Face or simpletransformers.
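A runnable sketch of the sliding-window logic (token lists stand in for tokenizer output; in practice each subtext would be tokenized with a Hugging Face or simpletransformers tokenizer):

```python
def sliding_window(tokens, window, overlap):
    """Split tokens into subtexts of `window` tokens; consecutive subtexts
    share `overlap` tokens of context (the "stride" of shared text)."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    subtexts = []
    for start in range(0, len(tokens), step):
        subtexts.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return subtexts


windows = sliding_window(list(range(10)), window=4, overlap=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Larger overlap preserves more context across subtexts at the cost of more blocks and longer training time.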
Required computing resources
Computing resources are of the utmost importance (Kaddour et al., 2023). Small LLMs can be run using the central processing unit of any computer, but many LLMs require graphics processing units (GPUs). GPUs are computer processors, originally designed for video gaming, that perform parallel computations and process large amounts of data quickly, making them well suited for machine learning and for working with LLMs. Baseline GPU memory requirements for fine-tuning LLMs can reach upward of 80 GB (also the size of the largest commercially available GPUs; Tuggener et al., 2024). To estimate how much memory is required, a common rule of thumb is roughly 8 bytes of GPU memory per model parameter (e.g., roughly 80 GB to fine-tune a 10-billion-parameter model).
In our working example, we used university-based computing resources. On-demand access to cloud servers can be helpful, but university-based computing was more cost-effective and helpful for batch job processing. Even with a small language model, running the fine-tuning analyses required more than 30 GB of GPU RAM.
Managing memory usage is also critical for working with LLMs. We explored strategies to reduce both static and dynamic memory requirements, including precision reduction, data streaming, gradient checkpointing, and mini-batch optimization. For a detailed discussion of these strategies and implementation examples, see the Supplemental Material.
Model evaluation
Models must be configured during training to produce the desired output (Fig. 2f). In NLP tasks, language-based predictions generally fall into two categories, classification and regression, each with its own evaluation metrics (Berggren et al., 2019). Language can be used to predict a binary classification (e.g., Does someone have a specific attribute, yes or no?), multiclass classifications (e.g., a set of possible labels), or continuous values (e.g., a ratio score). Multiclass labels can be nominal (e.g., predicting one of five political affiliations) or ordinal (e.g., predicting one of four increasing difficulty levels). Models can also be trained as multilabel classifiers in which multiple labels can be selected for each language sample. Finally, a regression task trains models to predict continuous values.
Classification and regression tasks are evaluated using different metrics. Classification evaluation metrics are focused on prediction accuracy. Regression-based metrics are focused on reducing prediction error. Descriptions of evaluation across different metrics can be further studied in tutorials by Vickers et al. (2024) and Pargent et al. (2023).
Most documentation about language modeling uses the term “language classification” to describe the broad category of tasks mentioned above (including regression). Most available information refers to classification tasks rather than regression tasks. For some tools, such as Simple Transformers (Rajapakse, 2019), the default information will address classification tasks, but steps to convert the code to regression are included in the documentation. In some cases, information about classification will still apply to regression because a regression task can be conceptualized as a classification task with one label. In general, classification tasks tend to achieve better overall performance, but regression tasks offer more precise predictions and are often more relevant to psychological constructs, which are typically measured continuously.
In our working example, the personality scores were continuous, and the model was trained to complete a regression task. A model can be configured for regression using the Simple Transformers library by setting the regression parameter to “True” in ClassificationArgs, as shown in regression.py on Line 55. Lines 80 to 81 show how to extract the predictions of personality ratings during testing.
Model-training considerations
Analyzing text data with LLMs relies heavily on general ML procedures (Fig. 2g). Pargent et al. (2023), Choi et al. (2020), Badillo et al. (2020), T. Jiang et al. (2020), and Pandey et al. (2020) are helpful overview articles and tutorials. Coursera (https://www.coursera.org/) and Towards Data Science (https://towardsdatascience.com/) are also practical resources for examples, tutorials, and discussions.
Cross-validation
Cross-validation is a technique used to estimate model reliability and accommodate limited amounts of data (Yates et al., 2023). The data are divided into equal portions, or “folds.” The number of folds (referred to as “k”) may vary; five or 10 are the most common. In each iteration, k − 1 folds are used to train the model, and the remaining fold is used to test it. This process is repeated until each fold has served as the testing fold. The overall estimate of model performance is the average across all iterations (Wong, 2015). For smaller data sets, leave-one-out cross-validation is recommended (see Table 1). Cross-validation is important because it provides a more reliable estimate of model performance, reducing bias from randomness in the data. Variability in performance across iterations can indicate inconsistencies in the data, increased data complexity, or difficulties with the model’s ability to learn (Shulga, 2018). In our working example, we used five-fold cross-validation to help estimate model performance.
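The fold construction can be sketched as follows (hypothetical helper functions; libraries such as scikit-learn provide equivalent utilities):

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, then split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index pairs; each fold serves as the test set once."""
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


folds = kfold_indices(100, k=5)
splits = list(train_test_splits(folds))
```

With 100 cases and five folds, each iteration trains on 80 cases and tests on the held-out 20; overall performance is the average across the five test folds.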
Hyperparameter tuning
Hyperparameters are settings that affect how a model learns; adjusting them to optimize model performance is known as “hyperparameter tuning.” There are many hyperparameters. Three are emphasized as having the greatest impact: learning rate, batch size, and number of epochs (Devlin et al., 2019). The learning rate determines how much the model’s parameters are adjusted in response to training examples. Higher learning rates may speed up the training process but may overshoot optimal parameter values; lower learning rates remedy this problem but will slow down the training process. Specifically for LLMs, learning rates tend to be much smaller than with other ML models because LLMs operate best with subtle adjustments. Learning-rate warm-up strategies are also useful when training LLMs because they gradually increase the learning rate at the onset of training, facilitating stability. Batch size is the number of data samples that are seen by the model before calculating errors and updating the parameters. Batch size is dependent on available computing resources because all data for a given batch need to be held in memory before the model’s weights are updated. Epochs are the number of times the model passes through the entire data set. Training for too few epochs can result in underfitting such that the model does not learn enough about the data. Training for too many epochs can result in overfitting such that the model learns the training data too closely and then does not perform well on other, unseen data.
When determining values for hyperparameters, it is recommended to begin with the same values used to train the base LLM (Devlin et al., 2019). These values are likely published, and some models (e.g., BERT) even recommend possible ranges for hyperparameter values for future fine-tuning. It is then important to experiment with different settings to determine what works best for a particular data set. There are multiple strategies for finding optimal hyperparameter values; grid search, automatic optimization, or random search are the most common (Bischl et al., 2023). A grid search will systematically train multiple iterations of models, trying every combination of values within the given ranges. Automatic-optimization strategies will dynamically adjust the values of specific hyperparameters each iteration, testing values that are uniquely promising and using algorithms to predict what those values would be. Random search tests a wide variety of values within a specified range, with no meaningful decisions about which values to try. Note that the optimal settings for a given model may not fall in the recommended-values range. In this situation, automatic-optimization strategies can be helpful because they can efficiently expand hyperparameter values away from the recommended ranges based on context-specific information.
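A grid search over learning rate and number of epochs can be sketched as follows; `evaluate` stands in for training the model and returning mean cross-validation performance (the toy scoring function below is purely illustrative):

```python
import itertools


def grid_search(grid, evaluate):
    """Train and evaluate one model per combination of hyperparameter
    values; return the best-scoring combination and its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g., mean performance across CV folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [1e-5, 3e-5, 5e-5], "num_epochs": [2, 3, 4]}

# Toy scoring function standing in for model training: it happens to
# favor a learning rate of 3e-5 and more epochs.
best, score = grid_search(
    grid,
    lambda p: -abs(p["learning_rate"] - 3e-5) * 1e5 + p["num_epochs"] * 0.01,
)
```

Random search would sample combinations instead of enumerating every cell, and automatic optimization would choose the next combination adaptively.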
The tuning process can be time-consuming and requires significant computing resources because each combination of parameters is used to train the entire language model. Overfitting is a concern (X. Liu & Wang, 2021). Several strategies are recommended to avoid overfitting in hyperparameter tuning: (a) Early stopping prevents models from overfitting by determining the optimal number of epochs and ending the training process once the model’s performance is no longer improving after a specified number of epochs, typically five to 10 (Dodge et al., 2020). (b) The optimal values are those that resulted in the greatest average performance across all validation folds—not the best values of any individual run. (c) Dropout and weight decay can reduce overfitting. Dropout randomly removes connections between elements of the model during training, and weight decay adds penalties to highly influential paths to encourage the model to examine patterns more generally (Srivastava et al., 2014). After the optimal hyperparameters are determined, the full model should be retrained using these values. See the GitHub page for important training arguments for hyperparameters and early stopping.
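The early-stopping logic in (a) can be sketched as follows (a hypothetical helper; `val_losses` stands in for per-epoch validation loss):

```python
def train_with_early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch): training ends once validation loss
    has not improved for `patience` consecutive epochs."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Validation loss improves until epoch 2, then stalls.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64, 0.65]
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=3)
```

In a real training loop, the model checkpoint from `best_epoch` would be the one retained.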
Weights and Biases is a helpful tool for hyperparameter tuning (Biewald, 2020). This software is free for students, educators, and academic researchers and facilitates hyperparameter sweeps. Weights and Biases can be integrated with other libraries (e.g., Simple Transformers, Hugging Face transformers) to automatically log training and evaluation data in real time and visualize model performance. Figure 5 shows an example of a hyperparameter-tuning log. The results of each combination of learning rate and epochs are plotted, indicating model performance with respect to different combinations of these hyperparameters. Note the range in performance across different combinations, providing useful information about optimal hyperparameter settings.

Example of Weights and Biases hyperparameter-tuning log.
Model visualization
Deep learning uses nonlinear relations across multiple layers, which makes it difficult to understand precisely how LLMs make decisions (this is known as the “black box” problem). Techniques are being developed to increase the explainability of model decisions (H. Zhao et al., 2024). However, simple model visualizations can be helpful (Fig. 2h). One simple method is to correlate token usage from the text input with the target variable. Examining tokens that appear in more than 10% of the sample and selecting those with the highest positive and highest negative correlations is a straightforward approach to identify potentially important features.
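The token-correlation approach can be sketched with plain Python (hypothetical helpers; real use would operate on tokenizer output rather than whitespace-split words):

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def token_outcome_correlations(texts, scores, min_doc_frac=0.10):
    """Correlate token presence (0/1 per document) with a continuous outcome,
    keeping tokens that appear in more than `min_doc_frac` of documents."""
    docs = [set(text.lower().split()) for text in texts]
    vocab = {token for doc in docs for token in doc}
    correlations = {}
    for token in vocab:
        presence = [1.0 if token in doc else 0.0 for doc in docs]
        if sum(presence) / len(docs) <= min_doc_frac:
            continue
        correlations[token] = pearson(presence, scores)
    return correlations


texts = ["happy fun day", "sad slow day", "happy happy talk", "sad quiet night"]
scores = [5.0, 1.0, 4.0, 2.0]
correlations = token_outcome_correlations(texts, scores)
```

Sorting the resulting dictionary by correlation surfaces the tokens most positively and most negatively associated with the target variable.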
Topic modeling is useful for providing insights into the content of language data and reducing the manual labor required to explore themes qualitatively.
If working with longer language samples, it is helpful to split samples into sentence-level data when performing topic modeling to adequately capture the variation of topics discussed by one person. Because the narratives were so long in our working example, we reformatted the data set so that each participant utterance appeared in its own row.
It is also helpful to visualize embeddings in two-dimensional space. t-distributed stochastic neighbor embedding (t-SNE; Van der Maaten & Hinton, 2008) is a dimensionality-reduction technique used to visualize high-dimensional data, such as LLM embeddings, in two-dimensional space. The relative positioning of data points in the visualization provides insight into semantic meaning similarity. CLS embeddings, in particular, are useful for visual inspection because they represent the embedding for the full-text sample. The following pseudocode demonstrates visualizing embeddings in two-dimensional space using t-SNE:
Pseudocode (GitHub file: embedding_visualize.py):
Extract CLS embeddings from pretrained or fine-tuned model for each text in the data set.
Use t-SNE to transform embeddings into two-dimensional space.
Plot the scatter plot for all samples.
Attention weights for each token in the input text can also be visualized (Vig, 2019). This provides information about the importance of each language feature for prediction of the outcome. Create a two-dimensional matrix to visualize CLS tokens:
Pseudocode (GitHub file: attention_visualize.py):
Extract attention layers from the model output.
Select the layer and head for which to view attention weights (most commonly the 0th layer and 0th attention head).
This will provide a square matrix as a two-dimensional array.
A heat map can illustrate how much attention (or weight) is given to each token in the input text to perform the output task. Create a heat map of the above attention matrix, which will show how each token is semantically connected to each other token in the input text:
Pseudocode (GitHub file: attention_visualize.py):
Extract attention layers from model output.
Select attention layer and head to visualize.
Visualize the attention matrix using heat map.
Important Issues for Consideration and Future Directions
In this section, we discuss issues, implementation, and future directions that will be important for using LLMs for psychological assessment.
Ethical considerations
LLMs contain biases that are prevalent in society and that researchers and the field at large should be aware of and prepared to continuously address in a transparent manner (Bender et al., 2021). Working with LLMs may involve sensitive data that need to be handled securely to protect the privacy of, and demonstrate respect for, research participants and patients. In addition, LLMs require significant energy resources, which has a detrimental environmental impact.
Bias
LLMs can be conceptualized as “stochastic parrots” that lack human understanding of meaning. With some randomness, they confidently repeat back what they were trained on, which will include stereotypes and harmful biases that are prevalent in online training data (Bender et al., 2021). Training data from vast online samples reflect society at large. As a result, they will have negative biases against minority groups that can perpetuate harm. Research has demonstrated bias in LLMs across gender, race, culture, and other demographics (Raza et al., 2024), including showing a preference for male pronouns for certain professions (de Vassimon Manela et al., 2021), indicating some religious groups are more violent than others (Abid et al., 2021), favoring majority groups (Zhang et al., 2020), and propagating differential treatment recommendations based on race (Omiye et al., 2023). These biases emerge when LLMs are trained on data that provide an imbalanced or inaccurate representation of a group or do not represent them at all. Although LLMs contain bias, the level varies (Nadeem et al., 2020; Raza et al., 2024). Researchers may select LLMs based on fairness evidence. In the future, it may be beneficial to concentrate on specific representative training samples rather than simply collecting as much training data as possible (Bender et al., 2021).
It is unclear whether bias in LLMs can be eliminated. Without careful evaluation in psychological research, biases can be perpetuated and amplified. For example, LLMs trained on biased data may perpetuate job and financial inequality, amplify harmful content online, misdiagnose and influence clinician decision-making in health care, and otherwise prioritize majority backgrounds (Ferrara, 2023). Techniques are being developed that may reduce model biases, including data augmentation, bias-correction algorithms, and fairness metrics (Cai et al., 2024; Liang et al., 2021; Raza et al., 2024; Sun et al., 2019). However, these techniques cannot fully remove bias. Psychological researchers using LLMs for psychological assessment must be (a) aware of bias, especially bias directly relevant to their area of research; (b) active in ensuring fairness in model development (e.g., comparing model results and predictions across various groups); (c) transparent about the biases in the models that they use (e.g., describing the biases and their potential influence on the results in discussion sections); (d) up to date in their LLM use with the latest techniques to reduce harm; and (e) supportive of and collaborative with researchers from minority groups, especially groups that might be a focus of the research. Psychology-research conferences should hold regular panels with experts on LLMs to spread awareness of bias and of best practices for managing bias in research. Together, these strategies will help the field understand and mitigate bias, reduce the possibility of harm, and build more useful models.
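The per-group comparison in point (b) can be sketched as a simple demographic-parity check: compare the model's mean predicted score across groups and flag large gaps for auditing. The group labels, scores, and threshold below are hypothetical illustrations, not a validated fairness procedure:

```python
from statistics import mean

def demographic_parity_gap(scores, groups):
    """Return the spread between the highest and lowest group mean of
    model-predicted scores, plus the per-group means themselves."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    means = {g: mean(vals) for g, vals in by_group.items()}
    return max(means.values()) - min(means.values()), means

# Hypothetical model-predicted scores and demographic group labels.
scores = [0.62, 0.58, 0.60, 0.31, 0.35, 0.33]
groups = ["A", "A", "A", "B", "B", "B"]
gap, means = demographic_parity_gap(scores, groups)
# A large gap flags the model for closer auditing; on its own it is
# not proof of bias, because true group differences may exist.
```

In practice such a check would be one entry in a broader fairness audit (e.g., comparing error rates, not just mean predictions, across groups).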
Privacy
Text data are often more sensitive than questionnaire data, and it is imperative to take measured steps to protect them. Research participants and patients should complete transparent consent forms that describe the potential risks and benefits and the plans for data usage, in accordance with American Psychological Association (APA) ethical principles (APA, 2017). Data should be de-identified when possible. Data should be stored on a secure, password-protected, and encrypted server accessible only to authorized personnel (all of whom have training in data security). In addition, the server and its network can include a firewall to protect the data. Regular audits of the security system can be conducted to prevent data breaches.
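A first pass at the de-identification step can be sketched with pattern-based scrubbing of obvious identifiers. The patterns below are illustrative assumptions and are deliberately minimal; real de-identification pipelines also require named-entity recognition and human review to catch names, locations, and indirect identifiers:

```python
import re

# Minimal pattern-based scrubbing of obvious identifiers. These
# regexes are illustrative, not exhaustive: they miss names, dates,
# addresses, and many other identifiers.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def deidentify(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

sample = "Contact me at jane.doe@example.org or 555-867-5309."
print(deidentify(sample))
# → Contact me at [EMAIL] or [PHONE].
```

Placeholder tokens (rather than deletion) preserve sentence structure for downstream language analysis.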
At times, it may be necessary to work with third-party service providers. This must be done in a manner to which research participants and patients have consented and that complies with the relevant regulations and oversight bodies (e.g., Institutional Review Boards, the Health Insurance Portability and Accountability Act, the General Data Protection Regulation). As few third parties as possible should be involved in the process. When using APIs, connections should be secure, authenticated, and encrypted. Vendors will have compliance standards that should be reviewed.
Environmental impact
Deep learning is computationally expensive. It requires significant power, which leads to a growing carbon footprint (Patterson et al., 2021). As a result, researchers are devising ways to train models more efficiently and reduce negative consequences, such as excessive water usage and CO2 emissions (Rillig et al., 2023). The estimated energy usage for an analysis can be directly calculated, which can be helpful for planning efficient analyses (Hershcovich et al., 2022; Strubell et al., 2020). Researchers should be aware of the energy use that potential analyses would require and take steps to reduce unnecessary analyses. This may include reporting training times, using efficient computational hardware and models, and being aware of power resources used—for example, by data centers and cloud-computing services (Strubell et al., 2020). Researchers should also consider any potential positive downstream environmental impacts of a model (Hershcovich et al., 2022). The rapid increase in the power required for training LLMs poses serious ethical dilemmas for researchers that should be understood, prioritized, and addressed in transparent ways moving forward.
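The direct energy calculation referenced above follows the general approach of Strubell et al. (2020): hardware power draw, multiplied by runtime and a data-center overhead factor, then converted to CO2 via grid carbon intensity. The specific figures below (GPU power, PUE, carbon intensity) are illustrative assumptions, not measurements:

```python
def training_footprint(gpu_count, gpu_power_kw, hours,
                       pue=1.5, kg_co2_per_kwh=0.4):
    """Estimate energy (kWh) and CO2 (kg) for a training run.
    pue = power usage effectiveness (data-center overhead multiplier);
    kg_co2_per_kwh = grid carbon intensity. Both vary widely by site,
    so real planning should use local, measured values."""
    energy_kwh = gpu_count * gpu_power_kw * hours * pue
    co2_kg = energy_kwh * kg_co2_per_kwh
    return energy_kwh, co2_kg

# Hypothetical fine-tuning run: 4 GPUs drawing 0.3 kW each for 24 hr.
energy, co2 = training_footprint(gpu_count=4, gpu_power_kw=0.3, hours=24)
print(f"{energy:.0f} kWh, {co2:.1f} kg CO2")  # → 43 kWh, 17.3 kg CO2
```

Running this estimate before an analysis makes it easy to compare, for example, fine-tuning a small model against prompting a hosted one.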
Other LLM limitations
LLMs will generalize only to the population in which they were developed. Researchers should strive for approximately equal representation of every group to which a model should generalize (Ntoutsi et al., 2020). This means continuing to emphasize the inclusion and collection of language from diverse groups. Of course, most models will not include an accurate representation of everyone. This must be acknowledged in model-description materials and research articles. This will help prevent the use of models in groups for which the model may not work or may even produce harmful results. Recently, some psychology journals have begun requiring discussion-section “generalizability statements” that are consistent with this recommendation.
LLMs may also generalize only to the situation in which they were trained (e.g., interview, cognitive task, social media). Research should cross-validate models across contexts, and models should not be applied in a new context without validation in that context. Models built from text gathered in controlled environments may not transfer to real-life settings (e.g., ambulatory recordings). These generalizability questions will be exciting future research directions.
Token limits are a current limitation in working with LLMs. Early models had relatively small token limits (e.g., 512 tokens). We have outlined ways to work with longer texts, but this is a primary area of future development. Newer LLMs have much longer token limits that could greatly facilitate LLM comprehension of longer texts. However, models with long token limits should be tested to ensure that they are in fact remembering (or properly maintaining) context across long texts. At the current time, managing token limits can be challenging, but simpler methods are likely to emerge as models continue to advance.
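One common workaround for token limits is to split a long text into overlapping chunks, score each chunk, and aggregate the chunk-level scores. A minimal sketch, using whitespace tokens as a stand-in for a real subword tokenizer and a hypothetical `score_chunk` function standing in for an LLM:

```python
from statistics import mean

def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split tokens into windows of at most max_len, with `overlap`
    tokens of shared context between consecutive windows."""
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens), step)]

def score_long_text(text, score_chunk, max_len=512, overlap=50):
    """Score each chunk and aggregate by averaging. Averaging is one
    simple choice; weighting by chunk length is another."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    chunks = chunk_tokens(tokens, max_len, overlap)
    return mean(score_chunk(" ".join(c)) for c in chunks)

# Toy example: a "model" that scores a chunk by its token count.
long_text = " ".join(["word"] * 1200)
avg = score_long_text(long_text, score_chunk=lambda s: len(s.split()))
```

The overlap keeps sentences that straddle a chunk boundary from losing all of their context, at the cost of scoring some tokens twice.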
Interpretability and explainability
LLMs use deep-learning techniques that can function as a black box: The massive nonlinear complexity of the algorithms and layers in these models can make their decisions indecipherable to humans. This is a problem for researchers and clinicians, who must be able to justify research conclusions and clinical decisions. As a result, researchers must do what is possible to understand how decisions are being made.
However, techniques exist to illuminate how LLMs make decisions and predictions. Attention visualization identifies how a neural network distributes its attention across the tokens available to it. The differential weights that the LLM places on input tokens while making a prediction can be rendered as a heat map or as text highlighting, showing the user which parts of the text were most important to the prediction (e.g., Jeon et al., 2024). SHapley Additive exPlanations (SHAP) is an explainable-AI technique, grounded in game theory, that attributes differential importance to the input tokens (Lundberg & Lee, 2017). SHAP values are often visualized in waterfall plots, which help researchers interpret the key token predictors of an outcome of interest. However, SHAP requires repeated evaluations of a model with different feature combinations, and LLM-based analyses often involve extremely high-dimensional inputs, so it may be feasible only in smaller-scale LLM applications.
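Full SHAP attribution requires the `shap` library and many model evaluations; a lighter-weight relative of the same idea is occlusion (leave-one-token-out) importance, sketched below. The toy scoring function standing in for an LLM is a hypothetical illustration:

```python
def occlusion_importance(tokens, score_fn, mask="[MASK]"):
    """Importance of each token = the drop in the model's score when
    that token is replaced by a mask. This is a crude approximation of
    Shapley-style attribution: it ignores token interactions, which
    SHAP accounts for by averaging over feature coalitions."""
    baseline = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        importances.append(baseline - score_fn(occluded))
    return importances

# Toy scorer standing in for an LLM: rate of negative-affect words.
NEGATIVE = {"hopeless", "tired", "worthless"}
score = lambda toks: sum(t in NEGATIVE for t in toks) / len(toks)

tokens = "i feel hopeless and tired".split()
imps = occlusion_importance(tokens, score)
```

The resulting importances can be displayed as text highlighting, mirroring the heat-map visualizations described above.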
LLM outputs should also be understood through traditional psychometric-validation techniques. After a model is fine-tuned, for example, it may produce a predicted score for an outcome of interest. In the future, LLM output scores should be validated just as psychological variables have been in the past, with construct validation such as convergent-, discriminant-, and criterion-validity tests (cf. Chancellor & De Choudhury, 2020; Strauss & Smith, 2009). Nomological networks of the model output should be examined (e.g., What other constructs does it predict, and what does it not predict?; cf. Cronbach & Meehl, 1955), helping researchers place LLM-based scores in the broader research literature. Reliability should be understood through tests of internal consistency and test-retest reliability (cf. Simons et al., 2024). Construct-validation techniques will provide an understanding of what LLM-based predicted scores represent just as they facilitated understanding of psychological scores in the past.
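The convergent- and discriminant-validity checks described above reduce, at their simplest, to correlating LLM-predicted scores with validated criteria. The scores below are fabricated for illustration only; the questionnaire names are hypothetical stand-ins for whatever validated measures a study uses:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: an LLM-predicted depression score, a validated
# depression questionnaire (convergent criterion), and an unrelated
# trait (discriminant check).
llm_dep   = [0.2, 0.5, 0.8, 0.4, 0.9, 0.1]
dep_quest = [3, 9, 14, 7, 16, 2]
openness  = [4.1, 3.0, 3.9, 2.5, 3.1, 4.0]

r_conv = pearson_r(llm_dep, dep_quest)  # convergent: should be high
r_disc = pearson_r(llm_dep, openness)   # discriminant: should be low
```

A full validation study would add criterion validity, test-retest reliability across repeated language samples, and a broader nomological network.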
Humans and LLM-based psychological assessments
LLMs hold promise for the automation and augmentation of assessment methods; however, results still vary by task. Each use case should be validated against human raters to evaluate model performance. For example, Schoenegger et al. (2024) compared the abilities of laypersons, psychology experts, pretrained LLMs, and a specialized AI model trained on personality data to predict correlations between personality items. Results indicated that AI models made better predictions than 85% of individual humans. However, median predictions from the whole group of psychology experts rivaled the specialized AI and outperformed those of pretrained models. This suggests that LLM performance may exceed that of most individual evaluators, yet experts collectively still hold an advantage.
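The structure of such a comparison can be sketched as follows: score a model and each individual rater by mean absolute error against a criterion, then pool the raters by taking their per-item median prediction. All values below are fabricated for illustration and do not reproduce Schoenegger et al.'s data:

```python
from statistics import median

def abs_error(preds, truth):
    """Mean absolute error of predictions against criterion values."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)

# Hypothetical criterion values (e.g., observed item correlations),
# model predictions, and three experts' predictions.
truth = [0.30, 0.55, 0.10, 0.70]
model_preds = [0.28, 0.50, 0.15, 0.66]
expert_preds = [
    [0.35, 0.60, 0.05, 0.80],
    [0.20, 0.45, 0.20, 0.60],
    [0.30, 0.55, 0.00, 0.75],
]

model_err = abs_error(model_preds, truth)
individual_errs = [abs_error(e, truth) for e in expert_preds]
# Pooled "wisdom of the crowd": per-item median of expert predictions.
crowd = [median(col) for col in zip(*expert_preds)]
crowd_err = abs_error(crowd, truth)
```

In this fabricated example the model beats most individual experts, yet the pooled expert median beats the model, mirroring the pattern reported above.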
Given the limitations outlined above, LLM-based psychological assessments should not be relied on as stand-alone assessments in clinical or applied situations without human oversight. Humans should always retain oversight and final judgment over any consequential decisions informed by an LLM. Ideally, LLM-based assessments will be administered as one tool within a battery of multiple measures. They are currently best considered a potentially helpful tool for understanding psychological phenomena.
Collaboration among psychologists, computer scientists, and others is essential for LLM tools to be as useful as possible for psychological assessment. Professionals from each area have unique insights, questions, and ways of thinking about assessment and developing research projects. Reliance on team science will also reduce the burden on any one scientist to master all cutting-edge methodologies. Interdisciplinary data-science PhD programs will be important for producing scientists who can help bridge the gap between disciplines. Productive collaborations occur when professionals from different areas come together with mutual respect and put in the time needed to work together efficiently and effectively. Although this can be a challenge, effective interdisciplinary collaboration will be necessary to develop LLM-based psychological-assessment methods that are as effective as science and medicine will need them to be.
Researchers and clinicians who administer LLMs for assessment should have proper training in their effective use. Currently, we are not aware of any official guidelines or standards. It may be fruitful to develop guidelines specifying the training that provides the foundational knowledge and essential skills needed to work effectively in this area. Furthermore, it would be especially useful for this training to give researchers the tools to continue to grow their knowledge and stay aware of the latest best practices in the field throughout their careers.
Guidelines for the development and administration of LLM-based psychological assessments may also be helpful. Organizations such as the APA have provided resources and updates on policy for AI generally (APA, 2023). Researchers have also published useful guidance about ethical use of LLMs in science (Parker et al., 2023; Watkins, 2024). These protocols may include standards on transparency, data collection, management, privacy and security, bias mitigation, generalizability, training, and deployment. Organizations that may help develop these standards include research organizations, institutions, professional associations, publishers, and advocacy groups. Guidelines may help promote responsible practices and reduce potential harms. Just as psychologists follow the PRISMA guidelines when conducting meta-analytic reviews (Page et al., 2021), analogous guidelines could structure LLM-based assessment research. Beyond universal best practices, we emphasize the importance of flexible guidelines that account for the unique context of model development.
Future directions
There is growing evidence that multimodal models, which incorporate more than just language, improve predictive utility (Morales et al., 2018). It will be fruitful to pair LLMs with standard psychological assessments and other technologies to examine unique and combined predictive power across features (e.g., Harari et al., 2017; Jacobson & Bhattacharya, 2022). The transformer model can also be used with nonlanguage predictors (Wang & Sun, 2022). In the future, modeling features from video recordings in tandem with traditional psychological assessments may provide a more holistic assessment of a person.
LLMs will soon have longer context attention; better strategies to mitigate bias; better regulatory standards, guidelines, and available training; and better techniques for model explainability and interpretability, security, validation, and access. Alternative model architectures, such as state-space models, have already rivaled the transformer model in NLP (Gu & Dao, 2023). It is important for researchers to stay informed of these developments. We recommend following journals, new books, podcasts, and online courses and webinars; attending conferences; and maintaining communication with interdisciplinary collaborators. Commitment to rigorous methodology, such as the collection of high-quality data, including well-validated assessments with useful and targeted language samples across diverse populations, is also imperative. Consortiums that bring together researchers with similar interests in specific LLM applications may be useful for enhancing data size and diversity.
Conclusions
LLMs offer important advantages compared with traditional psychometric approaches such as the self-report questionnaire. These include their behavioral nature, scalability, and allowance for a broader range of response possibilities. Language assessments can be derived from routine tasks or in naturalistic environments using smartphones. Despite these potential advances, the technology carries significant risks and biases. Psychologists must be aware of the biases in LLMs and of ways to mitigate them.
The purpose of this overview is to provide accessible guidance on a novel and complex methodology. Despite rapid advances, relatively little is known about using LLMs for psychological assessment. Although a growing number of high-quality studies are emerging, many face limitations related to sample size, diversity, language data types, or psychological measurement. We encourage psychologists to strive for strong psychometric, methodological, and interdisciplinary contributions in the evolving area of using LLMs for psychological assessment. We hope this article helps promote them.
Supplemental Material
Supplemental material, sj-docx-1-amp-10.1177_25152459251343582, for “Large Language Models for Psychological Assessment: A Comprehensive Overview” by Jocelyn Brickman, Mehak Gupta, and Joshua R. Oltmanns in Advances in Methods and Practices in Psychological Science.