Abstract
Sound measurement is at the foundation of psychological science. In the past century, development of the construct-validation process has propelled the field to meaningful theory testing (Clark & Watson, 2019; Strauss & Smith, 2009). Yet most of what is known about psychological constructs relies on self-report measurement, which has weaknesses such as socially desirable responding, overreporting and underreporting, cultural and retrospective biases, and limitations in self-insight (e.g., Paulhus & Vazire, 2007). Psychologists strive to incorporate multimethod assessment into research and practice (APA Task Force on Psychological Assessment and Evaluation Guidelines, 2020) because it increases the validity of assessment (Hopwood & Bornstein, 2014; Meyer et al., 2001). However, actual use of multimethod assessment is rare, in large part because it can be burdensome and time-consuming. Advances in artificial intelligence (AI), in particular, large language models (LLMs), provide opportunities to incorporate, improve, and facilitate multimethod psychological assessment.
Language as an assessment tool has several strengths. It is behavioral, providing a more objective approach to assessment, and it can be natural, providing ecological validity to assessment, thereby avoiding some of the inherent limitations of self-report questionnaires. It is also rich, allowing individuals to express themselves in ways that break free of traditional rating scales (Kjell et al., 2024). Using language to study psychological constructs has already greatly expanded understanding of them (Pennebaker et al., 2003). With extraordinary recent advances in technology, language will likely continue to expand knowledge of psychological characteristics at a rapid pace.
There are also more practical advantages to using language as an assessment tool. First, language as an assessment tool is scalable. Validated LLM tools could be more easily implemented into routine research and clinical activities that involve speech, supplementing self-report assessments and saving time and resources for both participants/patients and researchers/clinicians. LLM psychological-assessment tools could also greatly enhance assessment coverage. For example, well-developed LLM-based tools may assess a wide variety of psychological constructs from a single language sample, whereas comparable coverage through questionnaires may take many hours of completion time. In emergencies or particularly low-resource situations, validated LLMs might provide assessments from language when no other assessment would be available.
The goal of this overview is to provide an accessible guide for psychologists to use LLMs to assess psychological constructs through language. We first present the history, significance, and development of the transformer-based LLM; explore the experimental-design process; and consider important issues related to LLM ethics, implementation, and future directions. We also present helpful techniques, tools, and code. Included on the accompanying GitHub page are a coding-based tutorial on using LLMs for psychological assessment and files containing specific code examples for applying the techniques we describe in our second section. Although we strive for an introductory level of description, we use many machine-learning terms that are essential for understanding and working with LLMs. For that reason, we also include a glossary (Table 1). Table 1 includes definitions and useful software packages in which certain procedures can be performed.
Glossary
Note: LLM = large language model; NLP = natural language processing.
Development of Transformer-Based LLMs
Language is central to human identity. Psychologists have long been interested in the relevance of human-language expression for understanding a person (Sanford, 1942). The ability to use language to assess psychological constructs was significantly bolstered by the development of word-counting programs (Pennebaker & King, 1999). “Dictionaries,” the backbone of this technique, use scoring rules derived from expert ratings of words to score psychological constructs from text. This method is also known as a “bag of words” approach. The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker et al., 2003) uses a bag-of-words approach to count word use in text documents and score psychological constructs. It began as a simple text-analysis program and has been refined continually, with new versions released since its inception (Boyd et al., 2022). LIWC provides scoring of various emotion- and cognitive-process categories in addition to grammatical and language-use categories from text. It has become the most influential text-analysis program in psychology, demonstrating the ability to shed light on attention, emotion, social, thinking, and personality processes from language (Tausczik & Pennebaker, 2010).
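To make the dictionary approach concrete, the core scoring logic can be sketched in a few lines of Python. The word lists below are toy examples for illustration, not actual LIWC categories, which contain hundreds of expert-rated words.

```python
import re

# Toy dictionaries; real LIWC categories contain hundreds of expert-rated words.
DICTIONARIES = {
    "positive_emotion": {"happy", "love", "great", "joy"},
    "negative_emotion": {"sad", "hate", "awful", "angry"},
}

def bag_of_words_scores(text):
    """Score each category as the percentage of total words it matches."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    return {
        category: 100 * sum(w in lexicon for w in words) / total
        for category, lexicon in DICTIONARIES.items()
    }

scores = bag_of_words_scores("I love my family but I hate being sad")
```

Note that word order and context play no role here: each word is scored in isolation, which is exactly the limitation that later approaches address.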
However, early statistical language-processing efforts struggled with language tasks because human language can be ambiguous, with rule exceptions and meaning changes across contexts (Johri et al., 2021; Khurana et al., 2023). Initial models had a finite set of rules and inflexible decision-making algorithms and were unable to understand linguistic nuances. Furthermore, it was impossible to write rules and meanings for every scenario.
“Word embeddings” became an important solution (Almeida & Xexéo, 2019). Word embeddings are lists of numeric values (i.e., word vectors) that represent the meaning of words across multiple dimensions, capturing semantic and syntactic connections between words. Early models used two main strategies to generate word embeddings: (a) Prediction-based models (e.g., Word2vec; Mikolov et al., 2013) generate word embeddings by predicting a target word from context words (i.e., words immediately surrounding it) or by predicting context words from a target word. (b) Count-based models (e.g., GloVe; Pennington et al., 2014) generate word embeddings through counting global word co-occurrence in a text body. These early embedding models drastically improved the ability of computer programs to understand language, but the embeddings were static—that is, each word had only one embedding (Almeida & Xexéo, 2019). This was a problem because words with changing, context-dependent meanings would have the same word embeddings regardless of how the word was used in a particular instance.
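The counting step behind count-based models can be illustrated with a minimal sketch. GloVe then factorizes such global co-occurrence statistics into dense, low-dimensional vectors; that factorization step is omitted here.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    counts = defaultdict(int)
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens)
```

Words that appear in similar contexts accumulate similar count profiles, which is the statistical signal that embedding models compress into vectors.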
Recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) are deep-learning neural-network architectures for natural language processing (NLP). Although they process static word embeddings, they update with word context, appropriately mapping words to different possible meanings based on surrounding words (Johri et al., 2021; Khurana et al., 2023). RNNs and LSTMs improved model performance because they are better at maintaining accuracy across changing contexts. These models no longer followed predetermined rules and instead developed dynamic algorithms for decision-making that could update with greater exposure to language samples (Johri et al., 2021). Although these updated models outperformed previous methods, they required exposure to large amounts of data to learn words in different contexts. RNN and LSTM models are also limited in efficiency because they process language one word at a time, leading to long training times. These models require significant computational resources and still struggle to maintain understanding of word context over text that is longer than one sentence (Min et al., 2023; Vaswani et al., 2017).
The transformer-model architecture, which was the foundation for the development of LLMs, can provide a context-specific, quantitative representation of language (Vaswani et al., 2017). The transformer was a significant advance largely because of its unique “self-attention” mechanism. Self-attention allows the model to process all words in relation to all other words in a text sample simultaneously, as opposed to older methods that used sequential attention (Fig. 1). Sequential attention could lead to information buildup and forgetting of information that came earlier in a text sample. In the transformer, because all words communicate with each other directly, relations between words can be more accurately captured and retained across longer lengths of text.

(a) Method used by previous natural-language-processing models to process text. Each word is processed individually; the model would initially perceive that the pronouns refer to “dinner” before processing is complete. (b) Method used by transformers’ self-attention mechanism to process text. Pronoun references are clearly understood.
A transformer model is a deep-learning model that generally consists of “encoders” and “decoders” (Vaswani et al., 2017). But transformer models can vary in their composition of encoders or decoders. Both encoders and decoders consist of self-attention “layers” that help transformers generate contextualized representations of input text (i.e., how the tokens in the text relate to one another). Encoders consist of a self-attention layer and a feed-forward neural network. Input first goes through the self-attention layer, in which relationships between each token and every other token in the sentence are learned. Multiple layers of encoders with similar architecture can be “stacked,” meaning input is processed through multiple encoder layers sequentially, which allows the model to capture more complex patterns. A decoder processes the output from the encoder and also has attention and feed-forward neural-network layers. The decoder’s attention layer is referred to as an “encoder-decoder attention” layer, and it helps the decoder focus on relevant parts of the input sequence from the encoder. Because the decoder is used to generate text, its self-attention layer uses masking, in which the tokens on the right side of the sequence are masked so that the decoder cannot see future words of the sentence it is learning to generate. This prevents the model from knowing future tokens and constrains it to focusing only on preceding tokens to generate new text. Similar to encoders, decoders can also be stacked.
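A toy sketch of scaled dot-product self-attention, including the decoder-style causal mask described above, may help clarify the mechanism. This is a simplification: real transformers use learned query, key, and value projection matrices and operate on high-dimensional tensors, whereas here queries, keys, and values all equal the raw input vectors.

```python
import math

def softmax(xs):
    """Convert raw scores into attention weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=False):
    """Scaled dot-product self-attention over a list of token vectors.
    With causal=True, each token can attend only to itself and to
    preceding tokens (decoder-style masking of future tokens)."""
    d = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        scores = []
        for j, key in enumerate(vectors):
            if causal and j > i:
                scores.append(float("-inf"))  # mask: future token is invisible
            else:
                scores.append(sum(q * k for q, k in zip(query, key)) / math.sqrt(d))
        weights = softmax(scores)
        # Each output is a weighted average of all (visible) token vectors.
        outputs.append([sum(w * v[t] for w, v in zip(weights, vectors))
                        for t in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens, causal=True)
```

Because every token's score against every other token is computed directly, no information has to survive a long sequential chain, which is why transformers retain context over longer texts.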
The advances in contextual understanding and speed provided by the transformer architecture enabled the creation of LLMs. The transformer allows the processing of massive amounts of data for training, often from online repositories. Initial transformer models were developed with a variety of data sets that were large for the time (Hadi et al., 2024). For example, Google’s Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019) was pretrained on English Wikipedia and BooksCorpus (11,038 free books from the web). This training process allowed the development of a model with millions of parameters for identifying words and thousands of embeddings, which gives the model a general understanding of language.
In subsequent years, advances in computing resources have enabled the size of language models to grow (Hadi et al., 2024). Whereas initial transformer models were trained with millions of parameters (totaling less than 200 GB of storage), models are now being trained on hundreds of billions of parameters (requiring more than 7 TB of storage), resulting in more powerful and versatile language models. Although the term “LLM” is formally used to describe these newer, larger models, in this article, we use “LLM” to include the initial transformer models as well.
Broadly, transformer-based models can be divided into three types: encoder-only, decoder-only, and encoder-decoder. Tasks in which input text needs to be understood to generate output text require encoder-decoder architecture, for example, language translation (translating text from one language to another), summarization (distilling texts to only the main points), reformatting language (e.g., speech to text), and question answering. Here, an encoder changes the input text into a numerical representation while considering the context of the text, and a decoder uses that numerical representation to generate the output text one token at a time (this is also called “autoregressive text generation”). LLMs such as T5 (Text-to-Text Transfer Transformer) and BART (Bidirectional and Auto-Regressive Transformer) are encoder-decoder models.
Encoder-only models are used in scenarios focused on understanding input text to perform tasks such as text classification (sorting language into categories), named entity recognition, sentiment analysis, or retrieval tasks. Models such as BERT, RoBERTa (Y. Liu et al., 2019), and DeBERTa (He et al., 2020) fall under this category. Decoder-only architectures are popular and used for generative tasks in which responses are predicted one token at a time. This architecture is used for large-scale generative models such as GPT (Generative Pretrained Transformer). Decoder-only models pretrained on large text corpora can perform generative tasks such as summarization, question answering, and sentence completion. Indeed, most transformer models can be used for more than one language task.
Experimental-Design Process
LLMs show immense promise for psychological research and measurement, yet using these models remains complex and is often made more difficult by a lack of documentation for specific uses. In this section, we outline the experimental-design process from start to finish and identify relevant considerations at each step. This overview emphasizes details specific to NLP and the use of LLMs, but an understanding of general machine learning (ML) is also required for carrying out such analyses. Although brief definitions of relevant terms are included in Table 1, we also recommend helpful articles to explore these topics in more detail. Figure 2 presents a road map of the process. In each section, we discuss relevant concepts and decision-making considerations and provide examples from different areas of psychology and a continuous working example from research assessing Big Five/Five-Factor Model (FFM) personality traits from interview language (Oltmanns et al., 2025).

Overview of experimental-design process. (a) Data collection, (b) language conversion, (c) text preprocessing, (d) LLM technique, (e) LLM selection, (f) model evaluation, (g) model training considerations, and (h) model visualization.
In our working example, a representative community sample of 1,409 older adults was recruited from the St. Louis, Missouri, area. The mean age of the sample was 59.5 years; 54.5% identified as female, 65% identified as White, 32.7% identified as Black/African American, 2.3% identified as other, and 1.7% reported Hispanic/Latino descent. Participants completed life-narrative interviews in which they were asked to divide their adult life into three or four chapters, title their chapters, and then briefly describe those chapters. Next, participants were asked about high and low points, best and worst characters, and a turning point in their life story. Interviews lasted about 20 min, on average. Participants then completed the self-report NEO-Personality Inventory–Revised (NEO-PI-R; Costa & McCrae, 1992), from which five broad personality trait domains were scored (neuroticism, extraversion, openness, agreeableness, and conscientiousness). The NEO-PI-R scores were used to train language models of personality from the life-narrative interviews.
Throughout this section, we include “pseudocode” to show how Python code may be used to complete certain steps. Each pseudocode block is numbered with sequential steps for a given task. The names of snippets containing actual Python code that accompany the pseudocode are included in parentheses next to the pseudocode titles; the snippets are located in the accompanying GitHub repository (https://github.com/mehak25/Intro-to-LLM). This repository also includes a hands-on coding tutorial on using LLMs.
The first decision in the process is what the researcher would like to ultimately predict or classify. This should influence data collection (Fig. 2a). In our working example, we collected life-narrative interviews to examine whether individuals’ patterns of language use in storytelling could reliably predict personality traits. At each stage of Figure 2, there are important decisions to be made.
Data collection
We focus on several forms of natural language that show promise for psychological assessment with NLP (Fig. 2a). Each type of language data has its own unique strengths and limitations in the data-analysis pipeline. Ideally, multiple forms may be used in tandem to provide a more robust estimate and understanding of a psychological construct. Model-prediction accuracy is heavily influenced by the quality and quantity of the data, making data collection and data preprocessing one of the most important considerations before the analysis process (Demszky et al., 2023). Note that models trained on one language-sample type may not apply well to other language-sample types, and this will be a critical area of investigation in the future (cf. Chekroud et al., 2024).
First, language may be collected through prompts, for example, recording verbal responses to prompted questions or written tasks. The process for collecting prompted language can be self-administered, allowing participants to complete tasks without researchers present and potentially in more comfortable locations. Collecting a sufficient amount of language from prompts can be difficult. Strategies such as carefully planned open-ended or multipart questions, explicit follow-up prompts, and timers can help encourage continued speech.
Second, interviews capture an authentic and targeted language exchange between individuals. This may include job interviews, clinical interviews, group interviews, or life-narrative interviews. It can be helpful to consider what kind of language would be most useful for future modeling purposes and how to encourage it in the interview. One potentially important downstream consideration for analysis is separating the interviewer and interviewee in audio files. It can be beneficial to record with multiple microphones, which makes it easier to separate the speakers downstream.
Third, social media posts are the most common form of natural written language used for NLP (Chancellor & De Choudhury, 2020). Social media data often provide large sample sizes of short texts, typically containing multiple status updates per user. Both Facebook and Twitter/X provide application programming interfaces (APIs) to download large amounts of text.1 APIs are tools that provide access to complex software programs or systems. Social media status language is unique—there may be topics an individual is more or less likely to post publicly about, and many participants and patients do not use social media, which will affect the validity of the assessment.
Fourth, ambulatory methods are used to collect more naturalistic and ecologically valid language data in everyday life (Mehl, 2017; Trull & Ebner-Priemer, 2013). Ambulatory recordings have several advantages: They can (a) have high ecological validity, (b) be collected multiple times over the course of a day, and (c) capture emotions and behaviors in real time (Lazarević et al., 2020). Ambulatory recordings are often implemented using smartphones, smartwatches, or other wearable recording devices. The Electronically Activated Recorder (Mehl, 2017) is available as a smartphone application that passively records speech in a naturalistic environment. If ambulatory data collection is active, it can be burdensome and uniquely difficult to collect.
Fifth, electronic health records (EHRs) are secure digital copies of patient charts including clinical notes from different settings, test results, and diagnoses. Extracting language data from EHRs can provide information on clinical treatment, professional opinion, and testing history that may reveal a significant amount about psychological functioning. For example, Y. Liu et al. (2023) used language models to find stigmatizing language in clinical notes to understand physician bias in patient assessment. LLMs can classify social determinants of health and behavioral-health data from clinical notes in EHRs (Englhardt et al., 2024; Milligan et al., 2024).
Sixth, LLMs can assess language that was written by psychologists for professional purposes, for example, questionnaire items, vignette text, language from formal psychological-testing measures, clinical diagnostic criteria and symptom descriptions, intervention scripts, and research articles. This language can inform the test-development process and support examination of coherence between clinical and assessment materials and human responses to these materials. Although clinical and assessment language is not natural language, use of LLMs to improve these materials and our understanding of them is promising.
Language conversion
In this section, we discuss several important considerations in processing audio or image files into text files for downstream NLP (Fig. 2b).
Audio processing and transcription
After data collection, raw language samples need to be converted into formats better suited for analysis. This commonly includes transcribing speech samples from audio files to text but could also be reformatting digital language or transferring handwritten language to digital formats (Subramani et al., 2020). Conversion can be completed manually or through automated processes. Automatic speech recognition requires much less time and fewer financial resources but is more likely to contain errors. Options include tools such as OpenAI’s Whisper, Google’s Speech-to-Text, and Microsoft’s Azure. The accuracy of automatic-transcription tools has improved dramatically in the past few years (Spiller et al., 2023). These tools can be used on premises (i.e., implemented on a secure server at the researcher’s institution), which is essential if the sample contains confidential information.
Speaker diarization
When speakers on the same audio track need to be separated, this is called “speaker diarization.” Open-source diarization tools include SpeechBrain (Ravanelli et al., 2021), pyannote (Bredin et al., 2020), and WhisperX (Bain et al., 2023). These tools extract speech features from the audio signal and then use deep-learning models to differentiate between speakers based on unique voice characteristics (e.g., variations in pitch, volume, vocal-cord vibration). Diarization is still a difficult task to automate and often has errors, so manual review may be necessary.
Text preprocessing
Successful transcription produces text files. However, further text preprocessing may be needed, for example, to isolate language of interest or match the expected text formatting of an LLM (Fig. 2c).
Text isolation
Researchers might be interested in isolating language from one person. In interview transcriptions, speaker labels (e.g., “interviewer:”; “interviewee:”) are helpful. An example of speaker isolation is included in the GitHub repository. Other creative strategies can be helpful: For example, if the interviewer and interviewee need to be separated, the speaker who asks (or answers) more questions may be used as a proxy to identify them. Furthermore, speakers of interest may be identified by their use of specific words, phrases, or topics that they may be more likely to use.
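As a minimal illustration of speaker isolation, turns in a labeled transcript can be collected with a few lines of Python. This sketch assumes a "speaker: utterance" line format; the actual example in the GitHub repository may differ.

```python
import re

def isolate_speaker(transcript, speaker="interviewee"):
    """Collect all turns attributed to the given speaker label.
    Assumes each line is formatted as 'speaker: utterance'."""
    turns = []
    for line in transcript.splitlines():
        match = re.match(r"\s*(\w+)\s*:\s*(.*)", line)
        if match and match.group(1).lower() == speaker:
            turns.append(match.group(2))
    return " ".join(turns)

transcript = """interviewer: Tell me about your first chapter.
interviewee: It begins when I moved to St. Louis.
interviewer: What happened next?
interviewee: I started my first job."""

text = isolate_speaker(transcript)
```

Real transcripts are messier (inconsistent labels, crosstalk, diarization errors), so manual spot-checking of the isolated text is advisable.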
De-identification
Language data may contain confidential information that should either be de-identified or analyzed on a secure local server (Hoory et al., 2021). Named entity recognition (NER) is an NLP technique that can de-identify text samples by locating predetermined categories of words or phrases (e.g., names, locations, dates) in text; several open-source NER packages can perform this task.
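As a simple complement to NER-based tools, clearly structured identifiers can be redacted with regular expressions. The patterns below are illustrative only and would miss many identifier formats; validated de-identification pipelines should be preferred for real data.

```python
import re

# Illustrative patterns only; production de-identification should use a
# validated NER pipeline rather than hand-written regular expressions.
PATTERNS = {
    "[PHONE]": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each matched identifier with a category placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

clean = redact("Seen on 3/14/2023; callback 314-555-0199.")
```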
Stop words
Stop words are commonly used words (e.g., “a,” “the,” “is,” “in”) that have traditionally been removed from language samples during preprocessing because their widespread use provided little unique information. LLMs capture contextual information from language, so they tend to work best when stop words are preserved (Shekhar et al., 2024), including contractions and all word forms and tenses.
Tokenization
Tokenization is the process of breaking down raw text into smaller units called “tokens,” which serve as input into the LLM (W. X. Zhao et al., 2023). Tokenizers split text into words and meaningful subword units. For example, the word “wind” remains [wind] after tokenization. However, “windsurf” becomes [wind] and [##surf], and “windsurfer” becomes [wind], [##surf], and [##er]. Tokenization strategy varies by LLM, but most NLP packages make it easy to do. A brief code example of tokenizing a text with the Hugging Face transformers library is shared in the GitHub repository (as “tokenizer.py”).
Pseudocode (GitHub file: tokenizer.py):
Initialize pretrained tokenizer.
Loop over each word in your text to encode it into a token.
Add special tokens such as [CLS] to mark the beginning of the text and [SEP] to mark the separation between sentences.
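The subword behavior described above (e.g., “windsurfer” becoming [wind], [##surf], [##er]) can be mimicked with a greedy longest-match-first sketch over a toy vocabulary. Real WordPiece tokenizers learn vocabularies of roughly 30,000 subwords from large corpora; the three-entry vocabulary here is purely illustrative.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).
    Continuation pieces are prefixed with '##', as in BERT's tokenizer."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:  # take the longest known piece at this position
                tokens.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known subword covers this span
    return tokens

vocab = {"wind", "##surf", "##er"}
pieces = wordpiece_tokenize("windsurfer", vocab)
```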
LLM techniques for psychological assessment
Feature extraction, fine-tuning, and prompt engineering are three primary ways to use LLMs for psychological assessment (Fig. 2d). Each is explained in detail below, along with example applications in different areas of psychology. The flowchart in Figure 3 may help guide the decision of which technique to use.

Flowchart of techniques for using large language models for psychological assessment. Striped line = optional.
Feature extraction
One straightforward application of LLMs is to obtain contextualized embeddings from an input text. Unlike static embeddings, contextualized embeddings vary depending on how words appear in a sentence, thus capturing nuanced meaning specific to a given context. These embeddings can then be used in downstream analyses (Hussain et al., 2023).
For example, Wulff and Mata (2023) used an LLM to extract contextualized-embedding features from the item language of multiple personality questionnaires. Results indicated that feature extraction can be useful for examining construct validity: Some questionnaires may claim to measure the same construct even though the embedding features show discrepancies (e.g., “jingle fallacy”), and other questionnaires may claim to measure different constructs even though the embedding representations are similar (e.g., “jangle fallacy”). In addition, feature extraction has been used to support the validity of personality structure: LLM word embeddings related to personality show similar factor structure to that from previous research with human ratings (Cutler & Condon, 2023). Correlations were even stronger for the LLM embeddings than the previous ratings data, indicating LLMs may be an effective way to explore personality. Abdurahman et al. (2024) used contextualized embeddings from a pretrained LLM to represent the semantic meaning of items from self-report personality questionnaires. They then used these embeddings to predict individuals’ scores on previously unseen personality items based on linguistic similarity.
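Such embedding-based comparisons typically rely on cosine similarity between item vectors. The following is a minimal sketch with made-up three-dimensional vectors; real contextualized embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical item embeddings: two items claimed to measure the same
# construct (a jingle check) should show high similarity; a low value
# despite a shared label would suggest a jingle fallacy.
item_a = [0.8, 0.1, 0.3]
item_b = [0.7, 0.2, 0.4]
similarity = cosine_similarity(item_a, item_b)
```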
The following pseudocode demonstrates how to obtain and save embeddings after feeding text data through a pretrained LLM. Once generated, the embeddings can be used as conventional predictor variables in traditional regression models (e.g., linear regression).
Pseudocode (GitHub file: save_embeddings.py):
For each text window, tokenize the text and then pass it through the model.
Retrieve the output of the last layer of the model. The output of the language model is three-dimensional (number of samples, number of tokens, embedding size).
Obtain document-level embeddings by either taking the mean of embeddings across all tokens or extracting the embedding associated with the first token (typically the [CLS] token).
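Step 3 of the pseudocode, mean pooling, amounts to averaging each embedding dimension across tokens, as in this toy sketch (3 tokens and 4 dimensions here; real LLM embeddings have hundreds of dimensions per token).

```python
def mean_pool(token_embeddings):
    """Collapse a (tokens x dims) matrix into one document-level embedding
    by averaging each dimension across all tokens."""
    n_tokens = len(token_embeddings)
    dims = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens
            for d in range(dims)]

# Toy model output for one document: 3 tokens, 4 embedding dimensions.
tokens = [[1.0, 2.0, 0.0, 4.0],
          [3.0, 2.0, 2.0, 0.0],
          [2.0, 2.0, 1.0, 2.0]]
doc_embedding = mean_pool(tokens)
```

The resulting document-level vector can then be entered as a set of predictor variables in a conventional regression model.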
Fine-tuning
The process of further training pretrained LLMs with more specific data is called “fine-tuning.” During fine-tuning, model weights are updated to reflect domain-specific language (e.g., language of interest to the psychologist) and adapt model decisions to best fit a specific task (e.g., predicting or scoring a personality trait, as in our working example). During fine-tuning, model weights can be updated either for the whole model or a partial model, depending on how much computational power is available for training (“training cost”). Fine-tuning is helpful for creating specialized models without the burden of needing very large, labeled data sets (Chae & Davidson, 2023; Demszky et al., 2023). To reduce the training cost of fine-tuning, a few samples can be used to update model weights, which can be called “few-shot fine-tuning.” Few-shot fine-tuning is powerful because performance can be improved with far less data than would be required to train a model from scratch. There are some challenges with fine-tuning. First, it is especially important to have high-quality labeled data: The construct validity of the label measurements will be critical because the model will only be as good as the labels (i.e., the quality of the assessment of the dependent variables; Chancellor & De Choudhury, 2020). In addition, fine-tuning is computationally expensive, requiring LLMs to be hosted on large servers to run the training cycle.
Fine-tuning is particularly well suited for assessing psychological constructs from language data. During fine-tuning, a pretrained language model’s parameters are updated to reflect how language relates to the construct of interest, including subtle or nuanced patterns that might be imperceptible to human raters (Luxton, 2014). The resulting fine-tuned model can then accurately assess the construct from new, previously unseen language samples. Simchon et al. (2023) fine-tuned a model predicting personality traits from social media posts. The model was able to identify language patterns indicative of FFM personality traits and use that information to predict the personalities of new users. Our working example also uses fine-tuning to predict personality traits from interview language that does not explicitly ask about personality.
Fine-tuning is also being studied in clinical and social psychology. Ohse et al. (2024) used fine-tuning for depression assessment. The researchers fine-tuned BERT and GPT 3.5 using language responses to a depression interview with 12 labeled examples (i.e., interview transcripts with the corresponding depression scores). They evaluated the models using the F1 score, which is the harmonic mean of model precision and recall. Fine-tuned GPT 3.5 outperformed fine-tuned BERT in the prediction of depression from language in interviews (F1 scores = .82 vs. .62). In social psychology, fine-tuning was used to assess political beliefs from social media posts (Gül et al., 2024). GPT 3.5, Llama 2, and Mistral LLMs were fine-tuned to predict user alignment with political figures and stances (e.g., climate change, feminism). Fine-tuned GPT performed the best; F1 scores were all over .80 and at times exceeded .90. In cognitive psychology, LLMs have been used to explore the connection between cognitive abilities and behavior (Hardy et al., 2023). Fine-tuned LLMs may eventually be useful tools for the study of cognitive processes, such as memory, attention, perception, reasoning, and learning.
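For readers unfamiliar with the F1 metric, it can be computed from a classifier's confusion counts in a few lines. The counts below are hypothetical and are not drawn from the cited studies.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for a binary depression classifier:
# 8 cases correctly flagged, 2 false alarms, 2 missed cases.
f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because F1 balances precision against recall, it is more informative than raw accuracy when the classes are imbalanced, as clinical labels often are.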
As model size continues to grow, the cost of traditional fine-tuning continues to increase beyond available resources. This has been addressed through parameter-efficient fine-tuning (PEFT; Lialin et al., 2024). “PEFT” is a broad term referring to any strategy that updates only a small set of model weights (e.g., a subset of existing model weights). PEFT strategies continue to be developed, and many have demonstrated success compared with traditional fine-tuning. For example, Lin et al. (2024) trained a model using PEFT to generate positive alternatives to cognitive distortions, and their PEFT model outperformed other models.
In our working example—fine-tuning a model to predict FFM personality ratings from interview language—embeddings are updated to reflect nuances of the language used by the interviewee, and associations between the language and levels of personality ratings are learned. After training, the model can be used to predict personality ratings from unseen interview language (future research will have to assess how well such models generalize to other types of language samples). See the example fine-tuning code on GitHub and the upcoming Model-Training Considerations subsection.
Prompt engineering
Prompt engineering involves carefully designing input text (“prompts”) to guide the output of LLMs, enabling improved performance without updating or retraining the model weights. It can be used to perform tasks or generate new text. Text prompts can be manually provided by the researchers to pretrained LLMs to either classify input text or perform a specific task. These prompts are considered “hard prompts” because they are specific directives given to the model. Hard prompts are used when output text needs to strictly adhere to certain criteria (e.g., provide a specific assessment score, summarize text, or provide another response). Instructional tokens can also be used in hard prompts by adding them at the beginning of the input sequence. For example, if you want the LLM to provide a factual answer to your question, you can prepend your question with the instructional token “[ANSWER_QUESTION]” (e.g., “[ANSWER_QUESTION] What is the capital of the United States?”). Interacting with LLMs through hard prompting can be simple, intuitive, and less computationally intensive.
Prompting strategies differ based on the amount of available labeled data. Zero-shot prompting provides a model with instructions for a task but no example data or example answers for the model to learn from. One-shot prompting provides the model with instructions and one labeled example before the model completes the task, and few-shot prompting includes multiple labeled examples in the prompt before the model provides its response. The key difference from fine-tuning is that the labeled examples included in prompts do not modify any model parameters. Because model weights are not updated, there is no computational-training cost.
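The only difference among these strategies is how many labeled examples are placed in the prompt text itself. A minimal sketch (the function name, instructions, and example texts are our own hypothetical illustrations):

```python
def build_prompt(instructions, examples, new_text):
    """Assemble a zero-, one-, or few-shot prompt.

    `examples` is a list of (text, label) pairs: an empty list yields a
    zero-shot prompt, one pair a one-shot prompt, several pairs a
    few-shot prompt. The examples live only in the prompt text; no
    model weights are updated.
    """
    parts = [instructions]
    for text, label in examples:
        parts.append(f"Language: {text}\nRating: {label}")
    parts.append(f"Language: {new_text}\nRating:")
    return "\n\n".join(parts)


# Zero-shot: instructions only.
zero_shot = build_prompt("Rate extraversion from 1 to 5.", [], "I love big parties.")

# Few-shot: two labeled examples precede the new case.
few_shot = build_prompt(
    "Rate extraversion from 1 to 5.",
    [("I stayed home all weekend.", 1), ("I talked to everyone at the event.", 5)],
    "I love big parties.",
)
```

The assembled string would then be sent to the model as ordinary input text.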
Another strategy is “soft prompting,” which is most often used in a supervised-learning context. Soft prompting requires model training and therefore may be referred to as prompt “tuning.” The soft prompt is a set of trainable embeddings that are added to the input text. The embeddings from the soft prompt are trained with labeled examples. These embeddings then act like a filter, cuing the model as to what language is associated with the task. Soft prompting is less computationally intensive than fine-tuning because only the added prompt embeddings need to be updated. Peng et al. (2024) compared hard prompting and soft prompting when identifying adverse events and social determinants of health from clinical narratives. Soft prompting performed better than hard prompting, indicating that LLMs can learn better from trainable soft-prompt embeddings than from human-generated hard prompts. Soft prompting reduced computing costs by 97% compared with fine-tuning. However, large models with several billion parameters were required for soft-prompt models to show these benefits.
In prompt-engineering studies, the prompt can vary for each case in the data set to improve results or better study individual differences. For example, K. Yang et al. (2024) used LLMs to assess social attitudes and the propensity to be influenced by social contexts based on demographics (e.g., age, race, location, income, education level). The model performed poorly in zero-shot prompting. However, few-shot prompting that included labeled examples customized to match certain profile features for each individual improved performance.
Overall, prompt engineering allows for model customizations without the same data and resource requirements as fine-tuning, making it quicker (Chae & Davidson, 2023). The most significant concern about prompt engineering is that, in contrast to fine-tuning, model parameters are not updated; psychological applications typically require generalizable, nuanced knowledge about a topic (Demszky et al., 2023). However, as the barriers to fine-tuning continue to grow for the newer, more advanced LLMs (e.g., model size, closed-source), prompt engineering has become an exceedingly popular and effective strategy (Hua et al., 2024).
Prompt engineering has been applied across a variety of psychology domains. In cognitive psychology, GPT-4 predictions were compared with human-memory performance (Huff & Ulakçı, 2024). GPT-4 was prompted to rate the relatedness of pairs of (a) context and (b) garden-path sentences and the memorability of the garden-path sentences. GPT-4 ratings of memorability significantly corresponded with human-memory performance. This indicates LLMs may have utility as cognitive-assessment tools in the future. In personality psychology, zero-shot prompting was employed to assess personality traits from social media posts (Peters & Matz, 2023). The LLM was hard prompted to attend to how personalities were reflected in language from online posts and to provide a numerical rating for each of the FFM personality traits. Results demonstrated moderate effect sizes for predicting personality.
Zero-shot prompting of GPT-3.5 has been used to assess attitudes in social psychology (Simons et al., 2024). Hard prompts were used to obtain GPT ratings on individuals’ attitude certainty, importance, and moral conviction from social media posts. The GPT ratings replicated prior factor-analytic structure and internal-consistency reliability of human-attitude ratings. This study was notable for its adherence to a psychometric construct-validation approach for evaluating LLM-generated ratings based on language.
In clinical psychology, Tu et al. (2024) used zero-shot and few-shot prompting for posttraumatic-stress-disorder (PTSD) assessment from language in clinical interviews. GPT-4 performed best with few-shot prompting, and zero-shot prompting performed best with Llama-2. Predicting several different variable types from several different interview types, GPT-4 was, on average, 10% more accurate than Llama-2, reaching an accuracy of 68%. GPT-4 showed close similarity to human ratings for PTSD-related scale variables and more conservative predictions, whereas Llama-2 consistently overpredicted. Jeon et al. (2024) used a two-step prompting strategy to identify suicide risk from social media posts. In the first step, MentaLlama (Llama, fine-tuned on social media data related to mental health) was assigned an expert identity, provided a dictionary with suicide-related terms, and asked to extract key phrases from the posts. Jeon et al. found that few-shot prompting in Step 1 performed better than zero-shot, so a few labeled examples were added to the prompt. In the second step, a more generic LLM was prompted to summarize key phrases, and multiple summaries were evaluated for consistency. Recall of suicide-related posts was consistently high. Different expert-identity assignments were found to influence the extracted phrases, indicating that prompting LLMs to have different roles may produce different results.
Some research has used both fine-tuning and prompt engineering for psychological assessment. Galatzer-Levy et al. (2023) conducted zero-shot prompting with an LLM that had previously been fine-tuned on sources of medical language. The fine-tuned model was prompted to assess psychiatric functioning from clinical interviews and performed particularly well for depression detection but displayed difficulties with co-occurring diagnoses. Lin et al. (2024) combined and compared tuning and engineering strategies for two tasks in a Mandarin Chinese data set: (a) detecting cognitive distortions (i.e., problem-thinking styles related to depression) and (b) generating positively framed alternatives. Comparison of fine-tuning a pretrained language model versus transfer learning found fine-tuning was more accurate in detecting cognitive distortions. The researchers then compared fine-tuning, prompt tuning (P-tuning, Version 2), and prompt engineering for generating positive alternatives to cognitive distortions. The prompt-tuned model (ChatGLM-6B with soft embeddings) outperformed both the fine-tuned model and prompt engineering at generating positively reframed sentences. These findings suggest that prompt tuning a smaller model can be more efficient than fine-tuning or prompt engineering for generating psychologically meaningful text.
Hard prompts can be provided to the LLM with or without examples (e.g., text and variable score pairs). Below is an example of a hard prompt, which can be enhanced with additional instructions, such as specifying a perspective or task: Language: [include the text here] Based on the text, please rate the level of [construct of interest here] by providing a numerical score [insert scale here].
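For illustration, the template above can be filled programmatically (a hypothetical sketch; the function name and scale wording are placeholders of our own):

```python
def hard_prompt(text, construct, scale="from 1 (very low) to 5 (very high)"):
    """Fill the hard-prompt template with a language sample and a construct."""
    return (
        f"Language: {text}\n"
        f"Based on the text, please rate the level of {construct} "
        f"by providing a numerical score {scale}."
    )


prompt = hard_prompt(
    "I double-check every detail before submitting my work.",
    "conscientiousness",
)
```

The same template can be reused across constructs and language samples by changing the arguments.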
Soft prompts, in contrast, involve prepending trainable embeddings to the model input. The following pseudocode demonstrates how to prepend soft prompts to language input embeddings:
Pseudocode (GitHub file: soft_prompt.py):
Create a soft prompt of the given prompt length and model-embedding size (a set of trainable embeddings).
Prepend the soft prompt to the input of the model.
Train the model using updated input.
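The pseudocode above can be sketched without a deep learning framework to show the mechanics; in practice the soft prompt would be a trainable tensor (e.g., a PyTorch `nn.Parameter`) updated by backpropagation while the base model's weights stay frozen. Function names and dimensions here are hypothetical:

```python
import random


def make_soft_prompt(prompt_length, embed_dim, seed=0):
    """Create the soft prompt: `prompt_length` randomly initialized vectors
    of size `embed_dim` (the embeddings a training loop would update while
    the base model's weights stay frozen)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(prompt_length)]


def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Prepend the soft-prompt vectors to one input's token embeddings."""
    return soft_prompt + input_embeddings


soft = make_soft_prompt(prompt_length=4, embed_dim=8)
token_embeddings = [[0.0] * 8 for _ in range(10)]  # stand-in for 10 token vectors
model_input = prepend_soft_prompt(soft, token_embeddings)
```

The model then processes the concatenated sequence as if the soft-prompt vectors were ordinary token embeddings.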
Processing labels
There are several important considerations for processing psychological-variable labels (i.e., dependent variables) that may be predicted using LLMs for psychological assessment. For details on these considerations, including merging text data with psychological-variable data, scaling of variables, splitting the data set for training and testing, and avoiding data leakage, see the Supplemental Material available online.
LLM selection
Key LLM-selection decision points include their training data, text limits, size (measured in number of parameters and memory required to store the model), usage limits, and model transparency (Fields et al., 2024; Fig. 2e). It is becoming more common for models to have “model cards” that provide this information in an organized fashion (Mitchell et al., 2019). Other important considerations in LLM selection include characteristics of the assessment data, task specifics (e.g., what you want the model to do), and computing resources. Table 2 describes common LLMs, including their training data, model size, and text limits. Parameters are the building blocks of LLMs and include weights, biases, word embeddings, neural-network layers, self-attention mechanisms, and feed-forward neural networks. LLMs are classified as small if they contain fewer than 1 billion parameters, medium if they contain 1 to 10 billion parameters, large if they contain 10 to 100 billion parameters, and very large if they contain more than 100 billion parameters (Minaee et al., 2024).
Potentially Useful Large Language Models for Psychological Assessment
Note: K = thousand; M = million; B = billion; T = trillion; API = application programming interface; BERT = Bidirectional Encoder Representations from Transformers; GPT = Generative Pretrained Transformer.
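The size categories from Minaee et al. (2024) amount to a simple rule, sketched here as a hypothetical helper (counts are in raw parameters):

```python
def size_category(n_parameters):
    """Classify an LLM by raw parameter count (Minaee et al., 2024)."""
    if n_parameters < 1e9:
        return "small"
    if n_parameters < 10e9:
        return "medium"
    if n_parameters < 100e9:
        return "large"
    return "very large"
```

For example, a 110-million-parameter BERT-base counts as small, whereas a 175-billion-parameter GPT-3 counts as very large.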
Google’s BERT (Devlin et al., 2019) is one of the earliest and most frequently used LLMs. BERT is a small, encoder-only model best suited for tasks requiring understanding of full-text sequences, such as text classification or named-entity recognition (NER). Additional BERT-based models continue to be developed, such as RoBERTa (an optimized version of BERT using more training data and a longer training time, among other training improvements; Y. Liu et al., 2019), DistilBERT (a slimmer, faster version of BERT; Sanh et al., 2019), and XLNet (which uses a generalized autoregressive pretraining objective; Z. Yang et al., 2019). Although the term “LLM” generally does not include the initial transformer models mentioned previously, they remain a great option because of modest computing requirements and optimization for text classification.
GPTs (Achiam et al., 2023) are a family of decoder-only models by OpenAI that marked the transition to formal LLMs. These are very large models, containing more than 175 billion parameters, that are behind ChatGPT. Although prior GPT models have been publicly released, the most advanced models may be unavailable to the public. However, some can be fine-tuned through APIs. Another family of LLMs is the Llama family by Meta (Touvron et al., 2023). Llama models range in size from medium to large and are open-source, meaning the model weights are available to the research community (Minaee et al., 2024). For more information about the structure of specific models, performance comparisons, and training considerations, see Minaee et al. (2024), Naveed et al. (2024), and W. X. Zhao et al. (2023).
LLMs are becoming increasingly accessible. Hugging Face is an open-source community that provides tool access (Hussain et al., 2023). Hugging Face has two main components: first, an online repository that stores trained language models, information regarding model performance, publicly available data sets, and detailed tutorials and, second, a series of Python libraries that provide simplified code to access transformer models, tokenizers, and optimization tools. In addition, Hugging Face stores domain- and task-specific models previously created by others that are open to the public, for example, BERT-based classification models trained on social media posts to identify sentences discussing anxiety or depression, Llama-based chatbots trained to provide empathic support and resources about mental-health treatment, and RoBERTa-based models fine-tuned on PubMed articles.
Maximum sequence length
LLMs have varying maximum sequence lengths—also called “context windows”—which limit the number of tokens that can be input into the model at one time. If the token limit is exceeded, the text input will be truncated at the token limit, potentially cutting off important information. Some earlier models, such as BERT, have relatively short limits (e.g., 512 tokens, which is around 400 words), whereas models such as GPT and Llama support context windows of several thousand tokens. Recently, some models have pushed these limits upward of 200,000 tokens (e.g., Claude; Anthropic, 2024). Although larger context windows may improve performance on long texts, they also significantly increase computational cost and memory requirements, leading to less common use in applied research to date (Y. Ding et al., 2024).
Currently, there are multiple other strategies to process longer texts (see Fig. 4). (a) Truncate the text (i.e., discard all text that is beyond the token limit). This is the default strategy, so if long texts are not managed in other strategic ways, models will automatically truncate texts. (b) Trim the text (i.e., select portions of the original text to stay under the token limit). Research has shown that performance is better when tokens are selected from throughout the document rather than simply truncating (Tuteja & Juclà, 2023). (c) Chunk the text. “Chunking” splits the text into blocks that are each within the token limit. For example, if the token limit is 512 and there are 1,536 tokens total, chunking would split the original long text into three chunks. The chunks can then be input to the model separately, and the results are averaged across them. (d) Use a “sliding window” approach. In a sliding-window approach, the original text is split into blocks that are below the token limit, but the blocks contain overlapping text that is referred to as the “stride.” This overlap helps preserve the context across chunks but will increase training time.

Strategies for handling long text. (1) Truncate text that is longer than the token limit. (2) Trim selected text so it is shorter than the token limit. (3) Chunk the long text into segments the same length as the token limit. (4) Split long text into segments shorter than the token limit, with each segment overlapping.
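Strategy (c), chunking with averaged results, can be sketched as follows (`predict_chunk` is a hypothetical stand-in for a model call):

```python
def chunk_tokens(tokens, limit):
    """Split a token list into consecutive chunks of at most `limit` tokens."""
    return [tokens[i:i + limit] for i in range(0, len(tokens), limit)]


def predict_long_text(tokens, limit, predict_chunk):
    """Score each chunk separately, then average the chunk-level predictions."""
    chunks = chunk_tokens(tokens, limit)
    scores = [predict_chunk(chunk) for chunk in chunks]
    return sum(scores) / len(scores)


tokens = [f"tok{i}" for i in range(1536)]
chunks = chunk_tokens(tokens, 512)  # 1,536 tokens -> 3 chunks of 512
```

In practice, `predict_chunk` would tokenize and score each block with the fine-tuned model.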
Other techniques may involve using one batch per document or hierarchical modeling. LLMs process data in batches, updating model parameters after each batch. Creating one batch for each long document enables the model to process one full document at a time. Hierarchical-modeling techniques may also organize long texts into manageable chunks and ensure adequate aggregation of units into participant-level representations (Dai et al., 2022; M. Ding et al., 2020; Wu et al., 2021). This may address the concern of text-participant attribution and can help with equal weighting of text samples when some participants’ texts are longer than others.
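Equal weighting of participants, regardless of how many chunks their texts produce, can be achieved by averaging chunk-level embeddings into one participant-level vector. A minimal sketch with plain lists (function name is our own):

```python
def participant_embedding(chunk_embeddings):
    """Average chunk-level embedding vectors into one participant-level
    vector, so participants whose texts produce more chunks are not
    overweighted relative to participants with shorter texts."""
    n, dim = len(chunk_embeddings), len(chunk_embeddings[0])
    return [sum(vec[d] for vec in chunk_embeddings) / n for d in range(dim)]


participant_vector = participant_embedding([[1.0, 2.0], [3.0, 4.0]])
```

Each participant contributes exactly one vector to downstream analyses, however long their text.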
The following pseudocode example demonstrates how to implement the sliding-window approach:
Pseudocode (GitHub file: sliding_window.py):
Loop over the text and divide text into subtexts of length of window.
Use overlap variable to decide how much overlap to keep between subtexts.
Tokenize each subtext using a new or pretrained tokenizer available on Hugging Face or simpletransformers.
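A runnable sketch of the sliding-window logic (token lists stand in for tokenizer output; in practice each subtext would be tokenized with a Hugging Face or simpletransformers tokenizer):

```python
def sliding_window(tokens, window, overlap):
    """Split tokens into subtexts of `window` tokens; consecutive subtexts
    share `overlap` tokens of context (the "stride" of shared text)."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap
    subtexts = []
    for start in range(0, len(tokens), step):
        subtexts.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
    return subtexts


windows = sliding_window(list(range(10)), window=4, overlap=2)
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Larger overlap preserves more context across subtexts at the cost of more blocks and longer training time.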
Required computing resources
Computing resources are of the utmost importance (Kaddour et al., 2023). Small LLMs can be run using the central processing unit of any computer, but many LLMs require graphics processing units (GPUs). GPUs are computer processors, originally designed for video gaming, that perform parallel computations and process large amounts of data quickly, making them well suited for machine learning and for working with LLMs. Baseline GPU memory requirements for fine-tuning LLMs can reach upward of 80 GB (also the size of the largest commercially available GPUs; Tuggener et al., 2024). To estimate how much memory is required, a common rule of thumb is roughly 8 bytes of GPU memory per model parameter (e.g., roughly 80 GB to fine-tune a 10-billion-parameter model).
In our working example, we used university-based computing resources. On-demand access to cloud servers can be helpful, but university-based computing was more cost-effective and helpful for batch job processing. Even with a small language model, running the fine-tuning analyses required more than 30 GB of GPU RAM.
Managing memory usage is also critical for working with LLMs. We explored strategies to reduce both static and dynamic memory requirements, including precision reduction, data streaming, gradient checkpointing, and mini-batch optimization. For a detailed discussion of these strategies and implementation examples, see the Supplemental Material.
Model evaluation
Models must be configured during training to produce the desired output (Fig. 2f). In NLP tasks, language-based predictions generally fall into two categories, classification and regression, each with its own evaluation metrics (Berggren et al., 2019). Language can be used to predict a binary classification (e.g., Does someone have a specific attribute, yes or no?), multiclass classifications (e.g., a set of possible labels), or continuous values (e.g., a ratio score). Multiclass labels can be nominal (e.g., predicting one of five political affiliations) or ordinal (e.g., predicting one of four increasing difficulty levels). Models can also be trained as multilabel classifiers in which multiple labels can be selected for each language sample. Finally, a regression task trains models to predict continuous values.
Classification and regression tasks are evaluated using different metrics. Classification evaluation metrics are focused on prediction accuracy. Regression-based metrics are focused on reducing prediction error. Descriptions of evaluation across different metrics can be further studied in tutorials by Vickers et al. (2024) and Pargent et al. (2023).
Most documentation about language modeling uses the term “language classification” to describe the broad category of tasks mentioned above (including regression). Most available information refers to classification tasks rather than regression tasks. For some tools, such as Simple Transformers (Rajapakse, 2019), the default information will address classification tasks, but steps to convert the code to regression are included in the documentation. In some cases, information about classification will still apply to regression because a regression task can be conceptualized as a classification task with one label. In general, classification tasks tend to achieve better overall performance, but regression tasks offer more precise predictions and are often more relevant to psychological constructs, which are typically measured continuously.
In our working example, the personality scores were continuous, and the model was trained to complete a regression task. A model can be configured for regression using the Simple Transformers library by setting the regression parameter to “True” in ClassificationArgs, as shown in regression.py on Line 55. Lines 80 to 81 show how to extract the predictions of personality ratings during testing.
Model-training considerations
Analyzing text data with LLMs relies heavily on general ML procedures (Fig. 2g). Pargent et al. (2023), Choi et al. (2020), Badillo et al. (2020), T. Jiang et al. (2020), and Pandey et al. (2020) are helpful overview articles and tutorials. Coursera (https://www.coursera.org/) and Towards Data Science (https://towardsdatascience.com/) are also practical resources for examples, tutorials, and discussions.
Cross-validation
Cross-validation is a technique used to estimate model reliability and accommodate limited amounts of data (Yates et al., 2023). The data are divided into equal portions, or “folds.” The number of folds (referred to as “k”) may vary; five or 10 are the most common. In each iteration, k − 1 folds are used to train the model, and the remaining fold is used to test it. This process is repeated until each fold has served as the testing fold. The overall estimate of model performance is the average across all iterations (Wong, 2015). For smaller data sets, leave-one-out cross-validation is recommended (see Table 1). Cross-validation is important because it provides a more reliable estimate of model performance, reducing bias from randomness in the data. Variability in performance across iterations can indicate inconsistencies in the data, increased data complexity, or difficulties with the model’s ability to learn (Shulga, 2018). In our working example, we used five-fold cross-validation to help estimate model performance.
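The fold construction can be sketched as follows (hypothetical helper functions; libraries such as scikit-learn provide equivalent utilities):

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, then split them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train, test) index pairs; each fold serves as the test set once."""
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


folds = kfold_indices(100, k=5)
splits = list(train_test_splits(folds))
```

With 100 cases and five folds, each iteration trains on 80 cases and tests on the held-out 20; overall performance is the average across the five test folds.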
Hyperparameter tuning
Hyperparameters are settings that affect how a model learns; adjusting them to optimize model performance is known as “hyperparameter tuning.” There are many hyperparameters. Three are emphasized as having the greatest impact: learning rate, batch size, and number of epochs (Devlin et al., 2019). The learning rate determines how much the model’s parameters are adjusted in response to training examples. Higher learning rates may speed up the training process but may overshoot optimal parameter values; lower learning rates remedy this problem but will slow down the training process. Specifically for LLMs, learning rates tend to be much smaller than with other ML models because LLMs operate best with subtle adjustments. Learning-rate warm-up strategies are also useful when training LLMs because they gradually increase the learning rate at the onset of training, facilitating stability. Batch size is the number of data samples that are seen by the model before calculating errors and updating the parameters. Batch size is dependent on available computing resources because all data for a given batch need to be held in memory before the model’s weights are updated. Epochs are the number of times the model passes through the entire data set. Training for too few epochs can result in underfitting such that the model does not learn enough about the data. Training for too many epochs can result in overfitting such that the model learns the training data too closely and then does not perform well on other, unseen data.
When determining values for hyperparameters, it is recommended to begin with the same values used to train the base LLM (Devlin et al., 2019). These values are likely published, and some models (e.g., BERT) even recommend possible ranges for hyperparameter values for future fine-tuning. It is then important to experiment with different settings to determine what works best for a particular data set. There are multiple strategies for finding optimal hyperparameter values; grid search, automatic optimization, or random search are the most common (Bischl et al., 2023). A grid search will systematically train multiple iterations of models, trying every combination of values within the given ranges. Automatic-optimization strategies will dynamically adjust the values of specific hyperparameters each iteration, testing values that are uniquely promising and using algorithms to predict what those values would be. Random search tests a wide variety of values within a specified range, with no meaningful decisions about which values to try. Note that the optimal settings for a given model may not fall in the recommended-values range. In this situation, automatic-optimization strategies can be helpful because they can efficiently expand hyperparameter values away from the recommended ranges based on context-specific information.
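A grid search over learning rate and number of epochs can be sketched as follows; `evaluate` stands in for training the model and returning mean cross-validation performance (the toy scoring function below is purely illustrative):

```python
import itertools


def grid_search(grid, evaluate):
    """Train and evaluate one model per combination of hyperparameter
    values; return the best-scoring combination and its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g., mean performance across CV folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [1e-5, 3e-5, 5e-5], "num_epochs": [2, 3, 4]}

# Toy scoring function standing in for model training: it happens to
# favor a learning rate of 3e-5 and more epochs.
best, score = grid_search(
    grid,
    lambda p: -abs(p["learning_rate"] - 3e-5) * 1e5 + p["num_epochs"] * 0.01,
)
```

Random search would sample combinations instead of enumerating every cell, and automatic optimization would choose the next combination adaptively.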
The tuning process can be time-consuming and requires significant computing resources because each combination of parameters is used to train the entire language model. Overfitting is a concern (X. Liu & Wang, 2021). Several strategies are recommended to avoid overfitting in hyperparameter tuning: (a) Early stopping prevents models from overfitting by determining the optimal number of epochs and ending the training process once the model’s performance is no longer improving after a specified number of epochs, typically five to 10 (Dodge et al., 2020). (b) The optimal values are those that resulted in the greatest average performance across all validation folds—not the best values of any individual run. (c) Dropout and weight decay can reduce overfitting. Dropout randomly removes connections between elements of the model during training, and weight decay adds penalties to highly influential paths to encourage the model to examine patterns more generally (Srivastava et al., 2014). After the optimal hyperparameters are determined, the full model should be retrained using these values. See the GitHub page for important training arguments for hyperparameters and early stopping.
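The early-stopping logic in (a) can be sketched as follows (a hypothetical helper; `val_losses` stands in for per-epoch validation loss):

```python
def train_with_early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch): training ends once validation loss
    has not improved for `patience` consecutive epochs."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Validation loss improves until epoch 2, then stalls.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64, 0.65]
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=3)
```

In a real training loop, the model checkpoint from `best_epoch` would be the one retained.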
Weights and Biases is a helpful tool for hyperparameter tuning (Biewald, 2020). This software is free for students, educators, and academic researchers and facilitates hyperparameter sweeps. Weights and Biases can be integrated with other libraries (e.g., Simple Transformers, Hugging Face transformers) to automatically log training and evaluation data in real time and visualize model performance. Figure 5 shows an example of a hyperparameter-tuning log. The results of each combination of learning rate and epochs are plotted, indicating model performance with respect to different combinations of these hyperparameters. Note the range in performance across different combinations, providing useful information about optimal hyperparameter settings.

Example of Weights and Biases hyperparameter-tuning log.
Model visualization
Deep learning uses nonlinear relations across multiple layers, which makes it difficult to understand precisely how LLMs make decisions (this is known as the “black box” problem). Techniques are being developed to increase the explainability of model decisions (H. Zhao et al., 2024). However, simple model visualizations can be helpful (Fig. 2h). One simple method is to correlate token usage from the text input with the target variable. Examining tokens that appear in more than 10% of the sample and selecting those with the highest positive and highest negative correlations is a straightforward approach to identify potentially important features.
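The token-correlation approach can be sketched with plain Python (hypothetical helpers; real use would operate on tokenizer output rather than whitespace-split words):

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def token_outcome_correlations(texts, scores, min_doc_frac=0.10):
    """Correlate token presence (0/1 per document) with a continuous outcome,
    keeping tokens that appear in more than `min_doc_frac` of documents."""
    docs = [set(text.lower().split()) for text in texts]
    vocab = {token for doc in docs for token in doc}
    correlations = {}
    for token in vocab:
        presence = [1.0 if token in doc else 0.0 for doc in docs]
        if sum(presence) / len(docs) <= min_doc_frac:
            continue
        correlations[token] = pearson(presence, scores)
    return correlations


texts = ["happy fun day", "sad slow day", "happy happy talk", "sad quiet night"]
scores = [5.0, 1.0, 4.0, 2.0]
correlations = token_outcome_correlations(texts, scores)
```

Sorting the resulting dictionary by correlation surfaces the tokens most positively and most negatively associated with the target variable.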
Topic modeling is useful for providing insights into the content of language data and reducing the manual labor required to explore themes qualitatively.
If working with longer language samples, it is helpful to split samples into sentence-level data when performing topic modeling to adequately capture the variation of topics discussed by one person. Because the narratives were so long in our working example, we reformatted the data set so that each participant utterance appeared in its own row.
It is also helpful to visualize embeddings in two-dimensional space. t-distributed stochastic neighbor embedding (t-SNE; Van der Maaten & Hinton, 2008) is a dimensionality-reduction technique used to visualize high-dimensional data, such as LLM embeddings, in two-dimensional space. The relative positioning of data points in the visualization provides insight into semantic meaning similarity. CLS embeddings, in particular, are useful for visual inspection because they represent the embedding for the full-text sample. The following pseudocode demonstrates visualizing embeddings in two-dimensional space using t-SNE:
Pseudocode (GitHub file: embedding_visualize.py):
Extract CLS embeddings from pretrained or fine-tuned model for each text in the data set.
Use t-SNE to transform embeddings into two-dimensional space.
Plot the scatter plot for all samples.
Attention weights for each token in the input text can also be visualized (Vig, 2019). This provides information about the importance of each language feature for prediction of the outcome. Create a two-dimensional matrix to visualize CLS tokens:
Pseudocode (GitHub file: attention_visualize.py):
Extract attention layers from the model output.
Select the layer and head for which to view attention weights (most commonly the 0th layer and 0th attention head).
This will provide a square matrix as a two-dimensional array.
A heat map can illustrate how much attention (or weight) is given to each token in the input text to perform the output task. Create a heat map of the above attention matrix, which will show how each token is semantically connected to each other token in the input text:
Pseudocode (GitHub file: attention_visualize.py):
Extract attention layers from model output.
Select attention layer and head to visualize.
Visualize the attention matrix using heat map.
Important Issues for Consideration and Future Directions
In this section, we discuss issues, implementation, and future directions that will be important for using LLMs for psychological assessment.
Ethical considerations
LLMs contain biases that are prevalent in society and that researchers and the field at large should be aware of and prepared to continuously address in a transparent manner (Bender et al., 2021). Working with LLMs may involve sensitive data that need to be handled securely to protect the privacy of, and demonstrate respect for, research participants and patients. In addition, LLMs require significant energy resources, which has a detrimental environmental impact.
Bias
LLMs can be conceptualized as “stochastic parrots” that lack human understanding of meaning. With some randomness, they confidently repeat back what they were trained on, which will include stereotypes and harmful biases that are prevalent in online training data (Bender et al., 2021). Training data from vast online samples reflect society at large. As a result, they will have negative biases against minority groups that can perpetuate harm. Research has demonstrated bias in LLMs across gender, race, culture, and other demographics (Raza et al., 2024), including showing a preference for male pronouns for certain professions (de Vassimon Manela et al., 2021), indicating some religious groups are more violent than others (Abid et al., 2021), favoring majority groups (Zhang et al., 2020), and propagating differential treatment recommendations based on race (Omiye et al., 2023). These biases emerge when LLMs are trained on data that provide an imbalanced or inaccurate representation of a group or do not represent them at all. Although LLMs contain bias, the level varies (Nadeem et al., 2020; Raza et al., 2024). Researchers may select LLMs based on fairness evidence. In the future, it may be beneficial to concentrate on specific representative training samples rather than simply collecting as much training data as possible (Bender et al., 2021).
It is unclear whether bias in LLMs can be eliminated. Without careful evaluation in psychological research, biases can be perpetuated and amplified. For example, LLMs trained on biased data may perpetuate job and financial inequality, amplify harmful content online, misdiagnose and influence clinician decision-making in health care, and otherwise prioritize majority backgrounds (Ferrara, 2023). Techniques are being developed that may reduce model biases, including data augmentation, bias-correction algorithms, and fairness metrics (Cai et al., 2024; Liang et al., 2021; Raza et al., 2024; Sun et al., 2019). However, these techniques cannot fully remove bias. Psychological researchers using LLMs for psychological assessment must be (a) aware of bias, especially bias directly relevant to their area of research; (b) active in ensuring fairness in model development (e.g., comparing model results and predictions across various groups); (c) transparent about the biases in the models that they use (e.g., describing the biases and their potential influence on the results in discussion sections); (d) up to date in their LLM use with the latest techniques to reduce harm; and (e) supportive of and collaborative with researchers from minority groups, especially groups that might be a focus of the research. Psychology-research conferences should hold regular panels with experts on LLMs to spread awareness of bias and of best practices for managing bias in research. Together, these strategies will help the field understand and mitigate bias, reduce the possibility of harm, and build more useful models.
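The per-group comparison in point (b) can be sketched as a simple demographic-parity check: compare the model's mean predicted score across groups and flag large gaps for auditing. The group labels, scores, and threshold below are hypothetical illustrations, not a validated fairness procedure:

```python
from statistics import mean

def demographic_parity_gap(scores, groups):
    """Return the spread between the highest and lowest group mean of
    model-predicted scores, plus the per-group means themselves."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    means = {g: mean(vals) for g, vals in by_group.items()}
    return max(means.values()) - min(means.values()), means

# Hypothetical model-predicted scores and demographic group labels.
scores = [0.62, 0.58, 0.60, 0.31, 0.35, 0.33]
groups = ["A", "A", "A", "B", "B", "B"]
gap, means = demographic_parity_gap(scores, groups)
# A large gap flags the model for closer auditing; on its own it is
# not proof of bias, because true group differences may exist.
```

In practice such a check would be one entry in a broader fairness audit (e.g., comparing error rates, not just mean predictions, across groups).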
Privacy
Text data are often more sensitive than questionnaire data, and it is imperative to take measured steps to protect them. Research participants and patients should complete transparent consent forms that describe the potential risks and benefits and the plans for data usage, in accordance with American Psychological Association (APA) ethical principles (APA, 2017). Data should be de-identified when possible. Data should be stored on a secure, password-protected, and encrypted server accessible only to authorized personnel (all of whom have training in data security). In addition, the server and its network can include a firewall to protect the data. Regular audits of the security system can be conducted to prevent data breaches.
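A first pass at the de-identification step can be sketched with pattern-based scrubbing of obvious identifiers. The patterns below are illustrative assumptions and are deliberately minimal; real de-identification pipelines also require named-entity recognition and human review to catch names, locations, and indirect identifiers:

```python
import re

# Minimal pattern-based scrubbing of obvious identifiers. These
# regexes are illustrative, not exhaustive: they miss names, dates,
# addresses, and many other identifiers.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def deidentify(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

sample = "Contact me at jane.doe@example.org or 555-867-5309."
print(deidentify(sample))
# → Contact me at [EMAIL] or [PHONE].
```

Placeholder tokens (rather than deletion) preserve sentence structure for downstream language analysis.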
At times, it may be necessary to work with third-party service providers. This must be done in a manner to which research participants and patients have consented and that complies with the relevant regulations and oversight bodies (e.g., Institutional Review Boards, the Health Insurance Portability and Accountability Act, the General Data Protection Regulation). As few third parties as possible should be involved in the process. When using APIs, connections should be secure, authenticated, and encrypted. Vendors will have compliance standards that should be reviewed.
Environmental impact
Deep learning is computationally expensive. It requires significant power, which leads to a growing carbon footprint (Patterson et al., 2021). As a result, researchers are devising ways to train models more efficiently and reduce negative consequences, such as excessive water usage and CO2 emissions (Rillig et al., 2023). The estimated energy usage for an analysis can be directly calculated, which can be helpful for planning efficient analyses (Hershcovich et al., 2022; Strubell et al., 2020). Researchers should be aware of the energy use that potential analyses would require and take steps to reduce unnecessary analyses. This may include reporting training times, using efficient computational hardware and models, and being aware of power resources used—for example, by data centers and cloud-computing services (Strubell et al., 2020). Researchers should also consider any potential positive downstream environmental impacts of a model (Hershcovich et al., 2022). The rapid increase in the power required for training LLMs poses serious ethical dilemmas for researchers that should be understood, prioritized, and addressed in transparent ways moving forward.
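The direct energy calculation referenced above follows the general approach of Strubell et al. (2020): hardware power draw, multiplied by runtime and a data-center overhead factor, then converted to CO2 via grid carbon intensity. The specific figures below (GPU power, PUE, carbon intensity) are illustrative assumptions, not measurements:

```python
def training_footprint(gpu_count, gpu_power_kw, hours,
                       pue=1.5, kg_co2_per_kwh=0.4):
    """Estimate energy (kWh) and CO2 (kg) for a training run.
    pue = power usage effectiveness (data-center overhead multiplier);
    kg_co2_per_kwh = grid carbon intensity. Both vary widely by site,
    so real planning should use local, measured values."""
    energy_kwh = gpu_count * gpu_power_kw * hours * pue
    co2_kg = energy_kwh * kg_co2_per_kwh
    return energy_kwh, co2_kg

# Hypothetical fine-tuning run: 4 GPUs drawing 0.3 kW each for 24 hr.
energy, co2 = training_footprint(gpu_count=4, gpu_power_kw=0.3, hours=24)
print(f"{energy:.0f} kWh, {co2:.1f} kg CO2")  # → 43 kWh, 17.3 kg CO2
```

Running this estimate before an analysis makes it easy to compare, for example, fine-tuning a small model against prompting a hosted one.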
Other LLM limitations
LLMs will generalize only to the population in which they were developed. Researchers should strive for approximately equal representation of every group to which a model should generalize (Ntoutsi et al., 2020). This means continuing to emphasize the inclusion and collection of language from diverse groups. Of course, most models will not include an accurate representation of everyone. This must be acknowledged in model-description materials and research articles. This will help prevent the use of models in groups for which the model may not work or may even produce harmful results. Recently, some psychology journals have begun requiring discussion-section “generalizability statements” that are consistent with this recommendation.
LLMs may also generalize only to the situation in which they were trained (e.g., interview, cognitive task, social media). Research should cross-validate models across contexts, and models should not be applied in a new context without validation in that context. Models built from text gathered in controlled environments may not transfer to real-life settings (e.g., ambulatory recordings). These generalizability questions will be exciting future research directions.
Token limits are a current limitation in working with LLMs. Early models had relatively small token limits (e.g., 512 tokens). We have outlined ways to work with longer texts, but this is a primary area of future development. Newer LLMs have much longer token limits that could greatly facilitate LLM comprehension of longer texts. However, models with long token limits should be tested to ensure that they are in fact remembering (or properly maintaining) context across long texts. At the current time, managing token limits can be challenging, but simpler methods are likely to emerge as models continue to advance.
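One common workaround for token limits is to split a long text into overlapping chunks, score each chunk, and aggregate the chunk-level scores. A minimal sketch, using whitespace tokens as a stand-in for a real subword tokenizer and a hypothetical `score_chunk` function standing in for an LLM:

```python
from statistics import mean

def chunk_tokens(tokens, max_len=512, overlap=50):
    """Split tokens into windows of at most max_len, with `overlap`
    tokens of shared context between consecutive windows."""
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens), step)]

def score_long_text(text, score_chunk, max_len=512, overlap=50):
    """Score each chunk and aggregate by averaging. Averaging is one
    simple choice; weighting by chunk length is another."""
    tokens = text.split()  # stand-in for a real subword tokenizer
    chunks = chunk_tokens(tokens, max_len, overlap)
    return mean(score_chunk(" ".join(c)) for c in chunks)

# Toy example: a "model" that scores a chunk by its token count.
long_text = " ".join(["word"] * 1200)
avg = score_long_text(long_text, score_chunk=lambda s: len(s.split()))
```

The overlap keeps sentences that straddle a chunk boundary from losing all of their context, at the cost of scoring some tokens twice.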
Interpretability and explainability
LLMs use deep-learning techniques that can function as a black box: The massive nonlinear complexity of the algorithms and layers in these models can make their decisions indecipherable to humans. This is a problem for researchers and clinicians, who must be able to justify research conclusions and clinical decisions. As a result, researchers must do what is possible to understand how decisions are being made.
However, techniques exist to illuminate how LLMs make decisions and predictions. Attention visualization identifies how a neural network distributes its attention across the tokens available to it. The differential weights that the LLM places on input tokens while making a prediction can be rendered as a heat map or as text highlighting, showing the user which parts of the text were most important to the prediction (e.g., Jeon et al., 2024). SHapley Additive exPlanations (SHAP) is an explainable-AI technique, grounded in game theory, that attributes differential importance to the input tokens (Lundberg & Lee, 2017). SHAP values are often visualized in waterfall plots, which help researchers interpret the key token predictors of an outcome of interest. However, SHAP requires repeated evaluations of a model with different feature combinations, and LLM-based analyses often involve extremely high-dimensional inputs, so it may be feasible only in smaller-scale LLM applications.
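Full SHAP attribution requires the `shap` library and many model evaluations; a lighter-weight relative of the same idea is occlusion (leave-one-token-out) importance, sketched below. The toy scoring function standing in for an LLM is a hypothetical illustration:

```python
def occlusion_importance(tokens, score_fn, mask="[MASK]"):
    """Importance of each token = the drop in the model's score when
    that token is replaced by a mask. This is a crude approximation of
    Shapley-style attribution: it ignores token interactions, which
    SHAP accounts for by averaging over feature coalitions."""
    baseline = score_fn(tokens)
    importances = []
    for i in range(len(tokens)):
        occluded = tokens[:i] + [mask] + tokens[i + 1:]
        importances.append(baseline - score_fn(occluded))
    return importances

# Toy scorer standing in for an LLM: rate of negative-affect words.
NEGATIVE = {"hopeless", "tired", "worthless"}
score = lambda toks: sum(t in NEGATIVE for t in toks) / len(toks)

tokens = "i feel hopeless and tired".split()
imps = occlusion_importance(tokens, score)
```

The resulting importances can be displayed as text highlighting, mirroring the heat-map visualizations described above.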
LLM outputs should also be understood through traditional psychometric-validation techniques. After a model is fine-tuned, for example, it may produce a predicted score for an outcome of interest. In the future, LLM output scores should be validated just as psychological variables have been in the past, with construct validation such as convergent-, discriminant-, and criterion-validity tests (cf. Chancellor & De Choudhury, 2020; Strauss & Smith, 2009). Nomological networks of the model output should be examined (e.g., What other constructs does it predict, and what does it not predict?; cf. Cronbach & Meehl, 1955), helping researchers place LLM-based scores in the broader research literature. Reliability should be understood through tests of internal consistency and test-retest reliability (cf. Simons et al., 2024). Construct-validation techniques will provide an understanding of what LLM-based predicted scores represent just as they facilitated understanding of psychological scores in the past.
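The convergent- and discriminant-validity checks described above reduce, at their simplest, to correlating LLM-predicted scores with validated criteria. The scores below are fabricated for illustration only; the questionnaire names are hypothetical stand-ins for whatever validated measures a study uses:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: an LLM-predicted depression score, a validated
# depression questionnaire (convergent criterion), and an unrelated
# trait (discriminant check).
llm_dep   = [0.2, 0.5, 0.8, 0.4, 0.9, 0.1]
dep_quest = [3, 9, 14, 7, 16, 2]
openness  = [4.1, 3.0, 3.9, 2.5, 3.1, 4.0]

r_conv = pearson_r(llm_dep, dep_quest)  # convergent: should be high
r_disc = pearson_r(llm_dep, openness)   # discriminant: should be low
```

A full validation study would add criterion validity, test-retest reliability across repeated language samples, and a broader nomological network.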
Humans and LLM-based psychological assessments
LLMs hold promise for the automation and augmentation of assessment methods; however, results still vary by task. Each use case should be validated against human raters to evaluate model performance. For example, Schoenegger et al. (2024) compared the abilities of laypersons, psychology experts, pretrained LLMs, and a specialized AI model trained on personality data to predict correlations between personality items. Results indicated that AI models made better predictions than 85% of individual humans. However, median predictions from the whole group of psychology experts rivaled the specialized AI and outperformed those of pretrained models. This suggests that LLM performance may exceed that of most individual evaluators, yet experts collectively still hold an advantage.
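The structure of such a comparison can be sketched as follows: score a model and each individual rater by mean absolute error against a criterion, then pool the raters by taking their per-item median prediction. All values below are fabricated for illustration and do not reproduce Schoenegger et al.'s data:

```python
from statistics import median

def abs_error(preds, truth):
    """Mean absolute error of predictions against criterion values."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)

# Hypothetical criterion values (e.g., observed item correlations),
# model predictions, and three experts' predictions.
truth = [0.30, 0.55, 0.10, 0.70]
model_preds = [0.28, 0.50, 0.15, 0.66]
expert_preds = [
    [0.35, 0.60, 0.05, 0.80],
    [0.20, 0.45, 0.20, 0.60],
    [0.30, 0.55, 0.00, 0.75],
]

model_err = abs_error(model_preds, truth)
individual_errs = [abs_error(e, truth) for e in expert_preds]
# Pooled "wisdom of the crowd": per-item median of expert predictions.
crowd = [median(col) for col in zip(*expert_preds)]
crowd_err = abs_error(crowd, truth)
```

In this fabricated example the model beats most individual experts, yet the pooled expert median beats the model, mirroring the pattern reported above.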
Given the limitations outlined above, LLM-based psychological assessments should not be relied on as stand-alone assessments in clinical or applied situations without human oversight. Humans should always retain oversight and final judgment over any consequential decisions informed by an LLM. Ideally, LLM-based assessments will be administered as one tool within a battery of multiple measures. They are currently best considered a potentially helpful tool for understanding psychological phenomena.
Collaboration among psychologists, computer scientists, and others is essential for LLM tools to be as useful as possible for psychological assessment. Professionals from each area have unique insights, questions, and ways of thinking about assessment and developing research projects. Reliance on team science will also reduce the burden on any one scientist to master all cutting-edge methodologies. Interdisciplinary data-science PhD programs will be important for producing scientists who can help bridge the gap between disciplines. Productive collaborations occur when professionals from different areas come together with mutual respect and put in the time needed to work together efficiently and effectively. Although this can be a challenge, effective interdisciplinary collaboration will be necessary to develop LLM-based psychological-assessment methods that are as effective as science and medicine will need them to be.
Researchers and clinicians who administer LLMs for assessment should have proper training in their effective use. Currently, we are not aware of any official guidelines or standards. It may be fruitful to develop guidelines specifying the training that provides the foundational knowledge and essential skills needed to work effectively in this area. Furthermore, it would be especially useful for this training to give researchers the tools to continue to grow their knowledge and stay aware of the latest best practices in the field throughout their careers.
Guidelines for the development and administration of LLM-based psychological assessments may also be helpful. Organizations such as the APA have provided resources and updates on policy for AI generally (APA, 2023). Researchers have also published useful guidance about ethical use of LLMs in science (Parker et al., 2023; Watkins, 2024). These protocols may include standards on transparency, data collection, management, privacy and security, bias mitigation, generalizability, training, and deployment. Organizations that may help develop these standards include research organizations, institutions, professional associations, publishers, and advocacy groups. Guidelines may help promote responsible practices and reduce potential harms. Just as psychologists follow the PRISMA guidelines when conducting meta-analytic reviews (Page et al., 2021), analogous guidelines could structure LLM-based assessment research. Beyond universal best practices, we emphasize the importance of flexible guidelines that account for the unique context of model development.
Future directions
There is growing evidence that multimodal models, which incorporate more than just language, improve predictive utility (Morales et al., 2018). It will be fruitful to pair LLMs with standard psychological assessments and other technologies to examine unique and combined predictive power across features (e.g., Harari et al., 2017; Jacobson & Bhattacharya, 2022). The transformer model can also be used with nonlanguage predictors (Wang & Sun, 2022). In the future, modeling features from video recordings in tandem with traditional psychological assessments may provide a more holistic assessment of a person.
LLMs will soon have longer context attention; better strategies to mitigate bias; better regulatory standards, guidelines, and available training; and better techniques for model explainability and interpretability, security, validation, and access. Alternative model architectures, such as state-space models, have already rivaled the transformer model in NLP (Gu & Dao, 2023). It is important for researchers to stay informed of these developments. We recommend following journals, new books, podcasts, and online courses and webinars; attending conferences; and maintaining communication with interdisciplinary collaborators. Commitment to rigorous methodology, such as the collection of high-quality data, including well-validated assessments with useful and targeted language samples across diverse populations, is also imperative. Consortiums that bring together researchers with similar interests in specific LLM applications may be useful for enhancing data size and diversity.
Conclusions
LLMs offer important advantages compared with traditional psychometric approaches such as the self-report questionnaire. These include their behavioral nature, scalability, and allowance for a broader range of response possibilities. Language assessments can be derived from routine tasks or in naturalistic environments using smartphones. Despite these potential advances, the technology carries significant risks and biases. Psychologists must be aware of the biases in LLMs and of ways to mitigate them.
The purpose of this overview is to provide accessible guidance on a novel and complex methodology. Despite rapid advances, relatively little is known about using LLMs for psychological assessment. Although a growing number of high-quality studies are emerging, many face limitations related to sample size, diversity, language data types, or psychological measurement. We encourage psychologists to strive for strong psychometric, methodological, and interdisciplinary contributions in the evolving area of using LLMs for psychological assessment. We hope this article helps promote them.
Supplemental Material
Supplemental material, sj-docx-1-amp-10.1177_25152459251343582, for “Large Language Models for Psychological Assessment: A Comprehensive Overview” by Jocelyn Brickman, Mehak Gupta, and Joshua R. Oltmanns in Advances in Methods and Practices in Psychological Science.