Abstract
Keywords
Introduction
This article unpacks the relationship between commitments of qualitative inquiry and the architecture and capabilities of Generative AI (GenAI) models to explore a promising space of possibilities. We focus on Large Language Models (LLMs) that are usable out-of-the-box without programming or fine-tuning as these offer most immediately accessible opportunities for researchers to strengthen core commitments of, and address longstanding challenges in, qualitative analysis.
At its core, qualitative analysis involves closely examining large amounts of rich data to generate insight, a notoriously time-consuming process, while LLMs offer powerful pattern recognition capabilities. An initial impetus might thus be to see how AI could speed up aspects of qualitative methods to make the process more efficient. This has been a central focus of the discourse to date (as documented in Paulus et al., 2025), with most work using LLMs to speed up thematic analysis of a dataset (e.g. De Paoli, 2024; Rientes et al., 2025; Yan et al., 2024) or more rapidly develop a codebook for deductive application (Barany et al., 2024; Gao et al., 2024).1 In both cases, important elements of interpretive judgment are ceded to AI (Paulus et al., 2025). In addition to study-specific uses of commercially available applications,2 there have also been early efforts to develop prompt frameworks (Zhang et al., 2023), analysis workflows (Bakharia et al., 2025; Rao et al., 2024), and LLM-powered tools (Lin et al., 2025) specifically tailored to qualitative research. The emphasis here is on fidelity to original sources, meaningful construction of analytic categories and assignment of data to them, completeness of analysis, and reproducibility of results.
While the majority of this work adopts a “human-in-the-loop” rather than fully automated approach,3 LLM use remains largely procedural, offering limited critical engagement with key conceptual dimensions of qualitative inquiry such as researcher positionality, relationship to the data and its generation, iterative refinement of questions as well as interpretations, and dialogue with theory. This may explain why, to date, the response of many qualitative researchers has ranged from marked reservation to outright rejection (e.g. Jowsey et al., 2025).
In this article, we offer a reframing, arguing that GenAI’s transformative potential for qualitative analysis lies not in automating tasks for speed or scale, but in supporting deeper engagement with core commitments of qualitative research and addressing persistent challenges in enacting them. We explore how fundamental characteristics of LLMs and defining principles of qualitative analysis offer productive overlaps, enabling new approaches for current practice and future inquiry. We begin by outlining some core commitments and ongoing challenges in qualitative analysis. We then provide a brief overview of LLMs and unpack key characteristics that can support efforts to meet these commitments. Finally, we provide an illustrative example of how LLMs can be mobilized to support robust qualitative analysis and outline initial expectations for conducting AI-in-the-loop analysis in ways that strengthen, rather than undermine, trustworthiness.
Qualitative Analysis: Key Commitments and Persistent Challenges
There are many variations of qualitative research; here we follow the tradition of Guba and Lincoln (1989), which is grounded in a
Qualitative analyses of this kind share a set of epistemological and ontological commitments concerned with the centrality of the human researcher as an important element of analysis
One aspect of subjectivity involves the researchers’ knowledge of and immersion in the context where data is collected. Understanding what is going on
Rather than seeing subjectivities as confounds to high-quality research, qualitative analysis emphasizes the ways they support a rich understanding of the complexity of contexts. This means it is possible to develop more than one valid understanding of the data, and that such variance can be both valuable and useful. As Madill et al. (2000) note, “the goal of triangulation is completeness not convergence….two models [resulting from analysis] demonstrate how researchers can provide complementary pictures of a phenomenon. The models are not incompatible but allow us to view the experience of participants from two different perspectives, both of which are justifiable” (p. 12). This is not to say that all analyses are seen as equally valid, or that any interpretation is reasonable. The warrants for claims, and the emphasis on interrogating one’s own subjectivity and how it influences what is seen in the data, are essential components of rigorous qualitative analysis (Greene, 2014).
To summarize, the key commitments of qualitative analysis described above are: (i) close attention to the details of the data in multiple iterations that results in layering of meaning; (ii) immersion and personal history with the data to attend to any piece of it with the larger context in mind; (iii) attention to the larger context in which the data was collected so that the meaning of utterances is understood even if not directly captured in the data; (iv) positionality with respect to researchers’ lived experiences and associated perceptions to support ‘noticing’ and conceptualizing different insights and interpretations; and (v) including multiple researchers in conversation with the data and with each other in ways that allow different interpretations to emerge.
Enacting these commitments is central to the qualitative analysis process, but not without its challenges. Completing multiple iterative rounds of analysis of the full set of data (i) is time consuming, so in practice researchers often focus on subsets of the data that seem of particular interest. Likewise, keeping
Properties of LLMs that Support Efforts to Meet These Commitments
A Brief Overview of Large Language Models (LLMs)
LLMs are initially built by learning information patterns through a pre-training process that spans trillions of tokens of text.
When used by a researcher to analyze data (e.g. by sending a message or “prompt” to the model), the model breaks down the submitted text into tokens (words, subwords and special markers5) and represents them in its working memory, generally referred to as its “context” (referred to here as “model context” for clarity; Zhao et al., 2023). The model context cumulatively includes everything the researcher submits to the model as well as all of the model’s responses. Maximum model context size, which determines how much information the model can process at once, currently ranges from roughly a hundred thousand to around a million tokens. For example, a researcher might submit a transcript of a classroom interaction and the request to identify all potential instances of scientific engagement. Each token (from the transcript
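To give a concrete sense of what tokenization means in practice, the sketch below counts the tokens in a set of study documents to check whether they would fit within a given model context window. It is a minimal Python illustration assuming OpenAI’s open-source tiktoken tokenizer; the file names are hypothetical, and other providers tokenize somewhat differently, so counts are approximate.

```python
# A minimal sketch, assuming the open-source `tiktoken` package is installed.
# File names are hypothetical; token counts vary by provider and encoding.
import tiktoken


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens the chosen encoding produces for `text`."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


corpus_files = [
    "class_transcript_week1.txt",
    "teacher_interview.txt",
    "researcher_memos.txt",
]

total = 0
for path in corpus_files:
    with open(path, encoding="utf-8") as f:
        n = count_tokens(f.read())
    print(f"{path}: {n} tokens")
    total += n

# Compare against a model's advertised context window (e.g., 128,000 tokens).
print(f"Total corpus: {total} tokens; fits in a 128k context: {total < 128_000}")
```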
The positioning of each token in the embedding space occurs through an iterative process informed by its intrinsic meaning, the model’s learned knowledge from training, and its relation to all other preceding tokens through a process called “attention” (Vaswani et al., 2017). For instance, when analyzing a classroom transcript, the model would recognize that the word “challenging” needs to be positioned differently in the space depending on whether it appears in a positive context (“rewarding but challenging”) or a negative one (“too challenging”). The ability to distinguish such nuanced meanings helps the model respond to questions about the text in ways that reflect the subtle differences in how words are used in different situations.
In producing a response to the researcher’s request, the model generates tokens one at a time in an auto-regressive manner, with each new token selected based on probabilities computed over all preceding tokens in the model context.
Key LLM Properties with Relevance for Qualitative Analysis
Having briefly outlined how LLMs function in use, we now highlight key properties of LLMs that offer promise for addressing challenges and realizing opportunities for qualitative analysis.
Large-Scale Pre-Training
LLMs incorporate broad conceptual representations learned during training across trillions of tokens. Rather than losing information through approximation or averaging, the training process creates and refines a representational space capable of encoding varied information richly, including multiple languages, viewpoints, cultural understandings, and nuances of meaning.
Large-scale pre-training situates analysis processes within broader socio-cultural contexts learned by the model, supporting contextually grounded interpretation of an utterance’s meaning, even when it is not explicitly present in the data, by drawing on similar utterances encountered during training (commitment iii).6 In addition, researchers can elicit particular perspectives, stances or theories represented in the model to support surfacing different kinds of insights and interpretations (commitment iv) and put such perspectives or the insights arising from them in conversation with each other (commitment v).7
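As one illustration of eliciting different perspectives, the sketch below sends the same excerpt to a model under two different theoretical stances and prints the resulting interpretations side by side. It assumes the OpenAI Python SDK; the model name, excerpt, and stances are illustrative only, and any chat-capable LLM could be substituted.

```python
# A minimal sketch using the OpenAI Python SDK; assumes OPENAI_API_KEY is set.
# The model name, excerpt, and analytic stances are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

excerpt = "Student: But if we don't water it, how does the seed even know to grow?"

perspectives = {
    "sociocultural": (
        "Interpret this classroom excerpt through a sociocultural lens, attending "
        "to how meaning is negotiated between participants."
    ),
    "conceptual-change": (
        "Interpret this classroom excerpt through a conceptual-change lens, attending "
        "to the student's existing ideas about plant growth."
    ),
}

for name, stance in perspectives.items():
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": stance},
            {"role": "user", "content": f"Excerpt:\n{excerpt}\n\nOffer a brief interpretation."},
        ],
    )
    print(f"--- {name} ---")
    print(response.choices[0].message.content)
```

The two outputs can then be compared, or fed back to the model, as a way of putting perspectives in conversation with each other.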
Attention Mechanisms
LLMs continually reinterpret the elements of the model context (i.e. adjust the representation of tokens) in consideration of all other preceding tokens through attention processes as new information is added to the model context. In this process, all parts are recontextualized (all tokens are re-represented) based on the changes the new information brings.
This enables the model to focus on specific details within the data while maintaining attention to the broader context (commitment ii). Additionally, the model can process the data iteratively, continuously refining its representations as it integrates new information and researchers’ deepening reflections (e.g. entered as memos or used to frame prompts), allowing for the layering of meaning over time (commitment i). This is useful for surfacing subtle patterns, recurring themes, and contrasting details that can support nuanced interpretations.
Rich Embeddings
LLMs iteratively use attention processes and the model’s learned knowledge from training to represent words, sentences and documents in the embedding space in a way that captures semantic relationships as distances and directions between tokens in that space.
This provides a foundation for the analysis of themes in the data, represented as both large- and small-scale patterns in the information’s representation, which can be explored in multiple iterations (commitment i) and from multiple perspectives (commitment iv). In addition, when working in partnership with data scientists these embeddings can also be examined directly to probe the organization of the underlying data representation.
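For researchers collaborating with data scientists, the sketch below shows one way embeddings can be examined directly: computing pairwise cosine similarities between excerpts to probe which pieces of talk the representation places close together. It assumes the open-source sentence-transformers package; the model choice and excerpts are illustrative and not drawn from any particular study.

```python
# A minimal sketch, assuming the `sentence-transformers` package is installed.
# Model choice and excerpts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

excerpts = [
    "The roots pull water up so the leaves can make food.",
    "Plants drink through their roots and use sunlight in the leaves.",
    "We measured how tall the bean plant grew each day.",
]

# Encode each excerpt and compute pairwise cosine similarities.
embeddings = model.encode(excerpts, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)

for i in range(len(excerpts)):
    for j in range(i + 1, len(excerpts)):
        score = float(similarities[i][j])
        print(f"{score:.2f}  |  {excerpts[i][:40]}...  <->  {excerpts[j][:40]}...")
```

Higher scores flag excerpts the representation treats as semantically close, which a researcher can then examine as candidate members of a theme.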
Auto-Regressive Nature
When composing a response to user input, LLMs generate tokens one at a time using a set of probabilities based on all prior tokens (i.e., the model context built up from all researcher inputs and the model’s previous responses). Model “temperature” is a parameter that controls the variability of the model’s output, adjusting between more deterministic (most probable tokens favored) and more creative (a wider range of token probabilities is sampled from) responses.8
LLMs can produce varied responses to the same input, recombining and reframing all the information in their model context to offer multiple alternative interpretations. This allows for iterative and dynamic engagement with the data, enabling varied interpretations to emerge and be put in conversation with each other (commitments i and v). Adjusting the temperature, in particular, adds flexibility to analysis; while lower settings reinforce consistency, higher settings can help surface alternative interpretations, akin to how different researchers notice distinct aspects of the data (commitment iv).
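As a small illustration of how temperature can be varied in practice, the sketch below re-runs the same interpretive prompt at a lower and a higher temperature setting. It assumes the OpenAI Python SDK; the model name, excerpt, and settings are illustrative, and the same idea applies to temperature controls exposed by other tools.

```python
# A minimal sketch using the OpenAI Python SDK; assumes OPENAI_API_KEY is set.
# Model name, prompt, and temperature values are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Here is a short classroom excerpt:\n"
    "'Teacher: What do you think the seed needs first?'\n"
    "'Student: Maybe it just needs to be left alone in the dark.'\n\n"
    "Suggest one possible interpretation of the student's thinking."
)

for temperature in (0.2, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)
```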
Long-Context Capabilities
LLMs now support the analysis of multiple long documents and provide the capability for coding them interactively (in-context learning), allowing for rapid iteration of nuanced analyses. For example, at the time of writing, common model context sizes range from 128,000 tokens (OpenAI’s GPT-5 via ChatGPT) to 1 million (Google’s Gemini 3), providing the ability to hold roughly 300 to 2,500 pages of text in memory at once. Multimodal models, such as Gemini, that accept video as input can currently process about an hour of video at a time, with expectations that this capacity will be expanded.
It is possible for a LLM to hold the entire textual corpus of data collected in a study (e.g. classroom and interview transcripts, researcher memos, etc.) in its model context at once. This supports identification of potential patterns, often subtle and complex, across a large and diverse collection of data that might be challenging for humans to notice manually. It also allows researchers to probe and unpack patterns noticed in one part of the dataset while keeping the larger context of the full data set in mind (commitment ii). In addition, the flexibility to iteratively explore data through both small adjustments and substantial shifts in perspective supports repeated engagement with the entire dataset across multiple passes, helping researchers evolve, refine, and deepen their interpretations over time (commitment i).
Prompting LLMs to Meet the Commitments of Qualitative Analysis
The properties described above are leveraged when researchers engage with the models, which occurs through prompting: crafting the text submitted to the model so as to shape its behavior and responses.
Other prompting techniques are useful for encouraging close attention to the details of the data (i). For example,
While prompting can refer to any text sent to an LLM, a special kind of prompt is the system instruction: a directive given at the start of interacting with a model that sets its overall behavior, tone, or role throughout the interaction. In qualitative analysis, system instructions can guide the model to maintain a consistent analytic posture throughout the process. For example, a system instruction might direct the LLM to focus closely on linking claims to supporting evidence (e.g. “You are a rigorous qualitative researcher. When suggesting interpretations, you always ground them in direct excerpts and clearly explain how each piece of data supports the claim”). System instructions should not be used for elements of the analysis that need to remain flexible or evolve throughout the process.
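The sketch below illustrates how such a system instruction can be held constant across an ongoing analytic conversation when working through an API rather than a chat interface; the model name, excerpts, and wording are illustrative assumptions, not a prescribed setup.

```python
# A minimal sketch using the OpenAI Python SDK; assumes OPENAI_API_KEY is set.
# The system instruction, model name, and researcher turns are illustrative.
from openai import OpenAI

client = OpenAI()

# The system instruction establishes the analytic posture for the whole session.
messages = [
    {
        "role": "system",
        "content": (
            "You are a rigorous qualitative researcher. When suggesting "
            "interpretations, always ground them in direct excerpts and clearly "
            "explain how each piece of data supports the claim."
        ),
    }
]


def ask(user_text: str) -> str:
    """Append a researcher turn, get the model's reply, and keep both in the context."""
    messages.append({"role": "user", "content": user_text})
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=messages,
    )
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


print(ask("Here is a transcript excerpt: ... What forms of science talk do you notice?"))
print(ask("Which of those interpretations has the weakest evidentiary support, and why?"))
```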
Mobilizing LLM Properties for Robust Qualitative Analysis
There are many ways that qualitative researchers can thoughtfully make use of LLMs as part of an AI-in-the-loop analytic process, while also staying true to core commitments of qualitative analysis. We frame such activities as
Trustworthiness
Although the field of qualitative analysis encompasses a range of perspectives on indicators of analytic quality, we focus here on the criteria that are most consistent with the relativist paradigm (Guba & Lincoln, 1989; Lincoln & Guba, 1985): credibility, dependability, confirmability, transferability, and authenticity. Briefly,
Using AI to Support Trustworthy Qualitative Analysis
As a means of illustrating an LLM-supported qualitative analysis workflow, we offer a hypothetical example of studying science talk in an elementary school classroom. The primary data are transcripts of whole-class conversations from the class’s science block, which took place over two months and focused on the life cycle of plants. Data from whole-class discussions were supplemented with reflective interviews with the teacher and each of the children, as well as with researcher observational memos. The overarching research question addressed is:
The LLM-supported analysis workflow begins by establishing the model context with all relevant information available for analysis. Depending on the AI tool used, this might involve initiating a persistent chat session that can be returned to later, or creating a project space to store chats, files, and system instructions together. For research data, it is also essential to choose a tool that meets the security and privacy requirements of the study. Relevant information includes the body of data collected through a study, as described above, and potentially also additional documents for contextualization. For example, researchers’ personal history with the data might be represented in positionality statements and an overview of the data collection schedule, while information about school and classroom histories and the context in which the data were collected could include demographics, documentation of the teacher–researcher collaboration, and details about the curriculum in use (e.g., Plants in Action).
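The sketch below illustrates one way the model context might be assembled when working programmatically: concatenating the study corpus and contextual documents into a single initial prompt. The file names are hypothetical; in many tools the equivalent step is simply uploading these documents to a persistent chat or project space.

```python
# A minimal sketch of assembling a study's corpus and contextual documents.
# File names are hypothetical and stand in for the materials described above.
from pathlib import Path

context_files = [
    "positionality_statements.md",
    "data_collection_schedule.md",
    "classroom_context_and_demographics.md",
    "curriculum_overview_plants_in_action.md",
    "whole_class_discussion_transcripts.txt",
    "teacher_and_student_interviews.txt",
    "researcher_observational_memos.txt",
]

sections = []
for name in context_files:
    text = Path(name).read_text(encoding="utf-8")
    sections.append(f"===== {name} =====\n{text}")

# A single opening message that establishes the model context for later analytic prompts.
initial_prompt = (
    "The following documents are the full dataset and surrounding context for a study "
    "of science talk in an elementary classroom. Read them carefully; subsequent "
    "messages will pose analytic questions about them.\n\n" + "\n\n".join(sections)
)
print(f"Assembled context: {len(initial_prompt)} characters across {len(context_files)} documents")
```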
Capitalizing on the model’s
Once model context is established,
At this point, the LLM-supported workflow has surfaced a set of potentially interesting excerpts or patterns from a large dataset for further investigation, initial ‘needles in the haystack’ chosen in relation to the broader data, suggesting the potential beginnings of meaningful themes. Such patterns can be explored in multiple ways, setting the stage for
Establishing grounds for confirmability requires thinking carefully about how the personas and tasks brought to bear might reveal or hide aspects of the data, and explaining those decisions and their iterative output fully in the findings. For example, the methods section might include details of the iterative prompting that was used and the different kinds of excerpts that were identified based on those prompts. These rounds of analysis also offer an additional mechanism for probing confirmability beyond what is typically feasible in qualitative research (e.g. triangulation across researchers, member checking with participants). Here, the model can be prompted to review the full dataset to surface both confirming and disconfirming instances for emergent conjectures, thereby allowing the researcher to more robustly explore and stress-test findings.
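As one illustration of such a confirm/disconfirm pass, the sketch below sends a previously assembled dataset together with an audit prompt asking for both supporting and complicating excerpts for a stated conjecture. It assumes the OpenAI Python SDK; the conjecture, file name, and model are illustrative assumptions only.

```python
# A minimal sketch of prompting for confirming and disconfirming instances.
# Assumes OPENAI_API_KEY is set; file name, conjecture, and model are illustrative.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Hypothetical file containing the assembled study corpus and contextual documents.
dataset_text = Path("assembled_study_context.txt").read_text(encoding="utf-8")

conjecture = (
    "Students' science talk shifts from describing what plants look like to "
    "reasoning about what plants need as the unit progresses."
)

audit_prompt = (
    "Working only from the dataset provided above, identify excerpts that CONFIRM "
    "the following conjecture and excerpts that DISCONFIRM or complicate it.\n\n"
    f"Conjecture: {conjecture}\n\n"
    "Quote each excerpt verbatim, note where in the dataset it occurs, and briefly "
    "explain why it supports or challenges the conjecture."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "user", "content": dataset_text},
        {"role": "user", "content": audit_prompt},
    ],
)
print(response.choices[0].message.content)
```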
Such cycles of prompting further enable researchers to take seriously the importance of
Conclusion
While GenAI is often seen as a tool for automating or expediting analysis, here we have explored its deeper potential to engage meaningfully with the core commitments of qualitative inquiry and help address long-standing challenges in enacting them. However, while AI-in-the-loop analysis offers new possibilities, it also demands standards for rigor. We propose as a starting point that analyses which neither seriously engage with the foundations of qualitative research nor document their processes for doing so are generally not high-quality. Researchers can provide evidence for how their analysis meets criteria for trustworthiness, such as credibility, dependability, confirmability, transferability, and authenticity, through both established and emerging practices. For example, searching for confirming and disconfirming instances is a powerful means of supporting confirmability that can be carried out much more extensively with AI. Novel approaches for meeting existing trustworthiness criteria are also possible, such as an AI-supported temporal audit that examines how the qualities of a theme evolve over time to strengthen dependability. In addition, expanded criteria for trustworthiness may be needed, particularly
We close by drawing attention to persistent evidence that GenAI systems reflect the dominant cultural perspectives embedded in their training data and can reproduce problematic human biases even after explicit training not to (Bai et al., 2025; Hofman et al., 2024). A core strength of qualitative analysis is its intentional engagement with subjectivity. Extending this stance into work with GenAI offers a powerful means to surface and scrutinize assumptions that shape model outputs. Just as qualitative researchers interrogate their own positionality, so too must we interrogate, and work to reorient, the cultural logics embedded in AI models. When engaged critically, GenAI can also be used to support this reflexive work, for example by being prompted to adopt explicitly critical, bias-aware analytic stances and to make patterned forms of bias in analytic outputs visible. Researchers can also prompt GenAI to help them reflect on their own assumptions by posing questions relevant to the research site, data, and their personal histories, enabling deeper interrogation of positionality throughout the analytic process. By taking shared responsibility for how AI is used in analysis, qualitative researchers and AI developers can work together to advance “human + AI” analytic practices that support the development of trustworthy, context-aware insights from large-scale data.
Footnotes
Acknowledgments
Generative AI (ChatGPT, GPT-5) was used to refine wording and formatting at the sentence level or below; all ideas and arguments presented are the authors’ own.
Ethical Considerations
As this article is a conceptual contribution, ethical approval was not required.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
