Introduction
In February 2023, Dr Cesare Aloisi, Head of Research and Development (R&D) at AQA, one of England’s largest assessment organisations and exam boards (see Appendix A), was preparing for an upcoming presentation on the ethics of using artificial intelligence (AI) to mark students’ essay-based exams. The central theme of his presentation was going to be how to transition ethically from the paradigm of
AIEd: promising far-reaching solutions
Interest in AI in education (AIEd) was not new. In fact, it dated back at least to the 1960s. Over the years, the possibilities that this technology presented were so enticing that by the late 2010s and early 2020s, there was growing belief that AIEd could bring substantial benefit to education and that it had the potential to transform the educational landscape worldwide (Nguyen et al., 2022). In the UK, organisations such as JISC (an organisation that focussed on digital transformation of tertiary education) and NESTA (an innovation hub) shared the belief that the use of AI could be of great benefit to education, if used correctly. They saw AIEd as having the potential to reduce teacher workload, improve consistency in marking, provide wide-scale personalised learning, and ensure greater consistency in the quality of learning provided by schools and other educational institutions across the UK (JISC, 2022; Baker et al., 2019).
JISC (2022) maintained that in the education system the impact of AIEd could be ‘transformational’. AI could both extend capacity, by automating certain functions, and increase capability, by augmenting others (see Appendix B). Thus, there could be instances of automated marking (machines marking humans) and augmented marking (humans and machines jointly marking humans). This offered the opportunity to harness the automation-augmentation paradox, with automation and augmentation co-existing, rather than being a trade-off between the two (Raisch and Krakowski, 2020). JISC had developed a model of AI maturity (see Appendix C) illustrating the potential impact of AIEd at different levels of maturity. At the transformational level, JISC believed that AI would free educators from routine administrative tasks and allow them to focus on engaging learners, and allow learners to have a fully personalised learning experience.
What are AI, AIEd and AES
AI was not ‘one single thing’; there were many techniques and applications that together were commonly grouped as AI – for example, Deep Learning and Natural Language Processing (see, for example, Jaffri, 2022). Many AI tools in education (AIEd tools), including Automated Essay Scoring (AES) tools, used a mixture of non-AI, rule-based statistical features and deep-learning algorithms and frameworks (e.g. PyTorch, the Hugging Face framework, and Transformer models such as Longformer). Exhibit 1 below illustrates some of the mixture of statistical features and deep-learning algorithms in AES.
Statistical features and deep-learning algorithms in AES (simplified). Source: Adapted from Fischer et al. (2021).
There were three broad areas in which AI was being used in education: (1) System-facing AI, providing information for managers and administrators; (2) Learner-facing AI, interacting with learners on an adaptive basis, with the aim of personalising the learning for each learner; (3) Teacher-facing AI, seeking to reduce teacher workload by automating tasks such as marking and assessment, detection of plagiarism and provision of feedback, as well as providing insights about learner progress and helping teachers to experiment with different methods of teaching based on the AI-generated insights (Baker et al., 2019).
As AQA focussed on setting and marking of exams, the area of AIEd which interested Aloisi most was the teacher-facing activity, and in particular the field of automated essay scoring (AES). In this field, because AI could not get tired or bored, AI promised to increase grading consistency. AES also had the potential to prevent the ‘tick and flick’ approach, where the level and detail of feedback that markers gave became less and less as the number of papers marked increased (Lewis, 2013, p.189). Aloisi and his colleagues at AQA noted that AES should not be seen as a homogeneous construct: two of the aspects worth considering were low-stakes versus high-stakes assessments; and short-text responses, used for example for language tests, versus longer-text responses, which required demonstration of both linguistic and substantive knowledge (see Exhibit 2).
AES, a multi-dimensional construct. Source: Developed by Fischer and Aloisi for the purpose of this study.
Grading writing quality
AES systems had been around for a long time, with the first being developed in 1966. Project Essay Grade (PEG), as the system was known, had been developed to enable the College Board, an organisation based in the US that developed and administered thousands of standardised tests, to streamline and speed up its essay scoring process (Dikli, 2006). PEG sought to grade the quality of the writing by looking for characteristics that were predictive of writing quality, such as essay length, diction, fluency, grammar and sentence construction. An experiment conducted in 1999 to test the accuracy of PEG concluded that it performed at least as well as human markers, and that it was extremely efficient, being able to grade approximately six documents per second (Shermis et al., 1999). The authors of the report concluded: ‘The initial applications of automated text graders will be to provide assistance in the summative evaluation of written work. However, the automated text grading has its greatest potential in providing students with formative feedback about areas of strength and weakness’ (p.7).
By 2023 the field of automated essay scoring and formative writing feedback had exploded. Advances in natural language processing (NLP) meant that besides Large Language Models (LLMs), such as ChatGPT (GPT = Generative Pre-trained Transformer), there were now numerous AI tools that could be used for formative writing feedback. Among them were Grammarly, MI Write, Feedback Fruits, Turnitin and Quill. Grammarly claimed that every day 30 million people and 50,000 teams around the world made use of its products (Grammarly, n.d.). Quill was being used by around 123,000 teachers in 28,000 schools (Quill, n.d.). Turnitin, which started out as a tool to help students and teachers identify plagiarism, had evolved and was now also used to provide writing feedback. Turnitin’s products were being used by more than 34 million learners in more than 15,000 school and tertiary institutions across the world (Turnitin, 2019). In the US, one state used PEG as its sole method for providing state summative writing assessments and the system was being used for formative writing assessments in 1000 schools and 3000 public libraries across the US. The digital learning company, Pearson, which had also been using automated scoring since the 1990s and owned Intelligent Essay Assessor (IEA), maintained that as early as 2010, IEA had been used to score millions of essays written by learners in grades 4 to 12 and in tertiary education. Pearson believed that IEA could be used in high-stakes exams, to provide a second opinion and to provide formative evaluations (Pearson, 2010).
The accuracy of AES
As Aloisi investigated AES, one of his key initial concerns was that of accuracy. Unlike human markers, AES systems did not evaluate the intrinsic qualities of an essay. Instead
‘In concept, a functioning model replicates the scores that would have been provided by all the human raters used in the calibration essay. Thus, a functioning model should be more accurate than the usual one or two human raters who typically assign scores’, observed Rudner et al. (2006 p.18). ‘The issue, however, is how one defines a validated functioning model... One never knows if the human or computer is more accurate. Nevertheless, one should expect the automated essay scoring models and humans raters to substantially agree and one should expect high correlations between machine and human-produced scores’.
Before approving the move to an AES product called Intellimetric, GMAC (Graduate Management Admission Council) had conducted research to assure itself that the tool would ‘reasonably approximate’ the scores of human markers. The evaluation had found the system to be ‘extremely effective’, and that it was even able to identify papers where cheating had occurred (Rudner, 2005). The agreement between Intellimetric and the human markers was very similar to that between two human markers – being identical or within one point of each other 97% of the time and identical 55% of the time (Kaplan, n.d.). The results of several other AES studies had also reported high agreement rates between AES systems and human assessors (e.g. Lewis, 2013 and Dikli, 2006), but of course, correlation between marks does not necessarily mean there is causation (Christodoulou, 2023).
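The agreement figures quoted above (identical scores, and scores identical or within one point) are often called exact and adjacent agreement. As a minimal sketch of how such rates can be computed, the Python below uses a hypothetical helper, `agreement_rates`, and made-up scores on an assumed 0 to 6 scale; it is not the procedure used in the GMAC evaluation.

```python
from typing import Sequence, Tuple


def agreement_rates(machine: Sequence[int], human: Sequence[int]) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement between two lists of integer scores.

    exact    = share of essays where both scores are identical
    adjacent = share where the scores are identical or differ by one point
    """
    if len(machine) != len(human) or not machine:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(machine)
    exact = sum(m == h for m, h in zip(machine, human)) / n
    adjacent = sum(abs(m - h) <= 1 for m, h in zip(machine, human)) / n
    return exact, adjacent


# Made-up scores for illustration only (assumed 0-6 scale).
machine_scores = [4, 3, 5, 2, 6, 4, 3, 1]
human_scores = [4, 4, 5, 2, 4, 3, 3, 1]

exact, adjacent = agreement_rates(machine_scores, human_scores)
print(f"exact agreement: {exact:.3f}, adjacent agreement: {adjacent:.3f}")
```

High values on both measures show only that the two sets of marks move together; they do not show that the machine and the human markers reward the same qualities in an essay.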
Doubts and limitations on accuracy
Marjanovic and Cecez-Kecmanovic (2017) and Galliers et al. (2017) pointed out that there were a number of limitations associated with algorithmic decision-making. These included the following:
• De-contextualisation: Data taken out of original context and then propagated and used in other contexts;
• Recombination: Creation of new data/information through re-combination of de-contextualised data from other sources;
• Using quantified proxies: Using quantified data as proxy measures for complex phenomena;
• Gaming: Strategic and selective collection and use of data in pursuit of individual goals;
• Propagation of legitimation: Legitimacy of inferred information based on legitimacy of original data;
• Auditing by non-experts: Non-experts using open performance data judge the quality of complex expert activities;
• Amplified performativity: Data used to amplify impact of measures on what is being measured.
One of the above limitations that called into question the accuracy of AES was using quantified proxies. AES algorithms were frequently trained to identify words, phrases and patterns that were characteristic of stronger or weaker answers. They did not actually understand the essay that they were scoring. This raised the potential for users to mistrust the system and for the system to make mistakes. Indeed, at least two studies had shown that it was possible to trick certain AES systems by using a lot of big, but meaningless words (Lewis, 2013; Feathers, 2019). Other studies had found that even when as much as 20% of the content of an essay was changed, the AES score remained the same. On the other hand, simply adding three words to a 350-word essay could increase the AES score by an absolute 50% (Singla et al., 2021).
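To make the proxy problem concrete, the following toy sketch scores an essay using only two surface features, word count and mean word length. The function `proxy_score`, its weights and the sample sentences are all assumptions made for illustration; no real AES system is being reproduced here.

```python
def proxy_score(essay: str) -> float:
    """Toy scorer built only on surface proxies: word count and mean word length.

    The weights (0.01 and 0.5) are arbitrary assumptions, not taken from any real system.
    """
    words = essay.split()
    if not words:
        return 0.0
    mean_word_length = sum(len(word) for word in words) / len(words)
    return 0.01 * len(words) + 0.5 * mean_word_length


# A plain answer, and the same answer padded with long words that add no meaning.
honest = "The war began because alliances pulled many nations into one local dispute"
padded = honest + " notwithstanding multitudinous incontrovertible perspicacious"

print(f"honest: {proxy_score(honest):.3f}")
print(f"padded: {proxy_score(padded):.3f}")
```

Because the scorer never looks at meaning, the padded answer receives the higher score. This is the kind of weakness that the gaming studies cited above exploited, although real systems use far richer features than this sketch.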
Christodoulou (2023) observed that once students know that AI is marking their essays, they want to know what it rewards and how it does so. They then try to game the system. She added: ‘This, essentially, is the problem with AI marking. It’s easy for it to be more consistent than humans, because humans are not great at being consistent, but whilst humans might not be consistent, they can’t be fooled by tricks’.
For Aloisi, accuracy was especially important because of the grade boundaries in the high-stakes exams that AQA was involved in setting and marking. ‘In a system like we have in the UK, where you have grade boundaries, one mark can make the difference between one grade and another grade’, he said. He believed that it might be too much of a risk even to use an automarker as second marker in such high-stakes assessments.
Explainability
But accuracy was not his only concern in the early days of his research. One of his other main concerns related to the ethics of AES. ‘Suppose that you have AI that is so good that it’s indistinguishable from a human marker, what would we want to see to be able to trust it’? Aloisi asked. ‘What emerged was explainability – can AI tell you why it gave a certain mark? The answer at the time was that it couldn’t’.
This concern related to the ‘black box’ nature of AI, where the complex algorithms used in machine learning meant that the systems could arrive at conclusions that may agree with human conclusions, but were nevertheless unexplainable. Thus, in the case of AES, this made it difficult for humans to understand how these systems arrived at their conclusions. Even their creators sometimes found it hard to predict the conclusions that their systems would reach (Baker et al., 2019).
‘Explainability is important for trust, because it gives you a sense that the system that you are interacting with is looking for more than superficial correlations: that it is capable of understanding some deeper meaning’, said Aloisi. When thinking about trust, Aloisi used the ABI+ model of trust, a model combining concepts of Ability, Benevolence, Integrity and Predictability to analyse the trustworthiness of a system (Aloisi, 2023). ‘You want to know that the AI is aligned with you’, he said. ‘But also that it has the ability to evaluate you. If it’s given you a mark, you want to know that the mark is based on some sort of academic judgement, as opposed to how many words you wrote or some other superficial thing. Trust is about making yourself vulnerable to someone else, because you think that person has your best interest at heart. If you have a system that cannot tell you why it gave you a certain score, to me, it’s harder to claim that it’s a trustworthy system’.
The reason why the human creators of AES systems could not explain how their systems arrived at their conclusions, noted Aloisi, was that ‘the human will be able to tell you what the architecture is like, but they are not programming a set of rules. The system is designed in such a way that it can infer the rules. That is why it’s called machine learning’. He likened the activity of trying to understand how AI came to its conclusions to the discipline of psychology, which seeks to understand why people behave in a certain way. ‘Just looking at the brain and the way it is connected, you can have an idea of what’s happening, because there are different areas of the brain that are associated with different things. But the processing side is still a massive area of learning’, he explained. ‘It’s the same with machine learning – although much simpler. You know what the connections are, but you don’t know for any given input, the sort of abstraction that it will make’.
Others had raised concerns arising from lack of explainability, one of which related to who could be held accountable for the conclusions reached by an AES. As one University College London professor put it: ‘With humans there is accountability and exercise of power. What am I going to do, fire the AI if it’s incorrect? Who takes responsibility’? (Niemtus and Parker, 2022).
Another ethical concern related to the potential dehumanisation of learning. By their very nature, AIEd and AES systems sought to perform functions that were traditionally reserved for human beings. With education being so predicated on human interaction, some expressed concern at the consequences of removing humans from part of the process (Lewis, 2013; Comeau, 2019).
A third ethical concern related to the commercialisation and potential misuse of data. To date, most AIEd and AES systems had been developed by large corporations. Holmes (2022), for example, saw this as ‘the commercialisation of education by stealth, as education systems increasingly rely on educational tools provided by the commercial sector’. For his part, Aloisi was not necessarily opposed to this commercialisation, but believed that there had to be a regulatory framework to facilitate this involvement.
A final ethical concern was that of bias and potential exacerbation of inequality. On the face of it, because AES did not involve a human marker, such systems had the potential to be completely unbiased. Lewis (2013) wrote: ‘No human grader can be completely objective, even if the author of the essay is unknown. Certain writing styles and choices of topic or language can affect a human grader if only on a subconscious level. For a professor who interacts with students on a regular basis the possibility of bias entering into the grading process is a very real possibility. Favoured students are more likely to be graded leniently while out-of-favour students may be held to a stricter standard. A computer is not affected by such considerations’. However, there were still biases in AES. A study conducted in 2021 had shown a small, but significant, bias in AES against male upper elementary school learners. This bias was partly linked to essay word count. Removing word count did reduce bias marginally, but it also reduced each model’s scoring performance (Litman et al., 2021).
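One simple way to look for this kind of bias is to compare, group by group, how far machine marks sit above or below human marks for the same essays. The sketch below is an illustrative assumption only: the helper `mean_score_gap_by_group`, the group labels and the scores are invented, and this is not the method used by Litman et al. (2021).

```python
from collections import defaultdict
from statistics import mean


def mean_score_gap_by_group(records):
    """Average (machine - human) score per group.

    A negative value means the machine marked that group lower than the human markers did.
    """
    gaps = defaultdict(list)
    for group, machine_score, human_score in records:
        gaps[group].append(machine_score - human_score)
    return {group: mean(values) for group, values in gaps.items()}


# Made-up records: (group label, machine score, human score).
records = [
    ("A", 3, 4), ("A", 4, 4), ("A", 2, 3), ("A", 5, 5),
    ("B", 4, 4), ("B", 5, 4), ("B", 3, 4), ("B", 2, 2),
]

gaps = mean_score_gap_by_group(records)
for group, gap in sorted(gaps.items()):
    print(f"group {group}: mean machine-minus-human gap = {gap}")
```

A gap that differs between groups is only a starting point: a fuller analysis would test whether the difference is statistically significant and whether it is explained by features such as word count.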
ChatGPT and large language models: a game changer?
Aloisi believed that the advent of large, pre-trained language models (also known as transformer-based models) was potentially game changing for AES. When he and his colleagues first started researching AES in 2018, these models did not exist, but in 2019, when large language models such as BERT (Bidirectional Encoder Representations from Transformers) started to appear, they started investigating the implications of these models for essay scoring. It was clear to them that these models were more accurate, and that they would continue to become more accurate as time went on. GPT-3.5 and ChatGPT now showed potential to take this even further.
Christodoulou (2023) conducted an experiment using ChatGPT to test whether it was possible to game the system in the same way as it had been possible to game earlier AES systems. She found that while ChatGPT was wise to certain tactics, it did not pick up others, and she concluded that although it was hard to game the system, it was not impossible.
Aloisi remarked that these models seemed to address some of the issues of explainability. ‘These days with generative AI, you can feed it an answer, you can feed it a mark scheme, and it will tell you a score and tell you why’, noted Aloisi. He was not wholly convinced, however. ‘It’s not totally accurate. The explainability issue has not been completely resolved, but the systems have got better at giving explanations. In terms of how the system works, they are still black box systems. They are no more transparent than they were 5 years ago. They just crunch more data’, he said.
‘People can say that even people are black boxes: that they find it difficult to explain why they know something. But with people, you can keep probing. This is something that has only recently been made possible with ChatGPT. But what people have, that large, pre-trained language models don’t have, is “direct experience of the world.” We live in the world that we talk about. Whereas ChatGPT is only taught about the world. We can philosophise and say that even what we know about the world is mediated – I’m not claiming that humans are special in any way, it may just be a quantitative difference. But in my opinion, there is a huge quantitative gap between the way in which a person can access that academic judgement, compared to a piece of software’.
Large language models could also be trained with very few papers, in what was known as ‘few-shot’, ‘one-shot’ and ‘zero-shot’ learning. ‘That’s why ChatGPT can work’, observed Aloisi. ‘You don’t need that many essays to train the system’. But, he noted, the accuracy of the systems diminished in zero-shot learning. ‘Zero-shot learning means that you’re getting accuracy of between 70% and 80%, which is fantastic from an R&D perspective, but if it’s your children it’s not good enough’.
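One common reading of zero-shot and few-shot marking is that the model receives a mark scheme and an answer in a prompt, with no, one or a few already-marked example essays included. The sketch below only assembles such a prompt as text. The function `build_marking_prompt`, the prompt wording, the mark scheme and the sample answers are all hypothetical, and no particular model or API is assumed.

```python
from typing import Sequence, Tuple


def build_marking_prompt(mark_scheme: str, answer: str,
                         examples: Sequence[Tuple[str, int]] = ()) -> str:
    """Assemble a marking prompt; an empty `examples` gives a zero-shot prompt."""
    parts = ["You are marking a student essay.", "Mark scheme:", mark_scheme]
    # Each already-marked example turns the prompt into a one-shot or few-shot prompt.
    for i, (example_text, mark) in enumerate(examples, start=1):
        parts += [f"Example {i} (awarded {mark} marks):", example_text]
    parts += [
        "Answer to mark:",
        answer,
        "Give a mark and explain which parts of the mark scheme justify it.",
    ]
    return "\n".join(parts)


# Invented mark scheme and answers, for illustration only.
mark_scheme = "1 mark per relevant cause of the conflict, up to 4 marks."
answer = "Alliances and an arms race made a local dispute spread."
examples = [
    ("The war started because of alliances.", 1),
    ("Alliances, militarism and nationalism all played a part.", 3),
]

zero_shot = build_marking_prompt(mark_scheme, answer)
few_shot = build_marking_prompt(mark_scheme, answer, examples)
print(few_shot)
```

Asking for a justification alongside the mark reflects the kind of explanation Aloisi described, but the text returned would still be output from a black-box system rather than evidence of how the mark was reached.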
He added: ‘My argument is not that these things should not be used. My argument is that at the moment, because of the explainability issue, they are hard to scale in a high-stakes context, because you end up having to do so much quality assurance that you might as well pay a person to do it in the first place’.
‘And I know that time will prove me wrong, because once you have the technology and people start to use the technology,
Ethical guidelines
There seemed to be a global convergence around five principles for the ethical use of AI: transparency, justice and fairness, non-maleficence, responsibility and privacy (Jobin et al., 2019). Underpinning these principles seemed to be a consensus that complex normative questions could not be solved with ‘good’ design alone, and that while checklists made complex ethical debates appear straightforward, they did so in a conceptually shallow manner (Mittelstadt, 2019). Complexities included how imperfections in data might significantly impact AI-generated results, and how the algorithms underpinning particular AI-based tools could be quite simple while their results remained too complex for users to interpret (Rahwan et al., 2019).
In considering these issues, Aloisi believed that a possible starting point for developing robust AI-based assessments could be to identify the qualities of a good assessment and of a good assessor and to make sure that these qualities were inherent in AES systems. Furthermore, to get started, ‘there are several AI ethics frameworks that can guide you’, he pointed out. In the UK, a non-governmental Institute for Ethical AI in Education, established in 2018 to develop agreed principles for the ethical use of AIEd, had identified nine factors that should be taken into consideration when using AIEd (see Appendix D). Applicable to all AI, independent of industry sectors, the European Union (EU) had, in 2019, developed an AI Ethics Framework that was based on seven principles: human agency and oversight; technical robustness and safety; privacy and data governance; transparency; diversity, non-discrimination and fairness; societal and environmental wellbeing; and accountability. This framework aligned with the broad areas identified in the literature and, as an example from the private sector, by Google (see Exhibit 3).
Juxtaposing the EU AI ethics framework, academic literature and Google. Source: Based on EU (2019), Jobin et al. (2019) and Google (n.d.)
The EU (2022) had taken the AI Ethics framework a step further and developed guidelines for the ethical use of 1. 2. 3. 4. Justified choice: Using ‘knowledge, facts, and data to justify necessary or appropriate collective choices by multiple stakeholders in the school environment’. The EU said that this factor required ‘transparency and is based on participatory and collaborative models of decision-making, as well as explainability’.
In addition, it had developed specific guidelines for the use of • • •
Thinking it through
Aloisi reflected on research that said that the ambiguity and caution over the use of AIEd could be explained by the fact that it was at ‘an emerging stage of hype, with over-optimism regarding the potential to transform existing education’ (Humble and Mozelius, 2022 p.9). These authors had observed that a 90:10 phenomenon prevailed in AIEd, where 90% of the technology was working as it should, but the remaining 10% had the potential to cause the systems to fail. It seemed to him that this held true, but he had moved on from the perspective that these failings should completely prevent the use of AES in the kinds of settings in which AQA operated.
His thinking about AES had moved from explainability, bias, and reliability, to the question of what features and qualities were necessary for AI to work alongside people in an exam situation. ‘I’m looking at trust, the components of trust, AI ethics, AI principles. The question for me is how we move incrementally from the paradigm of “humans assessing humans” to “humans with machines assessing humans”; how to make incremental changes that will make it easier to integrate AI technology into essay scoring; and how we can do this in an ethical way, so that people don’t end up serving the machine’?
These were Aloisi’s thoughts as he prepared his presentation for the forthcoming conference. He wanted to be able to make concrete recommendations and asked himself ‘How can we do this’?
Supplemental Material
Supplemental Material for Evaluating the ethics of machines assessing humans the case of AQA: An assessment organisation and exam board in England by Isabel Fischer in Journal of Information Technology Teaching Cases
