Introduction
Feedback plays a crucial role in the process of learning English as a Second Language (ESL; Cao et al., 2022; Hyland & Hyland, 2006), as it fuels student motivation and achievement (Cauley & McMillan, 2010). Different types of feedback exist, such as self-feedback (SF), teacher feedback (TF), and computer-generated feedback (CF; Hattie & Timperley, 2007; Lipnevich & Smith, 2022). However, conflicting results have emerged from previous research when comparing the effectiveness of these feedback types. Some studies indicated that TF was superior to CF in identifying grammatical errors and improving overall writing quality (Kaivanpanah et al., 2020; Park, 2019). Conversely, other scholars argued that CF surpassed TF in reducing grammatical errors and positively impacting ESL learners’ writing ability (Hernández Puertas, 2018; Sistani & Tabatabaei, 2023). Moreover, TF and CF could eventually transition into SF (Lipnevich & Smith, 2022). Given the diversity of opinions and findings, further research is necessary to determine the optimal feedback approach for ESL learners.
ChatGPT, developed by OpenAI, is an AI-powered chatbot that has been hailed as a game-changer for ESL learners. While qualitative studies show its potential for ESL learning (Kasneci et al., 2023; Kuhail et al., 2023), experimental research on the effectiveness of its generated feedback is still scarce. To bridge this research gap, the present study assesses the impact of ChatGPT feedback, compared with TF and SF, on the translation performance of advanced ESL learners, specifically Master of Translation and Interpreting (MTI) students in China. This comparative analysis examines overall translation quality (based on BLEU scores) as well as linguistic features such as lexicon, syntax, and cohesion in the students’ revised translation texts across the three feedback types. The findings will shed light on the advantages and disadvantages of using an AI chatbot for feedback in the context of translation practice.
Self-Feedback Versus Teacher Feedback Versus Computer-Generated Feedback
Self-feedback (SF), as a self-regulated learning practice, often involves learners detecting and correcting their own mistakes based on prior knowledge and experience. It is highly recommended for practical use in ESL classrooms, as it provides opportunities for students to critically evaluate their texts and cultivate meta-awareness and autonomy in learning (Cahyono & Rosyida, 2016). Additionally, SF can increase student motivation and active participation in second-language writing, as well as create a self-paced learning environment (Miranty & Widiati, 2021; Yu, Jiang, & Zhou, 2020). However, SF may prove counterproductive if students’ language proficiency is insufficient for independently identifying and rectifying all errors (Srichanyachon, 2011). In such cases, students might inadvertently reinforce incorrect language patterns without proper guidance.
Teacher feedback (TF) is the response given by instructors to help learners identify and revise mistakes and encourage them to engage in learning activities. Learners often perceive it as more valuable and reliable because teachers are always seen as subject experts (Guasch et al., 2013). In addition, TF can enhance learners’ confidence in second-language writing and create a sense of encouragement and interest among students (Ruegg, 2018; Srichanyachon, 2012). However, TF also has drawbacks. Time constraints make it challenging for teachers to consistently provide meaningful feedback to all students (Gul et al., 2016; Zou et al., 2023). Besides, over-reliance on TF can hinder students’ ability to critically self-assess, leading them to obediently implement corrections without analyzing their own writing (Mikume & Oyoo, 2010).
Computer-generated feedback (CF) refers to the automated responses provided by software programs to assist learners in identifying errors and suggesting improvements. Typical software programs are Grammarly (Koltovskaia, 2023), Pigai Wang (Bai & Hu, 2016), and Criterion (Li et al., 2015). CF has been found to benefit ESL learners in several ways. Firstly, these programs provide feedback in a short time and allow students to revise and practice their writing unlimited times, thus facilitating their learning process (G. Cheng, 2017). Secondly, CF can help alleviate students’ writing anxiety and embarrassment, as they receive feedback in a non-judgmental manner (Kukulska-Hulme & Viberg, 2018). Lastly, CF can guide instructors to focus on broader writing concepts rather than minor error correction, enabling them to provide more comprehensive instruction (Taskiran & Goksel, 2022). However, concerns do exist regarding CF, as it can sometimes be generic, repetitive, or even incorrect (Dikli, 2010; Jiang & Yu, 2022).
Prior studies have yielded inconsistent findings when evaluating the efficacy of TF and CF. Dikli and Bleyle (2014) asserted that TF was more concise, focused, and tailored, whereas CF tended to be redundant or unusable, as noted in Dikli (2010). Similarly, Kaivanpanah et al. (2020) and Park (2019) discovered that TF surpassed Grammar Checker-based feedback because teachers could identify more grammatical errors and improve lexical processing. In contrast, Sistani and Tabatabaei (2023) reported that Grammarly-based feedback outperformed TF, owing to its ability to reduce grammatical errors and even improve academic writing (Hernández Puertas, 2018). In Z. Wang and Han’s (2022) study, TF improved writing quality whereas CF (i.e., Pigai Wang) could increase students’ overall writing proficiency. Additionally, it was reported that TF had unique strengths in promoting ESL learning, such as enhancing cognitive engagement (Zou et al., 2023).
However, the aforementioned studies had issues that require further research in three respects: (1) the variety of tools employed for CF across studies, such as Grammarly or Pigai Wang, may lead to conflicting results; (2) these studies focus predominantly on beginner and intermediate ESL learners, leaving advanced learners unexplored; and (3) the sample sizes in previous studies were relatively small (e.g., 14 participants in Dikli and Bleyle [2014]), making the comparative results less reliable. More importantly, to the best of our knowledge, no research has yet conducted a comprehensive comparison of CF, TF, and SF within a single investigation, and it remains uncertain which feedback type is most effective in improving ESL learners’ performance.
ChatGPT as a Computer-Generated Feedback Tool
ChatGPT is a chatbot launched by OpenAI in November 2022 (OpenAI, 2022). It is built on large language models, specifically GPT-4, to perform natural language processing tasks such as writing, summarizing, translating, and answering questions (Kocoń et al., 2023; Y. Liu et al., 2023; Shen et al., 2023). ChatGPT performs well in these tasks thanks to its two-stage extensive training on around 45 terabytes of web data (Dwivedi et al., 2023; Zhou, Müller, et al., 2023). Beyond training by its developers, ChatGPT also learns from everyday users, who can upvote or downvote responses or provide textual feedback to improve the chatbot’s output.
Recent studies have explored the potential of ChatGPT feedback as educational assistance, focusing on both teaching and learning. For teachers, ChatGPT can produce feedback relevant to improving classroom instruction, but its feedback may lack insightful and novel content (R. E. Wang & Demszky, 2023). This might be attributed to the quality of the data it was fed, as ChatGPT relies solely on statistical patterns learned from its training data (Grassini, 2023). As for students, ChatGPT feedback tends to be more detailed, fluent, and coherent, especially when evaluating data science proposal reports (Dai et al., 2023). In addition, ChatGPT feedback may improve students’ task performance in other subjects such as programming problem-solving (Hellas et al., 2023) and argumentative essay writing (Su et al., 2023).
Theoretically, previous studies have suggested that ChatGPT feedback may benefit ESL learners. According to Hong (2023), ChatGPT provides instant and personalized feedback, which allows learners to make real-time improvements. Besides, S. Kim et al. (2023) claimed that feedback generated by ChatGPT is unlimited, providing students with ample opportunities for practice and refinement. In terms of language use, G. Liu and Ma (2023) found that interactions with ChatGPT can expose ESL learners to authentic language contexts, thereby enhancing their proficiency in a subconscious manner.
Empirically, however, only a limited number of studies have investigated teachers’ and students’ perceptions of ChatGPT feedback within ESL learning. For instance, Mohamed (2023) and Nguyen (2023) conducted interviews with teachers, who viewed ChatGPT as an affordable and convenient tool for providing feedback. Similarly, Schmidt-Fajlik (2023) surveyed Japanese university students about their feelings toward ChatGPT feedback; the majority expressed positive sentiments, with 89.86% of students reporting that “ChatGPT is easy to use.”
Despite these findings, several issues persist in the existing literature. First, most studies on ChatGPT feedback have predominantly focused on theoretical frameworks, with few employing empirical methodologies, leading to results that are often subjective and potentially unreliable. Second, the empirical studies that do exist have primarily explored self-reported attitudes toward ChatGPT feedback, which does not adequately address the actual effectiveness of such feedback for ESL learners. Third, the linguistic dimensions, which are essential for ESL learning, have largely been overlooked in assessing the impact of ChatGPT feedback. Given these gaps, further research is necessary to develop a comprehensive understanding of the efficacy of ChatGPT feedback, particularly in relation to linguistic dimensions, for ESL learners.
Feedback upon Written Translation
Translation is the process of transferring messages across languages and cultures. It is often regarded as the fifth basic language skill for ESL students, alongside listening, speaking, reading, and writing. Improving translation quality has been a key focus of ESL learning in recent years (Drugan, 2013), and research has shown that feedback, such as suggestions on language use, can help students improve their translations and prepare them for professional work (Alfayyadh, 2016).
Studies of translation feedback have centered on TF and SF, with few delving into CF, perhaps owing to a lack of specialized feedback systems for translation students. In terms of TF, students often reported not getting enough useful feedback from teachers (Alsahli, 2012). This insufficiency might stem from the labor-intensive and time-consuming nature of giving feedback to a large cohort of translation students. TF requires instructors to compare the source text with the target text, which may lead to prolonged waiting periods and even demotivate students (C. Han & Lu, 2021; C. Liu & Yu, 2019). Several other studies have investigated the efficacy of SF, finding that it helps student translators gain more awareness of their role in a translation task (Mellinger, 2019; Pietrzak, 2022). Nonetheless, SF is constrained by students’ translation experience, making it hard for them to spot or fix mistakes (Kasperavičienė & Horbačauskienė, 2020).
In light of these challenges, the novel AI tool ChatGPT may serve as an automatic translation evaluation tool, reducing the teacher’s workload and providing students with quick, detailed feedback (Frąckiewicz, 2023). ChatGPT offers real-time responses by comparing source and target texts, helping students identify mistakes and improve their self-editing skills. However, no study, to our knowledge, has directly compared TF and SF with ChatGPT feedback in terms of improving translation quality. This study is critical as it fills a gap in feedback research by examining how different feedback types influence non-native English speakers’ translation quality and linguistic dimensions.
Automatic Evaluation of Translation Quality
When it comes to evaluating translation quality, previous studies have commonly relied on automatic evaluation metrics such as the BLEU score (Koehn, 2010; Papineni et al., 2002). The BLEU score quantifies the similarity between a candidate translation and a reference translation, with a higher score indicating closer alignment to the reference (L. Han et al., 2021). Although the BLEU score was initially designed for machine translation evaluation, it has proven applicable to assessing the quality of human-produced translation texts as well (Chung, 2020; C. Han & Lu, 2021). Even a small increase of 0.02 in the BLEU score is regarded as a significant advancement (e.g., Bechara et al., 2011; Y. Cheng et al., 2019). Chung (2020) found a strong correlation between BLEU scores and human evaluation when assessing 120 German-to-Korean translations produced by 10 MTI students. Inspired by Chung (2020), C. Han and Lu (2021) further validated the feasibility of using the BLEU score to assess students’ English-to-Chinese interpretation.
In addition to the BLEU score, linguistic dimensions play a crucial role in translation quality (Sofyan & Tarigan, 2019), but to the best of the researchers’ knowledge, only one study has focused on this area so far. Specifically, J. Q. Wang et al. (2021) examined the lexical performance of students’ translation texts in terms of six metrics: word count, word length, lexical complexity, word range, word density, and semantic elements. However, that evaluation did not use statistical methods (e.g., Confirmatory Factor Analysis) to verify whether these metrics could predict the lexical performance of students’ translation texts. Moreover, it overlooked other linguistic features, such as syntax and cohesion. Given these gaps, the present study proposes a more comprehensive scoring system to assess the quality of student translations.
In the present study, we combined the BLEU score with three linguistic dimensions—lexicon, syntax, and cohesion—to develop a new scoring scheme for translation (Figure 1). The BLEU score is used to assess overall translation quality, while the linguistic dimensions are predicted using seven indicators to evaluate students’ language features. For the lexicon, two indicators are considered: word length and hypernymy for verbs. Word length, as suggested by J. Q. Wang et al. (2021), serves as a measure of lexical performance, indicating that proficient translations should incorporate both longer and shorter words. Hypernymy for verbs, discussed by Ouyang et al. (2021), assesses the precision of students’ word choices: basic texts use less specific verbs, while advanced texts employ more specific verbs, resulting in a higher average verb hypernymy score in the latter (Crossley et al., 2012).
Figure 1. The new scoring scheme for translation quality.
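For intuition about the verb hypernymy indicator, Coh-Metrix’s WRDHYPv is derived from WordNet hypernym levels. Below is a rough, illustrative Python approximation (not the actual Coh-Metrix computation), in which greater depth of a verb’s WordNet synset stands in for greater verb specificity:

```python
# Rough illustrative approximation only (not Coh-Metrix itself): estimate
# verb specificity by the depth of a verb's first WordNet synset, where
# deeper synsets correspond to more specific meanings.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def verb_hypernymy_depth(verb: str) -> int:
    """Depth of the verb's most common WordNet sense in the hypernym hierarchy."""
    synsets = wn.synsets(verb, pos=wn.VERB)
    return synsets[0].min_depth() if synsets else 0

# A generic verb should score shallower than a more precise one, mirroring
# the idea that advanced texts favor more specific verbs.
for verb in ["move", "walk", "stroll"]:
    print(verb, verb_hypernymy_depth(verb))
```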
Regarding syntax, three key indicators were identified: syntactic similarity, verb phrase density, and agentless passive voice usage. First, syntactic similarity can reflect the fluency of translations (Polio & Yoon, 2018; Sennrich, 2015). Second, verb phrase density is a significant factor to consider, as studies have shown that ESL learners tend to underutilize verb phrases in comparison with native speakers (Wu et al., 2020); higher verb phrase density may indicate that students are approaching a more native-like syntactic mastery. Third, passive voice usage was chosen because Chinese-to-English translation often requires converting the Chinese active voice into the English passive voice (Xu et al., 2023). The capability of switching between active and passive voice across two languages reflects both a strong understanding of how each language works and good translation skills.
In the domain of cohesion, two indicators were employed, namely referential cohesion and deep cohesion. Referential cohesion was chosen because it involves the use of pronouns, demonstratives, repetition, synonyms, and other cohesive devices to establish connections between ideas (Armstrong, 1991; Hall et al., 2016). Skilled translators can adapt their use of referential cohesion according to the norms of the target language to enhance clarity and coherence (Károly, 2014; Ong, 2011). Deep cohesion was included as it assesses the overall organization and connectivity of ideas by examining the causal and intentional relationships between concepts (McNamara et al., 2014). Strong deep cohesion means high logical flow and readability (Hall et al., 2016).
Research Questions
In a nutshell, previous studies have made three achievements. Firstly, CF, TF, and SF each have unique strengths and weaknesses for improving English writing. Secondly, ChatGPT can support both ESL teaching and learning. Thirdly, theoretical studies have shown that ChatGPT can deliver immediate, tailored, and interactive feedback for ESL learners. Despite these insights, the effectiveness of ChatGPT feedback compared with TF and SF in improving students’ translation quality remains unknown. Hence, the present study sets out to answer two research questions (RQs) regarding ChatGPT feedback in the context of Chinese-to-English translation:
Method
Participants
The present study investigated a sample of 45 MTI students (39 females and 6 males) enrolled at a prestigious university (Top 10) in China. Ranging from 23 to 26 years old (
Materials
The experiment included a Chinese-to-English translation task, which utilized a 424-character source text in Chinese. This text was extracted from an official press release published in 2020 on the government website of Hubei Province, China, during the COVID-19 pandemic (Hubei Provincial Government, 2020). Participants were informed that the English translation would be published alongside the Chinese source text, with the aim of providing foreign readers with updates about the pandemic. This particular document was selected as the translation material for several key reasons. First, the text difficulty was analyzed using the Chinese Resource Platform (http://120.27.70.114:8000/analysis_a), which indicated that it was easily comprehensible with no major difficulties. This allowed students’ translation capabilities to be tested without the confound of source text complexity. Second, as the text originated from an official press release, its language quality was high, with strict editing for grammar and spelling. This prevented issues with low-quality input text from negatively impacting students’ performance (Yoshimi, 2001). Third, the text has strong local relevance, as it comes from a Chinese provincial government website. Using regionally representative data from China provides a more accurate evaluation of the effectiveness of ChatGPT feedback in the Chinese linguistic and cultural context. In short, the selected material presented an optimal balance of difficulty, language quality, and cultural considerations for assessing Chinese-to-English translation competence within the experimental constraints. Importantly, no reference translation existed for this source text, which ensured that students could not rely on or be influenced by official translations.
Procedure
In order to collect data, the authors collaborated with an English teacher from the aforementioned university. The experiment was conducted during a compulsory course, and the teacher instructed her students (the participants) to translate the provided Chinese press release into English as an assignment. Participants had experience with both SF and TF, but not with CF (ChatGPT feedback in this context). To collect data from the three types of feedback (i.e., SF, TF, and ChatGPT feedback), participants were first asked to revise their initial translation texts by themselves. They were required to submit their draft translations and the revised translation texts with embedded self-feedback notes (SF-finalized versions). Two weeks later, the same students received the teacher’s feedback notes on their initial drafts and revised accordingly, generating TF-finalized versions. Finally, after another two weeks, all students received ChatGPT feedback on their original drafts (the corresponding author used ChatGPT-4 to produce the feedback) and produced ChatGPT-feedback-finalized versions. To generate the feedback, the author used a standardized prompt for each initial translation: “Please provide detailed feedback on the following student translation. Original Text: […]. Student Translation: […].” To maintain consistency across students, this same prompt was used for all draft translations. The authors, rather than the students, submitted the translation texts to ChatGPT for feedback. This approach prevented students from interacting with ChatGPT directly, as such interaction could introduce numerous uncontrolled variables (e.g., variations in prompts) that might affect the results. Notably, the deliberate two-week intervals between the three submissions were strategically incorporated to avoid carry-over effects, that is, to prevent recall of details from previous tasks (Bordens & Abbott, 2002).
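For readers who wish to reproduce this step programmatically, the sketch below shows how the same standardized prompt could be issued through OpenAI’s Python client. This is a hypothetical illustration: the study itself used the ChatGPT-4 interface interactively, so the model identifier and client settings here are assumptions.

```python
# Hypothetical sketch of issuing the study's standardized prompt through
# OpenAI's Python client. The study used the ChatGPT-4 interface directly,
# so the model name and client settings here are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Please provide detailed feedback on the following student translation. "
    "Original Text: {source}. Student Translation: {translation}."
)

def get_feedback(source_text: str, student_translation: str) -> str:
    """Return ChatGPT feedback for one student draft."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": PROMPT.format(source=source_text,
                                     translation=student_translation),
        }],
    )
    return response.choices[0].message.content
```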
Additionally, during the three revision processes, participants were instructed not to use any AI tools (e.g., machine translation) or external resources such as dictionaries or grammar books. To reinforce compliance, students were warned that the teacher could detect the use of machine translation, which would affect their grades for the course.
Data Coding
A total of 135 translation texts (45 × 3) were collected from the three feedback revisions (ChatGPT feedback, TF, and SF). First, the data were analyzed using the BLEU score to examine overall translation quality, following J. Q. Wang et al.’s (2021) paradigm for calculating it. As the BLEU score compares the similarity between a candidate translation and a reference translation, we recruited four professional translators to produce four reference translations. Since BLEU automates comparison across multiple references, it allows efficient, consistent scoring of the 135 student translations in our study. Following J. Q. Wang et al. (2021), each student translation was scored against the 4 reference translations, producing 4 individual BLEU scores per translation. We then took the average of these 4 scores as the final BLEU score for each translation. Averaging the scores from multiple references provided a robust assessment while reducing potential bias from any individual reference translation.
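As a minimal sketch of this averaging procedure, assuming NLTK’s sentence-level BLEU with simple whitespace tokenization and smoothing (the study’s exact toolchain is not specified):

```python
# Minimal sketch of the multi-reference averaging procedure described above,
# using NLTK's sentence-level BLEU. The study's exact toolchain is not
# specified; whitespace tokenization and method1 smoothing are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def mean_bleu(candidate: str, references: list[str]) -> float:
    """Score a translation against each reference separately, then average."""
    cand_tokens = candidate.lower().split()
    scores = [
        sentence_bleu([ref.lower().split()], cand_tokens,
                      smoothing_function=smooth)
        for ref in references
    ]
    return sum(scores) / len(scores)

# Usage: one student translation against the four professional references.
# final_score = mean_bleu(student_text, [ref1, ref2, ref3, ref4])
```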
Following that, we utilized Coh-Metrix to obtain data on seven linguistic indicators (i.e., word length [DESWLlt], hypernymy for verbs [WRDHYPv], verb phrase density [DRVP], agentless passive voice density [DRPVAL], syntactic similarity [SYNSTRUTt], deep cohesion [PCDCz], and referential cohesion [PCREFp]) to predict the three linguistic dimensions (i.e., lexicon, syntax, and cohesion; Table 1). Coh-Metrix is an automated text analysis tool (McNamara et al., 2014). According to Ouyang et al. (2021), the scores generated by Coh-Metrix correlate significantly with human scoring of translation quality, which indicates that Coh-Metrix is a reliable tool for collecting data on the linguistic features of translation quality.
Table 1. Coding of the New Scoring Scheme for Translation Quality.
Data Analysis
The study began with the use of Confirmatory Factor Analysis (CFA) to validate a model that consists of three latent factors—namely, lexicon, syntax, and cohesion. This analysis was conducted using the “
After CFA, Structural Equation Modeling (SEM) was also executed using the same “
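Although the study’s CFA and SEM were run with an R package, an analogous Python sketch using the semopy library is shown below. The model syntax is inferred from the seven indicators and three latent factors described above; the data file, column names, and numeric coding of feedback type are assumptions for illustration.

```python
# Analogous CFA/SEM sketch in Python with the semopy package (the study
# used an R package). The measurement model maps the seven Coh-Metrix
# indicators onto the three latent factors; the structural part regresses
# the factors on feedback type. Data file, column names, and the numeric
# coding of 'Type' are assumptions.
import pandas as pd
import semopy

MODEL_DESC = """
Lexicon =~ DESWLlt + WRDHYPv
Syntax =~ DRVP + DRPVAL + SYNSTRUTt
Cohesion =~ PCREFp + PCDCz
Lexicon ~ Type
Syntax ~ Type
Cohesion ~ Type
"""

df = pd.read_csv("translations.csv")  # assumed: one row per translation text
model = semopy.Model(MODEL_DESC)
model.fit(df)
print(semopy.calc_stats(model))  # fit indices: chi-square, CFI, RMSEA, SRMR
print(model.inspect())           # factor loadings and structural paths
```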
Lastly, the study conducted two rounds of one-way analysis of variance (ANOVA) using the EMMEANS function in the bruceR package (Bao, 2023). The first ANOVA evaluated the impact of the different types of feedback (SF, TF, and ChatGPT feedback) on the three latent factors. The second ANOVA examined how these feedback types affected the seven directly measured linguistic indicators. Conducting two separate ANOVAs enabled the study to scrutinize feedback effects at both the latent and observed levels. To avoid Type I error, we applied a Bonferroni adjustment to the alpha level (.05). Where significant effects were identified in the ANOVAs, post-hoc Tukey HSD tests were performed for pairwise comparisons (Lenth et al., 2023).
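An equivalent one-way ANOVA with post-hoc Tukey HSD comparisons can be sketched in Python with statsmodels (the study used R’s bruceR); the data file and column names below are assumptions.

```python
# Equivalent one-way ANOVA plus Tukey HSD post-hoc comparisons in Python
# with statsmodels (the study used R's bruceR). Assumes a long-format frame
# with one row per translation: a 'Type' column (SF/TF/ChatGPT) and one
# column per indicator (e.g., DESWLlt).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("translations.csv")  # assumed data file

# One-way ANOVA: effect of feedback Type on one indicator (word length here).
fit = smf.ols("DESWLlt ~ C(Type)", data=df).fit()
print(anova_lm(fit))

# Pairwise post-hoc comparisons, as in the study's follow-up tests.
print(pairwise_tukeyhsd(df["DESWLlt"], df["Type"], alpha=0.05))
```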
Results
CFA Analysis
CFA results showed that the model fits the data very well, with statistical indices nearing ideal values (χ2/
Table 2. Result of Structural Validity Analysis.
SEM Analysis
Upon this validated model, SEM was applied and also showed an excellent fit to the data (χ2/
Figure 2. Structural equation model with “Type” as predictors of three linguistic factors.
Evaluation of Overall Translation Quality
The results showed that the average BLEU score for students’ draft translations was 0.466, while the averages for the three revised translations based on SF, TF, and ChatGPT feedback were 0.485, 0.501, and 0.472, respectively. It is important to note that an increase of 0.02 in the BLEU score is widely considered a significant improvement in translation quality (e.g., Bechara et al., 2011). The revised translations based on TF therefore scored the highest, whereas those based on ChatGPT feedback scored the lowest, indicating that TF was the most effective in enhancing the overall quality of students’ translations, compared with ChatGPT feedback and SF.
Comparing Linguistic Features Across Feedback Types
Table 3 shows the results of the first one-way ANOVA. The independent variable was “Type” (SF, TF, and ChatGPT feedback) and the dependent variables were the three latent linguistic features (lexicon, syntax, and cohesion). The results showed a significant main effect of “Type” on the lexicon (
Table 3. Three Linguistic Features Across Feedback Types.
Figure 3. Mean of three linguistic features.
As for syntax, TF scored higher than ChatGPT feedback (β (ChatGPT feedback − TF) = −32.100,
In terms of cohesion, no significant differences were found across three feedback types (β (TF − SF) = 4.679,
A second round of ANOVA tested the effect of “Type” (SF, TF, and ChatGPT feedback) on the seven specific observable linguistic indicators (see Table 4). It found that five out of the seven indicators were significantly affected by “Type”: DESWLlt/word length (
Table 4. Seven Linguistic Indicators Across Feedback Types.
Post-hoc tests in Figure 4 show that, at the lexical level, ChatGPT feedback elicited translations with longer word length (β (ChatGPT feedback − SF) = .293,
Figure 4. Mean of seven linguistic indicators.
With regard to syntax, ChatGPT feedback resulted in translations with lower verb phrase density compared with TF (β (ChatGPT feedback − TF) = −34.140,
When it comes to cohesion, ChatGPT feedback demonstrated higher referential cohesion than TF (β (ChatGPT feedback − TF) = 14.558,
Discussion
The present study assessed the overall translation quality through BLEU score and relevant linguistic dimensions using Coh-Metrix, so as to evaluate ChatGPT’s merits and drawbacks in generating feedback for translation practice. The results showed that both TF and SF outperformed ChatGPT feedback in improving the overall translation quality. Regarding linguistic features, we found that ChatGPT feedback showed greater gains than TF and SF in bolstering students’ lexical capabilities. However, for syntactic improvement, ChatGPT was less useful than TF. Moreover, all three feedback types exhibited no significant improvements in cohesion.
We further examined the specific lexical and syntactic components that were strongly affected by each feedback type. Our findings suggested that ChatGPT-feedback-guided translations exhibited greater lexical complexity, characterized by longer average word lengths and more specific verb choices, compared with SF- and TF-based versions. For syntax, however, TF-based translations contained denser verb phrase patterns and increased usage of the agentless passive voice compared with ChatGPT-feedback-guided versions. What follows elaborates on these results.
Overall Translation Quality: TF > SF > ChatGPT Feedback
The results indicated that TF and SF surpassed ChatGPT feedback in improving the overall quality of student translations, as measured by the BLEU score. This observation aligns with recent research by Bašić et al. (2023), which examined students’ essay writing performance with and without the assistance of ChatGPT-3. Although our study utilized ChatGPT-4 instead of ChatGPT-3, we similarly found that ChatGPT did not enhance writing quality in either essays or translations. Furthermore, our findings are consistent with the process-oriented writing theory proposed by Hayes (2012). This theory posits that texts should undergo multiple revisions based on feedback before arriving at a final version. Such iterative revisions can foster students’ reflection, critical thinking, and sense of responsibility, ultimately enhancing their overall writing abilities. In our study, the TF method involved teachers providing constructive suggestions and feedback to encourage students’ reflection and critical thinking. Similarly, in the SF method, students were required to revise their work independently. In this context, ESL students clearly improved their reflection, critical thinking, and responsibility through both the TF and SF methods. In contrast, ChatGPT feedback typically offers direct responses without requiring students to engage in deeper thought. As a result, ESL students may not fully develop their writing abilities when relying solely on ChatGPT feedback.
In our study, three factors may account for the underperformance of ChatGPT feedback compared with TF and SF. First, our participants were advanced ESL learners enrolled in MTI programs. These students already possess sophisticated translation skills, making it a greater challenge for ChatGPT to provide feedback that substantially improves their translation work.
Second, ChatGPT’s training data is limited by its predominantly mono-cultural, English-centric focus (Rettberg, 2022). As a result, it struggles with the nuanced demands of translation, which require not only conveying core meaning but also capturing subtle linguistic and cultural differences (Al-Sofi & Abouabdulqader, 2020; Bassnett, 2007). Our study revealed that ChatGPT frequently failed to detect errors in culturally sensitive translations. For example, it did not catch an error when students translated literally the Chinese word “
” as “
Third, we noted considerable inconsistency in ChatGPT’s feedback across different student translations. While it sometimes identified issues such as incorrect verb tense or inappropriate tone, it failed to consistently highlight similar issues across multiple student translations. This inconsistency can be attributed to ChatGPT’s stochastic nature, which allows it to generate different responses to the same prompt, as discussed by Jalil et al. (2023). This suggests that ChatGPT’s feedback mechanism is still in a developmental stage and is not as reliable as traditional feedback methods.
Despite the aforementioned limitations, our research did identify some areas where ChatGPT exhibited strengths. For instance, it was adept at identifying redundant and verbose expressions, guiding students toward more concise and clear translations. For example, ChatGPT spotted lengthy expressions like “
Lexicon: ChatGPT Feedback > SF = TF
Our statistical analysis revealed that ChatGPT feedback outperformed SF and TF in improving students’ lexical capability. This finding is consistent with Activity Theory (Engeström, 2001). Based on this theory, physical tools (e.g., computers) traditionally mediate human-environment interactions by facilitating physical tasks. In contrast, ChatGPT transcends this conventional role by functioning as both a mediational tool and a semiotic sign. It not only connects students with the world through technology but also provides linguistic scaffolding that directly shapes their cognitive processes. Specifically, its feedback operates symbolically—through lexical and syntactic structures—to prompt learners to expand their vocabulary repertoire and refine active language use.
In our study, one compelling reason behind ChatGPT’s superior performance may lie in its extensive and diverse training data, sourced from billions of text entries such as academic articles, news reports, Wikipedia, and even literary works (Shen et al., 2023). This wide-ranging training not only equips the model with a vast lexical repertoire but also exposes it to a wide range of contextually appropriate vocabulary usage. This finding resonates with recent studies that advocated using ChatGPT for vocabulary enhancement (e.g., Baskara, 2023; Koraishi, 2023).
In fact, we found that ChatGPT feedback encouraged students to use longer words and more specific verbs. For instance, instead of employing simpler phrases like “
Conversely, both SF and TF have intrinsic limitations that make them less effective for vocabulary enhancement. For instance, SF suffers from the constraint of limited personal lexicons and less structured approaches to vocabulary building. Students often stick to the vocabulary they already know and might lack the search skills or self-discipline to incorporate new, more complex words into their translation. TF often centers on more macro-level issues, such as grammatical errors or mistranslations. Teachers may overlook refining word choices if they feel the student’s translation already captures the meaning of the source text (M. Kim, 2009; Wongranu, 2017). Therefore, it may not fine-tune the vocabulary to the same degree that ChatGPT feedback does.
Apparently, no human feedback provider can match ChatGPT’s data-driven vocabulary capabilities enabled by its massive training history. The evidence of marked lexical gains among students in our study strongly supports the integration of ChatGPT into translator education programs, especially for students who aim to improve their vocabulary in a nuanced and comprehensive way.
Syntax: TF = SF > ChatGPT Feedback
The result showed that TF and SF outperformed ChatGPT feedback in developing students’ syntax-related skills. This finding aligns with the internal feedback model proposed by Nicol (2020), which suggests that the core process of SF involves comparing prior knowledge with external information, such as task instructions. In our study, ESL students likely synthesized their past translation experiences with the current task to refine their syntactic choices during the SF task. This effective approach explains the similar improvements in syntax observed between TF and SF.
In our observation, student translations resulting from TF and SF displayed a better grasp of complex sentence structures, such as using more sophisticated verb phrases and appropriate use of the passive voice. In contrast, translations revised via ChatGPT feedback lacked these improvements. This discrepancy can be attributed to three main factors. First of all, ChatGPT has an inherent limitation in that it cannot deeply analyze or comprehend the rules of syntax (Borji, 2023; Chomsky et al., 2023). While human instructors offer nuanced feedback based on the contextual needs of a sentence, ChatGPT’s guidance tends to be more generic and superficial. For instance, it might recommend replacing one phrase with another for “better clarity,” yet it frequently misses underlying syntactic issues. This was evident when we explicitly asked ChatGPT to critique the sentence structure of a complex example: “
The second limitation emerged from ChatGPT’s disinclination toward passive voice. In this regard, our study aligns with AlAfnan and MohdZuki’s (2023) research, revealing ChatGPT’s reluctance to employ passive voice, both in its own writing and in its feedback. This indicates a more systemic limitation: if the model rarely uses passive constructions itself, it is unlikely to offer feedback that helps students understand when and how to effectively implement passive voice. However, passive voice is critical for Chinese-to-English translations, where Chinese sentences often lack a clear agent or subject (Hsiao et al., 2014; Zhiming, 1995). When translating into English, which often demands subjects for grammatical correctness, the ambiguity regarding the “doer” can introduce challenges. Passive voice can resolve such challenges, making translations more natural (Ke, 2023). ChatGPT falls short in this regard, unable to instruct students on how to use passive voice to tackle such challenges.
Lastly, ChatGPT lacks genre-specific feedback. The study used a news release for the translation exercise—a genre that often employs passive voice to maintain a formal, objective tone (Jacobs, 1999). In such contexts, passive constructions are not just permissible but often preferable, shifting the focus from the actor to the action or result. ChatGPT failed to offer the kind of nuanced feedback that would help students understand when and why to use passive voice in such formal settings. However, human teachers are trained to understand that different types of texts—whether news releases, academic papers, or casual conversations—have different language requirements and conventions. They understand the rationale behind these conventions and thus can impart that understanding to their students.
Cohesion: ChatGPT Feedback = TF = SF
The data demonstrated that the three feedback types (ChatGPT feedback, TF, and SF) did not significantly improve overall cohesion in student translations. However, translations revised with ChatGPT feedback did outperform those amended with TF or SF in terms of referential cohesion. Similar to Zhou, Cao, et al.’s (2023) findings, the greater use of referential cohesion indicates ChatGPT’s ability to prompt students, through its feedback, to use more explicit linking devices between ideas, making their translations easier to follow. For instance, a translation revised with ChatGPT feedback might feature an increased frequency of synonyms or strategically employ pronouns like “
Regarding deep cohesion, which involves the use of causal or intentional connectors to develop ideas, none of the feedback types exhibited significant improvement. This observation appears to conflict with Liang and Liu’s (2023) findings that human translations often display better deep cohesion than machine translations. The discrepancy can be attributed to several factors. First, the scope and focus of our research are fundamentally different from those of Liang and Liu (2023). Their study directly compared final translations produced by humans and machines, whereas ours evaluated how different feedback types affected the revisions of texts initially produced by human translators.
Second, it is important to note that the technology underpinning the feedback differs between the studies. Liang and Liu (2023) relied on Google Translate for their evaluation, while we incorporated ChatGPT, a more sophisticated language model that has been shown in recent studies (e.g., Lee, 2023) to possibly surpass Google Translate in terms of translation quality.
Third, the nature of the translation task itself could be an influencing factor. Unlike free-form writing, translation is bound by the content of the source text, which might limit the degree to which deep cohesion can be enhanced. In other words, if the source text lacks elements of deep cohesion, the translated version tends to mirror that lack, and translators may not add connectives to improve cohesion. This perhaps explains why deep cohesion was not significantly improved across all samples.
Conclusion
This study compared ChatGPT feedback, teacher feedback (TF), and self-feedback (SF) for improving translation performance among advanced ESL learners. We assessed how these different types of feedback influenced overall translation quality as well as specific linguistic dimensions, including lexicon, syntax, and text cohesion. Our main findings revealed that ChatGPT feedback lagged behind both SF and TF in boosting overall translation proficiency. While ChatGPT demonstrated efficacy in some linguistic domains, such as vocabulary enrichment and referential cohesion, it was comparatively less adept in bolstering intricate syntactic competencies. The nuanced utilization of verb phrases and passive constructs, in particular, emerged as challenging areas for the AI tool.
All things considered, the findings of the current study contribute to the ongoing discussion about the role of ChatGPT in education, particularly in translator training. On a practical level, our study advocates for a blended instructional approach to translation practice. This approach combines the data-driven advantages of AI tools with nuanced, culture-aware feedback from human experts, creating a more comprehensive learning environment. By harnessing AI’s efficiency alongside the insight of experienced translators, educators can provide students with a richer and more contextualized understanding of translation.
Conversely, it is essential to acknowledge potential drawbacks of ChatGPT in translation-oriented language education. For instance, excessive reliance on ChatGPT may lead to a gradual decline in translators’ proficiency, particularly in translating between L1 and L2. This underscores the importance of maintaining a balanced approach that combines ChatGPT with traditional translation training methods. Specifically, while ChatGPT can assist by providing quick translations and suggestions, it should not replace critical active-learning practices such as hands-on translation exercises, self-feedback, teacher feedback, and in-depth analysis of linguistic nuances.
Limitations and Recommendations for Future Research
This study has four primary limitations that warrant further consideration.
First, the research sample was exclusively composed of advanced ESL learners (MTI students) without controlling for specific demographic variables, and thus it is uncertain how ChatGPT feedback would perform with beginner or intermediate students, or whether demographic factors influenced the study’s findings. This limitation highlights the need for future research to explore ChatGPT’s effectiveness with learners of varying proficiency levels and diverse backgrounds in translation tasks.
Second, the methodology of the present study was limited to a quantitative approach. It would be more beneficial for future studies to incorporate qualitative methods such as classroom observation, diary study, and retrospective interviews. This mixed-method approach would help gain a fuller understanding of students’ perceptions, experiences, and attitudes toward different types of feedback.
Third, the assessment of overall translation quality relied solely on BLEU scores. While BLEU provides rapid and unbiased calculations, combining these scores with evaluations from human raters could enhance the reliability of the findings. Subsequent studies may consider integrating machine-generated scores with human assessments to develop more accurate methods for evaluating translation quality that reflect human judgment.
Lastly, the scope of this study was confined to a single language pair, direction, and text type. To gain a deeper understanding of ChatGPT’s capabilities and limitations, future research should investigate additional language pairs, translation directions (e.g., from L2 to L1), and a broader variety of text types, including literary works. For instance, it remains uncertain whether ChatGPT would be equally effective for MTI students translating from English to Chinese. Given that ChatGPT is predominantly trained in English, the availability of training data for lower-resource languages may be limited, potentially impacting its effectiveness in those scenarios.
