Abstract
Introduction
The transformer architecture (Vaswani et al., 2017) has proven to be an extremely effective method for pretraining language models, from bidirectional encoder representations from transformers (BERT; Devlin et al., 2019a) to the generative pretrained transformer (GPT; Brown et al., 2020). These models leverage the self-attention mechanism for the masked language modeling task, that is, predicting a word masked in context. However, this relatively simple procedure leads to rich contextual representations that can rival human performance. Nevertheless, despite their ability to learn implicit syntactic patterns, these models often struggle with explicit syntactic structures and phenomena (Bai et al., 2021; Rogers et al., 2020). This limitation is particularly significant in tasks such as neural machine translation (NMT), where syntactic accuracy is crucial for correctly interpreting and translating the structure and meaning of the source text. Linguistic research, on the other hand, has long focused on the detailed description and annotation of syntactic relations across languages. The Universal Dependencies framework (UD; Nivre et al., 2016) provides a standardized scheme for annotating syntactic dependencies, yielding richly annotated corpora that can be leveraged to improve NMT systems. Integrating explicit syntactic knowledge into NMT models has the potential to enhance translation quality by providing more structured and interpretable representations of language.
Neurosymbolic artificial intelligence (AI) aims to bridge the gap between symbolic reasoning and neural computation, thereby enabling more transparent, interpretable, and robust AI systems. Symbolic reasoning involves using explicit rules and structures to represent and manipulate knowledge, while neural networks excel at learning from large datasets and capturing complex patterns (Besold et al., 2021; Tilwani et al., 2024). Traditional sequential models, such as recurrent neural networks (RNNs) and transformers, although capable of processing and representing sentences, often fail to accurately capture complex syntactic structures and phenomena (Conneau et al., 2018; Egea Gómez et al., 2021; Peng et al., 2021). The advent of graph attention networks (GAT; Veličković et al., 2017) introduced a more explicit representation of syntactic structures and inter-word dependencies through graph topology, promising better readability and interpretability in natural language processing (NLP; Huang et al., 2020; Li et al., 2022).
Inspired by these developments, this study introduces NMT engines improved with syntactic knowledge via graph attention and BERT (SGB), where GAT provides a powerful mechanism for explicitly representing syntactic structures and inter-word dependencies, complementing the implicit knowledge captured by BERT. This approach aligns with the principles of neurosymbolic AI, which seeks to combine the strengths of symbolic reasoning (explicit syntactic graphs) with the robustness and scalability of neural networks (BERT and transformer models). By integrating syntactic data from source sentences with GATs and BERT, we aim to improve transformer-based NMT by incorporating syntax (every sentence yields a syntactic tree structure through the parser) and leveraging the capabilities of the pretrained BERT model. Utilizing multi-head attention mechanisms within the graph structure allows for the explicit exploitation of source-side syntactic dependencies, enhancing both the BERT embeddings on the source side and the effectiveness of the target-side decoder. The study conducts experiments on translation tasks from Chinese, German, and Russian to English to demonstrate the effectiveness of the proposed methodology across three typologically different languages. We also examine the interpretability of the proposed NMT engines in improving translation quality, such as better identification of certain syntactic structures in the source language, and whether GAT can effectively learn syntactic knowledge. This research fills the current gap in understanding how syntactic strategies impact machine translation (MT) quality. The main contributions of this study are summarized as follows:
The proposed SGB engines effectively demonstrate the potential and effectiveness of integrating BERT with syntactic knowledge derived from graph attention mechanisms in MT tasks. These engines can be efficiently fine-tuned to complete the training process without the need for pretraining from scratch. This study evaluates the translation quality of the proposed MT engines, focusing specifically on improvements in quality estimation (QE) scores. The results indicate that the SGB engines achieve enhanced QE scores across three MT directions, and paired t-tests confirm that these improvements are statistically significant.
This study reveals that while GATs possess the capability to learn syntactic knowledge, their sensitivity in the learning process is influenced by the multi-head attention mechanism and the number of model layers. Excessive model layers can even significantly impair a GAT's ability to learn dependency relations. Furthermore, there is a correlation between a GAT's mastery of syntactic dependencies and translation quality: syntactic structures better learned by the GAT enable the MT engine to more accurately recognize source language sentences with those structures, resulting in smoother and more accurate translations.
This study also investigates the interpretability of translation quality improvement through the lens of syntactic knowledge. The experiments demonstrate that a syntactic structure based on GAT enables more nuanced modeling of source language sentences by the lower and middle layers within BERT, thereby enhancing translation quality. While SGB engines enhanced with graph-based syntactic knowledge exhibit improved QE score distributions, the integration of BERT plays a crucial role in forming representations of source sentences. This research underscores the importance of accurate syntactic graphs for maintaining high-quality translations and highlights the limitations of current models in interpreting jumbled sentences. Furthermore, this study assesses the versatility of the proposed approach by integrating XLM-RoBERTa in place of BERT. Despite this substitution, the approach consistently improves translation quality across all evaluated MT directions, underscoring its broad applicability.
Related Studies
Pretrained Language Models
Pretrained models have significantly advanced NLP, particularly with the advent of transformer architectures, marking a paradigm shift in the field’s approach to understanding language (Devlin et al., 2019b; Liu et al., 2019). Among these innovations, BERT stands out by leveraging self-supervised learning on extensive corpora through the masked language model and next sentence prediction tasks. These techniques enable BERT to capture the essence of linguistic knowledge, enriching its understanding of language context and structure (Rogers et al., 2020). The empirical analysis and applications of BERT have also helped humans understand pretrained language models, supporting future improvements. Also, BERT has made significant contributions to MT tasks, where its contextual word embeddings and generic linguistic knowledge learned from pretraining enhance the generalization ability of MT engines, especially in cases with limited bilingual data. Most studies show that incorporating BERT improves the performance of MT engines, as demonstrated by metrics such as the bilingual evaluation understudy (BLEU) score (Imamura & Sumita, 2019; Yang et al., 2020; Zhu et al., 2020).
Syntactic Knowledge in Translation
In the realm of MT, the importance of syntactic dependency cannot be overstated. Syntactic dependency is crucial for the grammatical dissection of sentences, presenting them in easily interpretable tree diagrams. The incorporation of syntactic data into NMT systems provides substantial benefits, notably in clarifying sentence structure, facilitating more accurate context interpretation, and minimizing ambiguity. In recent years, the transformer model has garnered significant attention, and the strategy for incorporating explicit syntactic knowledge has shifted progressively from RNN-based methods to transformer-based ones (Currey & Heafield, 2019; McDonald & Chiang, 2021; Zhang et al., 2020). Within the transformer framework, a prevalent approach involves leveraging the self-attention mechanism to capture and represent syntactic information, enabling focused analysis on particular tokens. However, the efficacy of using the transformer’s attention mechanism as an explanatory tool remains a topic of debate (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019). Efforts have been made to enhance the effectiveness of downstream tasks by fusing explicit syntactic knowledge with BERT (Huang et al., 2020; Wang et al., 2020). However, the applications of such integration in MT have not been thoroughly explored.
Deep Learning for Graphs
In NLP tasks, representing sentences and words as linear sequences might compress or obscure crucial topological information, including tree-like syntactic structures. This loss of structure can present significant challenges for downstream tasks that depend on accurately capturing the nuanced features of source language sentences, such as speech recognition and MT. While there are many approaches for encoding graphs (Chen et al., 2025), graph neural networks offer a solution through a topological graph-based approach, enabling the construction of diverse linguistic graphs. These graphs transform various textual features into a network of nodes, edges, and overall graph structures. This method allows for a more nuanced analysis and inference of linguistic patterns within input sentences, significantly benefiting downstream tasks (Song et al., 2019; Yin et al., 2020). The GAT emerges as a novel solution within this space, adept at processing data in non-Euclidean domains. It utilizes attention mechanisms to dynamically assign importance to nodes, enhancing the model’s capacity to learn from graph-based representations. This capability, when combined with BERT, forms a robust framework for encapsulating linguistic knowledge in downstream NLP tasks (Chen et al., 2021; Huang et al., 2020; Zhou et al., 2022).
Methodology
Construction of the Proposed Engines
This section provides detailed descriptions of the individual layers within the engine. Figure 1 illustrates the comprehensive architecture of the proposed engines.

The architecture of the SGB engines. The encoder with BERT and GAT is on the left and the decoder on the right. Dashed lines indicate the alternative connections.
Given source sentence
The experiments include translations from three source languages into English: Chinese to English (Zh→En), Russian to English (Ru→En), and German to English (De→En).
By capturing the representation of each subword token through BERT, the final embedded sequence is accessible via the last layer of BERT.

The input sentence is parsed and then converted into a graph structure based on the parent-child connections in its syntactic dependencies.
To illustrate the working principle, consider the input sentence: “The new spending is fueled by Clinton’s large bank account.” This sentence is subsequently parsed to provide detailed linguistic information, such as part-of-speech (POS) tags, head node IDs, and syntactic dependency labels (DepRel). Source language sentences in Chinese, Russian, and German also follow the same parsing steps.
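The conversion from such a parse to the graph handed to GAT can be sketched as follows. The token table is an illustrative UD-style parse of the example sentence (the head IDs and DepRel values shown are assumptions for illustration, not reproduced from the paper), and each dependency contributes edges in both directions, matching the bidirectional treatment of parent and child nodes described later.

```python
# Illustrative UD-style parse of the example sentence, as CoNLL-U-like
# (id, form, head, deprel) tuples; head = 0 marks the root.
tokens = [
    (1, "The",      3, "det"),
    (2, "new",      3, "amod"),
    (3, "spending", 5, "nsubj:pass"),
    (4, "is",       5, "aux:pass"),
    (5, "fueled",   0, "root"),
    (6, "by",      11, "case"),
    (7, "Clinton", 11, "nmod:poss"),
    (8, "'s",       7, "case"),
    (9, "large",   11, "amod"),
    (10, "bank",   11, "compound"),
    (11, "account", 5, "obl"),
    (12, ".",       5, "punct"),
]

def to_bidirectional_edges(tokens):
    """Return 0-indexed (child, head) and (head, child) edge pairs."""
    edges = []
    for tid, _form, head, _rel in tokens:
        if head == 0:                      # the root has no incoming edge
            continue
        edges.append((tid - 1, head - 1))  # child -> head
        edges.append((head - 1, tid - 1))  # head -> child
    return edges

edges = to_bidirectional_edges(tokens)
```

Each of the 11 non-root tokens yields two directed edges, so the word-node graph for this sentence has 22 edges.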
Words and adjacency relations in a sentence can be represented as a graph structure, where the words (known as tokens in the model) serve as nodes and the syntactic dependencies between words are regarded as edges connecting nodes. We use GAT (Veličković et al., 2017) as the critical component to fuse the graph-structured information and node features. Each GAT layer takes the node features as input and updates every node by attending over its 1-hop neighbors.
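As a concrete illustration of this mechanism, a minimal single-head GAT layer can be written in NumPy. This is a sketch of the attention of Veličković et al. (2017), not the authors' implementation; the graph and all sizes are arbitrary.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(h, adj, W, a):
    """One single-head GAT layer.
    h: (n, f_in) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (f_in, f_out) shared projection; a: (2*f_out,) attention vector."""
    z = h @ W                                           # (n, f_out)
    f = z.shape[1]
    # attention logits e[i, j] = LeakyReLU(a^T [z_i || z_j])
    e = leaky_relu((z @ a[:f])[:, None] + (z @ a[f:])[None, :])
    e = np.where(adj > 0, e, -1e9)                      # only 1-hop neighbors + self
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)          # row-wise softmax
    return att @ z                                      # attention-weighted aggregation

rng = np.random.default_rng(0)
n, f_in, f_out = 4, 8, 5
# path graph 0-1-2-3 plus self-loops
adj = np.eye(n) + np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]])
h = rng.standard_normal((n, f_in))
out = gat_layer(h, adj,
                rng.standard_normal((f_in, f_out)),
                rng.standard_normal(2 * f_out))
```

Multi-head attention repeats this computation with independent `W` and `a` per head and concatenates (or averages) the results.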
Two methodologies for integrating syntactic knowledge into MT engines are introduced. The initial approach, termed syntactic knowledge via graph attention with BERT concatenation (SGBC), merges the syntactic information from graphs with BERT for the encoder’s operation, as detailed in equations (3) and (4).
The second, called syntactic knowledge via graph attention with BERT and decoder (SGBD), applies the syntactic knowledge on the graph not only to the encoder but also to guide the decoder through syntax-decoder attention, as shown in equations (5), (6), and (7).
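A plausible sketch of the SGBC-style concatenation is shown below; the exact equations are not reproduced here, and the dimensions and the single fusion projection are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

# Sketch of SGBC-style fusion (assumed form): per-token BERT states and
# GAT node states are concatenated along the feature axis and projected
# back to the model dimension before entering the encoder stack.
rng = np.random.default_rng(0)
n_tokens, d_bert, d_gat, d_model = 12, 768, 256, 768

h_bert = rng.standard_normal((n_tokens, d_bert))   # last-layer BERT states
h_gat = rng.standard_normal((n_tokens, d_gat))     # GAT node states
W_fuse = rng.standard_normal((d_bert + d_gat, d_model)) * 0.01  # assumed projection

fused = np.concatenate([h_bert, h_gat], axis=-1) @ W_fuse  # (n_tokens, d_model)
```

In the SGBD variant, the graph states would additionally serve as keys and values for an extra syntax-decoder attention sublayer, rather than being consumed only on the encoder side.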
In the domain of MT, there is an active search for accurate and reliable evaluation metrics. Among these metrics, BLEU (Papineni et al., 2001) has become a fundamental tool for evaluating the quality of text translated from one language to another. BLEU functions by comparing machine-generated translations to one or more reference translations, primarily focusing on the precision of matching n-grams between them.
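The modified n-gram precision at the core of BLEU can be made concrete with a compact sentence-level sketch. Production evaluation should use an established implementation such as sacreBLEU; smoothing is omitted here for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-gram tuples in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:          # any empty precision zeroes the geometric mean
        return 0.0
    # brevity penalty discourages translations shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the new spending is fueled by a large account".split(),
             "the new spending is fueled by the large bank account".split())
```

The clipping step is what makes the precision "modified": a candidate cannot be rewarded for repeating an n-gram more often than it appears in the reference.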
QE offers an innovative approach to translation assessment that does not require reference texts, by building models that directly predict whether the suggested translation is an accurate and fluent translation of the source text. This method is not only innovative but also practical, especially in contexts where reference translations are unavailable. QE engines can be trained to evaluate various aspects including fluency, adequacy, and even the predicted postediting effort, providing a comprehensive view of translation quality.
In this study, the evaluation of MT primarily employs two methods: the widely recognized BLEU score and reference-free QE.
Datasets
The parallel UD (PUD) corpus is a collection of multilingual datasets designed to facilitate cross-linguistic analysis and the development of MT engines. Comprising texts translated into 20 languages, each dataset within the PUD corpus contains 1,000 sentences that are syntactically annotated, ensuring a high level of linguistic consistency and quality across different languages. These sentences are selected from a wide range of sources, including news articles and Wikipedia, providing a diverse mix of genres and topics.
The experiments utilize three typologically different languages translated into English: PUD Chinese, PUD Russian, and PUD German. The choice of these languages is determined by the availability of the UD corpus for training an external syntactic parser and the PUD corpus for evaluating both the syntactic knowledge of BERT and GAT and the performance of the MT engine.
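The PUD treebanks are distributed in CoNLL-U format. A minimal reader for the fields this pipeline needs (token id, form, head id, and dependency label) might look as follows; this is a sketch with a toy excerpt, and the `conllu` library is the more robust choice in practice.

```python
# A tiny CoNLL-U excerpt (10 tab-separated columns per token line;
# "_" marks unused fields). Values are illustrative.
sample = "\n".join([
    "# text = The new spending ...",
    "\t".join(["1", "The", "the", "DET", "_", "_", "3", "det", "_", "_"]),
    "\t".join(["2", "new", "new", "ADJ", "_", "_", "3", "amod", "_", "_"]),
    "\t".join(["3", "spending", "spending", "NOUN", "_", "_", "5", "nsubj:pass", "_", "_"]),
    "",
])

def read_conllu(text):
    """Parse CoNLL-U text into sentences of {id, form, head, deprel} dicts."""
    sents, cur = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if cur:
                sents.append(cur)
                cur = []
            continue
        if line.startswith("#"):          # comment/metadata lines
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword ranges / empty nodes
            continue
        cur.append({"id": int(cols[0]), "form": cols[1],
                    "head": int(cols[6]), "deprel": cols[7]})
    if cur:
        sents.append(cur)
    return sents

sentences = read_conllu(sample)
```

The `head` and `deprel` fields read here are exactly what the graph-construction step consumes.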
What Happens to Translations
Translation Performance with BLEU and QE
The effectiveness of the proposed approach is evaluated by BLEU score on the UNPC corpus.
As shown in Table 2, the proposed engines consistently achieve higher BLEU scores than the baseline engine across all three translation directions, regardless of the training set size. This underscores the effectiveness and generalization capability of the proposed approach. In the table, bold values indicate the highest BLEU scores for each combination of training set size and language direction, while italic values highlight the scores of the baseline model. SGBC consistently outperforms both the baseline and SGBD. This can be attributed to the fact that the output of SGBC more closely aligns with the criteria used in the BLEU score calculation: it is likely to generate translations with a higher degree of n-gram overlap with the reference translations.
The Performance of SGB Engines Compared to Baseline Engines in BLEU Scores Across Three MT Directions With Varying Training Set Sizes. Despite the Reduced Dataset Size, SGB Engines Maintain Competitive BLEU scores.
Note. BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; BLEU = bilingual evaluation understudy; MT = machine translation; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder; Zh→En = Chinese to English; Ru→En = Russian to English; De→En = German to English.
Table 3 demonstrates that when the training set size reaches 1 million, both SGB series engines exhibit higher scores on the BLEU and COMET QE performance metrics. However, SGBC and SGBD exhibit notable differences in their performance across these metrics: SGBC achieves the highest BLEU scores in all three translation directions, while SGBD obtains the highest COMET and TransQuest QE scores. SGBD’s scores are generally at least two points higher than those of the baseline engines. These performance metrics reflect the engines’ proficiency in leveraging syntactic knowledge from graphs and fully utilizing BERT’s potential language capabilities, enabling them to generate more accurate translations. It is important to note that BLEU can be unreliable, and both BLEU and COMET QE depend on reference translations. In real-world translation scenarios, reference translations may not always be available, and the semantic diversity of acceptable output sentences cannot be reliably verified. Therefore, compared to BLEU and COMET QE scores, the TransQuest QE score offers a more nuanced advantage in adapting to reasonable variations in translation: it does not require reference translations, making it a more robust and practical metric for evaluating translation quality in diverse and realistic settings.
Performance Comparison of BLEU, COMET, and TransQuest Scores for Three Translation Directions (Zh→En, Ru→En, and De→En).
Based on the results of the above experiments, BLEU scores still fail to reflect linguistic subtleties and align with human evaluative criteria (Callison-Burch et al., 2006; Novikova et al., 2017). To address these shortcomings, we employ a gold-standard syntactically annotated corpus, the PUD corpus, together with the TransQuest QE model to further investigate changes in translation quality. The PUD corpus includes sentences from various sources, not limited to news and Wikipedia content, thus placing higher demands on the MT engines’ ability to summarize and clarify the structure of input sentences and ensuring a comprehensive evaluation of their handling of diverse linguistic structures and contexts. Its syntactic annotations additionally provide a gold-standard reference for a detailed analysis of how well the engines capture and translate syntactic dependencies. We utilize the PUD corpus (PUD Chinese, PUD Russian, and PUD German) to evaluate the translation quality of the Baseline and SGB engines across three translation directions. The QE model rates each source language sentence and its translation on a scale from 0 to 1, where higher scores indicate better translation quality. Paired t-tests are then applied to assess the statistical significance of the differences in QE scores between engines.
From Table 4, when comparing the Zh Baseline and SGBC engines, the average of the paired differences in QE scores is positive, indicating that the SGBC engine yields measurably better translations.
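The paired t-test procedure applied here can be sketched on per-sentence QE scores: each source sentence is scored once under the baseline and once under an SGB engine, and the test is run on the per-sentence differences. The scores below are made-up illustrative values, not results from the paper.

```python
import math

# Illustrative per-sentence QE scores (0-1 scale) for the same eight
# source sentences under two engines; values are invented for the sketch.
baseline = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52]
sgb      = [0.67, 0.58, 0.71, 0.55, 0.69, 0.64, 0.74, 0.57]

def paired_t(x, y):
    """t statistic of the paired t-test on differences d_i = y_i - x_i."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)                   # compare to t_{n-1}

t_stat = paired_t(baseline, sgb)   # positive: SGB scores higher on average
```

The resulting statistic is compared against the t distribution with n-1 degrees of freedom to obtain a p-value; in practice `scipy.stats.ttest_rel` does both steps.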
The Baseline and the SGB Engines Compare the Translations of the PUD Corpus, Scored by the QE Model and Subjected to Paired t-Tests to Demonstrate the Differences in Translation Quality Scores.
Note. BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; PUD = parallel universal dependencies; QE = quality estimation; Zh = Chinese; Ru = Russian; De = German; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder.
Comparable outcomes are evident for Ru and De: once the proposed methodologies are applied, translation quality diverges significantly from the baseline, as gauged by QE scores. The incorporation of syntactic knowledge via graph representations alongside BERT substantially enhances the translation efficacy of the MT engines. Notably, the SGBD engines consistently achieve elevated QE scores, indicating a robust improvement in translation quality. Conversely, while the SGBC engines are favored by BLEU, achieving higher scores under that metric, the QE scores highlight a different aspect of translation quality, underscoring the more nuanced and comprehensive analysis provided by QE over BLEU. This divergence underscores the complexity of translation quality evaluation, revealing how different metrics prioritize different aspects of translation performance.
Multiple dependency relations signify the structural attributes of a given sentence. To identify which dependency relation in the source language sentence from the PUD corpus contributes most to the enhancement of translation quality through translation engines, we retain and categorize sentences based on their dependency relations. Specifically, both the baseline engine and the two proposed SGB engines translate their own source language sentences from the PUD corpus. The translations are then ranked according to their TransQuest QE scores. The bottom 30% of translations, based on TransQuest QE scores, are considered low-quality translations. Source language sentences corresponding to these low-quality translations and containing the same dependency relation are grouped together. For example, for a given dependency relation, all such sentences containing that relation form one group, and the change in their QE scores under the SGB engines is then examined.
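The grouping procedure just described can be sketched as follows; the QE scores and per-sentence relation sets are illustrative stand-ins, not data from the paper.

```python
# Each record pairs a translation's QE score with the set of dependency
# relations present in its source sentence (toy values).
translations = [
    {"qe": 0.81, "deprels": {"nsubj", "obj", "punct"}},
    {"qe": 0.42, "deprels": {"flat", "nsubj", "punct"}},
    {"qe": 0.74, "deprels": {"amod", "case", "punct"}},
    {"qe": 0.38, "deprels": {"csubj", "mark", "punct"}},
    {"qe": 0.66, "deprels": {"cc", "conj", "punct"}},
    {"qe": 0.90, "deprels": {"det", "root", "punct"}},
    {"qe": 0.29, "deprels": {"orphan", "punct"}},
    {"qe": 0.55, "deprels": {"advmod", "punct"}},
    {"qe": 0.71, "deprels": {"obl", "punct"}},
    {"qe": 0.63, "deprels": {"xcomp", "punct"}},
]

# Rank by QE score and keep the bottom 30% as low-quality translations.
ranked = sorted(translations, key=lambda t: t["qe"])
low = ranked[: int(len(ranked) * 0.3)]

# Bucket the low-quality items by the dependency relations they contain.
by_relation = {}
for t in low:
    for rel in t["deprels"]:
        by_relation.setdefault(rel, []).append(t)
```

Re-scoring each bucket under the SGB engines then yields the per-relation quality changes reported as "Qual" in Table 5.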
Table 5 details how SGB engines outperform the baseline engines in accurately identifying syntactic relations within source language sentences, thereby markedly improving translation quality. It particularly emphasizes the top five syntactic relations that contribute to this improvement. Although both SGBC and SGBD engines incorporate graph-based syntactic knowledge, their approaches to learning dependency relations diverge. For instance, the “flat” (flat structure) relation in Zh is markedly significant in the SGBC engine yet receives less emphasis in the SGBD engine. Although SGBD’s decoder is similarly guided by syntactic knowledge derived from graph representations, it does not uniformly excel across all syntactic relations in achieving a higher QE score than the SGBC engine. Specifically, in Zh, Ru, and De, the SGBC model outperforms SGBD in handling certain syntactic relations, including “discourse:sp” (discourse marker: speech), “orphan” (orphan), and “csubj” (clausal subject). This discrepancy may suggest that an overly focused reliance on syntactic knowledge could lead to knowledge redundancy, detrimentally affecting translation quality in the SGBD engine. Conversely, the importance of some syntactic relations remains consistent across both SGBC and SGBD engines, underscoring that the integration of syntactic knowledge via graph attention alongside BERT enables the MT engine to more precisely address specific common relations. This consistency, irrespective of the methodological differences between the two engines, indicates that leveraging graph-based syntactic knowledge in conjunction with BERT enhances the MT engine’s ability to explicitly navigate certain syntactic structures, thus contributing to the refinement of translation quality.
The Top-5 Dependency Relations Identified by the SGB Engines Are Those That Show the Greatest Improvement in QE Scores. These Relations Highlight Which Syntactic Dependencies Are Most Effectively Detected and Contribute Most Significantly to the Enhancement of Translation Quality in Each Translation Direction. “Qual” Denotes the Percentage Increase in QE Scores for Sentences Containing Such a Syntactic Structure.
Note . BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; Zh = Chinese; Ru = Russian; De = German; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder; QE = quality estimation.
Syntactic Knowledge in GAT
GATs have the capability to represent syntactic structures in sentences using graph-based models. However, whether this capability signifies their ability to effectively learn syntactic knowledge remains an open question. To address this, we design a syntactic dependency prediction experiment where GATs are tasked with predicting the relevant syntactic labels in the syntactic structure. For this experiment, we utilize the PUD corpus, which provides gold-standard syntactic annotations, as our foundational dataset. The experimental process involves converting the syntactic annotations and sentence words into syntactic trees, which are subsequently transformed into graph structures for GAT analysis. In these graph structures, each word is represented as a node, and the edges represent the syntactic dependency connections as defined by the PUD corpus. The primary objective of the GAT is to infer the dependency relations for each word by integrating information from both nodes and edges. Unlike traditional syntactic dependency models, which often follow a unidirectional flow from parent to child nodes, this approach treats dependencies as bidirectional graphs. This bidirectional model acknowledges the mutual influence between parent and child nodes, which is crucial for GATs to understand the varying implications of node connections. By considering these bidirectional relationships, GATs can enhance their ability to accurately identify dependency relations among nodes, thereby improving their syntactic learning capabilities.
Similar to the transformer model, GAT utilizes multi-head attention and layers stacked upon each other. The study initially explores how the number of multi-head attention heads and layers influences GATs’ acquisition of syntactic knowledge, examining the advantages these configurations offer for learning syntactic dependencies. In the experiments, the attention head counts (Heads) tested for GATs are 2, 4, 6, and 8, while the layer counts (L) explored are 2, 3, 4, 5, and 6. For each language, datasets are divided into training, validation, and test sets with 800, 100, and 100 sentences, respectively, to tune hyperparameters, monitor model performance during training to prevent overfitting and evaluate the model on unseen data. The model parameters are set with a learning rate of
Table 6 emphasizes the critical importance of judiciously configuring the number of attention heads and layers in GAT, as this configuration significantly influences the model’s sensitivity to accurately learn syntactic knowledge. In the table, bold values indicate the highest performance metrics for each combination of language and number of layers. For example, the Russian language experiment reveals that a GAT setup with two layers and four attention heads outperforms a configuration with eight attention heads in terms of overall prediction efficacy. As the model is expanded to four layers, a higher number of attention heads enhances performance, with the F1-score increasing from 0.44 to 0.57. Conversely, increasing the number of layers tends to degrade the model’s ability to accurately predict dependency relations. Specifically, a configuration with two layers outperforms one with six layers, regardless of the number of attention heads. This decline suggests that an increase in GAT layers might lead to performance degradation, potentially due to nodes losing their specific attributes or incorporating irrelevant information during the aggregation process.
GAT Performance in Syntactic Dependency Prediction for Three Languages With Different Numbers of Attention Heads and Layers. The Number of Attention Heads Increases Incrementally From 2 to 8, and the Number of Model Layers Increases From 2 to 6.
Note . GAT = graph attention network; Zh = Chinese; Ru = Russian; De = German.
When examining the prediction scores for individual dependency relations across the three languages, the results further validate this observation. As shown in Table 7, when the number of layers exceeds 3, the F1-scores for some syntactic relations tend to decrease and even drop to 0 as the number of layers increases. Increasing the number of attention heads does little to mitigate this degradation. Bold values in the table indicate the highest F1-scores for each syntactic relation across different configurations of layers and heads. However, certain syntactic tags remain unaffected by this trend. Regardless of the number of layers, GAT consistently learns and maintains high F1-scores for tags such as “advmod” (adverbial modifier), “case” (case marking), “cc” (coordinating conjunction), “mark” (marker), “nsubj” (nominal subject), and “punct” (punctuation). This indicates that GAT exhibits a high sensitivity and reliable capture of these specific syntactic features.
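The per-relation scores discussed above reduce to label-wise F1 over predicted versus gold dependency labels. A self-contained sketch (with toy label sequences, not the paper's data) makes the computation explicit; `sklearn.metrics.classification_report` provides the same breakdown in practice.

```python
# Toy gold/predicted dependency labels for eight tokens.
gold = ["nsubj", "case", "punct", "obl", "case", "advmod", "punct", "obl"]
pred = ["nsubj", "case", "punct", "obl", "obl",  "advmod", "punct", "case"]

def f1_per_label(gold, pred):
    """Per-label F1 = 2*TP / (2*TP + FP + FN) over aligned sequences."""
    scores = {}
    for lab in set(gold) | set(pred):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores

scores = f1_per_label(gold, pred)   # e.g., "punct" is predicted perfectly
```

A label whose F1 drops to zero, as observed for some relations in deeper GAT configurations, is one with no true positives at all.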
The Prediction of Syntactic Dependencies for Three Languages Is Conducted Using Different Numbers of Attention Heads and Layers. As the Number of Layers Increases, the Performance of the GAT in Predicting Dependency Labels Declines, and It Gradually Loses the Ability to Learn Certain Dependency Labels, Resulting in the F1-Scores Dropping to Zero. However, Some Dependency Relations Remain Unaffected and Continue to Achieve Relatively High Prediction Scores.
We continue to compare the F1-scores of GAT’s dependency relation predictions with the QE scores of the SGB engines when processing prior low-quality translations containing these specific dependency relations (from Section 4.3), as shown in Table 8. It highlights the top-10 dependency relations with the highest prediction scores by GAT across various source language sentences, along with the corresponding changes in translation quality facilitated by different MT engines. The results demonstrate a clear positive correlation between GAT’s syntactic dependency prediction scores and the improvement in translation quality, especially when using the SGBC and SGBD engines. For Zh, dependency relations such as “mark” (marker), “cc” (coordinating conjunction), and “conj” (conjunct) have very high prediction scores by GAT (0.986, 0.984, and 0.970, respectively). These high scores correlate with significant improvements in translation quality, as evidenced by the higher QE scores of the SGBC and SGBD models compared to the baseline. Similarly, for Ru, dependency relations such as “det” (determiner), “root” (root), and “amod” (adjectival modifier) have high prediction scores (0.990, 0.987, and 0.982, respectively), leading to notable improvements in translation quality. For De, dependency relations such as “case” (case marking), “cc” (coordinating conjunction), and “det” (determiner) also exhibit high prediction scores (0.992, 0.987, and 0.987, respectively), resulting in improved translation quality. The positive correlation between GAT’s prediction scores and translation quality is consistent across the three languages, suggesting that GAT’s ability to accurately predict syntactic dependencies is a robust indicator of its potential to enhance translation quality. This underscores the importance of integrating syntactic information into MT systems to achieve more accurate and reliable translations. 
Also, the consistent improvement in translation quality across different languages and MT engines demonstrates the robustness of GAT in learning and applying graph-based syntactic structures.
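As a sanity check on the correlation claim above, the sketch below computes a Pearson coefficient between per-relation F1-scores and QE gains. The helper and the paired numbers are illustrative placeholders, not values taken from Table 8.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative placeholders: per-relation GAT F1 vs. QE gain over the baseline.
f1_scores = [0.986, 0.984, 0.970, 0.912, 0.875]
qe_gains = [0.062, 0.058, 0.051, 0.034, 0.029]
r = pearson(f1_scores, qe_gains)  # strongly positive for these placeholders
```

A coefficient near 1 would reflect the positive trend described above; on real data one would pair each relation's F1-score with the measured QE change per engine.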
Top-10 Dependency Relations With the Highest GAT F1-Score Across Various Source Language Sentences, Alongside Corresponding Changes in Translation Quality as Measured by QE Scores From Different MT Engines.
Representational Similarity Analysis
Representational similarity analysis (RSA) is a technique used to analyze the similarity between different representation spaces of neural networks. Inspired by the work of Merchant et al. (2020), RSA is used here to measure, layer by layer, how much BERT’s sentence representations change between the baseline and SGB models.
Table 9 lists partial results from an RSA comparing baseline BERT and SGB models based on syntactic prediction scores by GAT (full results are provided in Appendix 8). The analysis shows that the lowest RSA scores mainly occur in the lower and middle layers of BERT, regardless of whether the model is used in the SGBC or SGBD engine. Specifically, when GAT achieves high F1-scores for a particular dependency relation, the representations of sentences containing this relation typically undergo significant changes in the lower and middle layers of BERT. These changes are most pronounced in layers 3–5 for Chinese and Russian and in layers 5–8 for German. This suggests that the syntactic structure represented through graphs influences BERT’s reanalysis of input sentences, leading to a syntactic reconstruction of the input sentence. Also, the lower and middle layers of BERT are particularly sensitive to modifications in modeling both shallow and deep syntactic structures. In contrast, layers 9–12 are primarily involved in processing abstract semantic information and are task-oriented. However, the RSA scores in these layers do not consistently reach 0.8 or higher (see detailed results in Appendix 8), indicating that changes in the syntactic representation in the lower layers can also affect the processing of deep linguistic information in the upper layers. These findings further explain why integrating syntactic structures represented through graphs can help BERT reconstruct the structure of input sentences, leading to a more accurate representation of source language sentences and, consequently, improved translation quality.
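For readers who want the mechanics, a minimal RSA computation looks as follows: build a representational dissimilarity matrix (RDM) of pairwise cosine distances per model, then correlate the upper triangles of the two RDMs. Random vectors stand in for real BERT layer activations here; this is a sketch of the general technique, not the paper’s exact pipeline.

```python
import numpy as np

def rdm(reps):
    """Representational dissimilarity matrix: 1 - cosine similarity
    between every pair of sentence representations (n x d)."""
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def rsa_score(reps_a, reps_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(reps_a.shape[0], k=1)
    return np.corrcoef(rdm(reps_a)[iu], rdm(reps_b)[iu])[0, 1]

rng = np.random.default_rng(0)
layer_base = rng.normal(size=(20, 768))                     # stand-in: baseline BERT layer
layer_sgb = layer_base + 0.1 * rng.normal(size=(20, 768))   # stand-in: lightly changed layer
score = rsa_score(layer_base, layer_sgb)  # high score = similar representation geometry
```

A low RSA score for a given layer, as reported for BERT’s lower and middle layers above, indicates that integrating the syntactic graphs substantially reorganized that layer’s representation space.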
Top-5 Syntactic Labels With the Highest F1-Scores for GAT Predictions for Each Language, Along With the BERT Layers Where the Lowest RSA Scores are Observed.
Note . GAT = graph attention network; RSA = representational similarity analysis; BERT = bidirectional encoder representations from transformers; SGBD = syntactic knowledge via graph attention with BERT and decoder.
RSA scores for representations from the baseline and SGBD models for comparison.
The impact of BERT and graph-based syntactic knowledge on enhancing translation quality presents an area for further investigation, particularly concerning the robustness of syntactic knowledge. This raises questions about the relative contributions of BERT versus graph-based syntactic knowledge to translation quality and the potential limitations of the proposed MT engines. To address these questions, the study involves altering the word order in source language sentences from each language in the PUD corpus. For example, the sentence “A B C D E F” is transformed into a randomized sequence like “C B A D F E.” Both the baseline and SGB engines are then tasked with translating these modified sentences. The translations are subsequently reassessed by the TransQuest QE model, which compares the translations of the shuffled sentences against those of the original, orderly sentences. This comparison provides insights into the adaptability and efficacy of syntactic knowledge in translation.
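The shuffling step itself is simple; a sketch follows, where the fixed seed and whitespace tokenization are assumptions for illustration (a real pipeline would shuffle the tokens produced by the corpus tokenizer).

```python
import random

def scramble(sentence, seed=42):
    """Randomly permute the word order of a whitespace-tokenized sentence.

    A fixed seed keeps the perturbation reproducible across engines, so the
    baseline and SGB models translate the same scrambled input.
    """
    words = sentence.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

original = "A B C D E F"
shuffled = scramble(original)  # same words, randomized order
```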
To further validate the importance of accurate syntactic knowledge in enhancing the performance of the proposed MT engines, we conduct an additional experiment in which we intentionally introduce incorrect syntactic graphs. In this experiment, we replace the parsers for Chinese, Russian, and German with an English parser to extract the syntactic structures of these three source languages. The deliberately incorrect syntactic graphs are then fed to the SGBC and SGBD engines. The goal is to observe how the performance of these models is affected when provided with inaccurate syntactic information.
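However the parse is obtained, the dependency edges reach a GAT as an adjacency matrix over the same tokens, so a mismatched parser simply produces a different matrix for identical input. A minimal construction, with hypothetical head indices standing in for real parser output:

```python
import numpy as np

def dep_adjacency(heads, self_loops=True):
    """Build a symmetric adjacency matrix from 1-based head indices,
    one per token; head 0 marks the root and contributes no edge."""
    n = len(heads)
    adj = np.zeros((n, n), dtype=float)
    for dep, head in enumerate(heads):
        if head > 0:  # skip the root's virtual head
            adj[dep, head - 1] = adj[head - 1, dep] = 1.0
    if self_loops:
        adj += np.eye(n)  # GAT variants commonly add self-loops
    return adj

# Hypothetical heads for a 3-token sentence from a correct parse ...
correct = dep_adjacency([2, 3, 0])
# ... versus heads from a mismatched (e.g., wrong-language) parser.
wrong = dep_adjacency([3, 1, 2])
```

Because attention in a GAT is masked by this matrix, a wrong parse silently redirects which token pairs may attend to each other, which is the mechanism behind the degradation measured below.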
As shown in Figure 3, scrambled word sequences in source sentences cause a significant decrease in translation quality for both the baseline and SGB engines across all MT directions. Integrating GAT into the encoder or providing explicit syntactic knowledge to the decoder does not guarantee a substantial improvement in translation quality: neither intervention can realistically be expected to raise the median QE scores in the box plots from below 0.4 back to around 0.7. This finding suggests that BERT plays the more crucial role in forming representations of source sentences and shaping translation quality in this hybrid approach. The scrambling of input sentence order, which destroys syntactic information, shows that while the SGB engines, aided by graph-based syntactic knowledge, can mitigate some of the negative effects, they still cannot interpret and comprehend the semantics of jumbled sentences as effectively as humans.

The box plot distribution of QE scores for translations in three MT directions, contrasting translations from ordered (above) versus disordered (below) source language sentence arrangements.
Table 10 provides a detailed comparison of QE scores for the SGBC and SGBD models when using correct versus incorrect syntactic graphs. In all translation directions, the introduction of incorrect syntactic graphs results in a significant decrease in QE scores for both the SGBC and SGBD models, with reductions exceeding 15% in all cases. The largest decrease in QE scores is observed for the Zh→En direction.
Comparison of QE Scores With Correct and Incorrect Syntactic Graphs for SGBC and SGBD Engines and the Percentage Decrease in QE Scores.
These findings highlight that accurate syntactic graphs are not only beneficial but essential for maintaining high-quality translations, as inaccuracies in these graphs significantly affect the performance of MT systems. However, the performance degradation is not as severe as when input sentences are randomized. This further suggests that in the SGB models, BERT plays a dominant role, and while incorrect syntactic graphs do harm performance, the impact is more severe when the input errors are so significant that even BERT cannot effectively process them.
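The percentage decreases reported above follow the usual relative-change formula; a small helper with illustrative values (not figures from Table 10):

```python
def qe_drop_pct(correct_qe, incorrect_qe):
    """Percentage decrease in QE score when syntactic graphs are wrong."""
    return 100.0 * (correct_qe - incorrect_qe) / correct_qe

# Illustrative: a QE score falling from 0.60 to 0.48 is a 20% reduction,
# above the >15% threshold reported for all directions.
drop = qe_drop_pct(0.60, 0.48)
```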
The central focus of this investigation is to determine whether the proposed use of syntactic knowledge on graphs continues to benefit alternative pretrained models, thereby further improving translation quality. XLM-Roberta-large (Conneau et al., 2020) replaces BERT in all three MT scenarios. To distinguish them from the earlier versions, MT engines incorporating XLM-Roberta-large are labeled Baseline-X, SGBC-X, and SGBD-X. The same three translation directions (Zh→En, Ru→En, and De→En) are evaluated as before.
Table 11 demonstrates that both SGB engines consistently achieve higher BLEU scores than Baseline-X across various MT directions, with the SGBD-X engine surpassing the SGBC-X engine in every scenario. Bold values indicate the highest BLEU scores for each translation direction. Furthermore, Figure 4 illustrates the QE scores for translations within the PUD corpus for each engine. Baseline-X yields the highest number of translations with QE scores in the 0.2, 0.3, and 0.4 intervals along the horizontal axis, that is, the largest share of low-quality translations among the engines.

Distribution of QE scores for the MT engines after replacing BERT with XLM-Roberta-large.
BLEU Scores in Different MT Directions for the MT Engines That Replaced BERT With XLM-Roberta-Large.
The demonstrated efficacy of our method with XLM-Roberta indicates its applicability beyond a single pretrained model, extending to encoder-based pretrained models in general. This suggests that our approach is not confined to a specific architecture. However, adapting our method to other pretrained models, such as GPT or T5, presents distinct challenges. These models are primarily decoder-based and sequence-to-sequence models, respectively, which differ significantly from the encoder-based architecture of XLM-Roberta. Integrating syntactic knowledge into these models may necessitate alternative strategies, such as modifying the input format or adjusting the attention mechanisms. Despite these challenges, the potential benefits of incorporating syntactic knowledge into a broader range of pretrained models are substantial, as it can lead to more accurate and contextually appropriate translations. Future research will explore these adaptations to further enhance the robustness and applicability of our method.
This study explores the integration of syntactic knowledge into MT, particularly focusing on the evaluation of BERT and GAT. Two SGB engines are introduced for translating from Chinese to English (Zh→En), Russian to English (Ru→En), and German to English (De→En).
