Abstract
Introduction
The transformer architecture (Vaswani et al., 2017) has proven to be an extremely effective method for pretraining language models, from bidirectional encoder representations from transformers (BERT; Devlin et al., 2019a) to the generative pretrained transformer (GPT; Brown et al., 2020). These models leverage the self-attention mechanism for the masked language modeling task, that is, predicting a word masked in context. However, this relatively simple procedure leads to rich contextual representations that can rival human performance. Nevertheless, despite their ability to learn implicit syntactic patterns, these models often struggle with explicit syntactic structures and phenomena (Bai et al., 2021; Rogers et al., 2020). This limitation is particularly significant in tasks such as neural machine translation (NMT), where syntactic accuracy is crucial for correctly interpreting and translating the structure and meaning of the source text. Linguistic research, on the other hand, has long focused on the detailed description and annotation of syntactic relations across languages. The Universal Dependencies framework (UD; Nivre et al., 2016) provides a standardized scheme for annotating syntactic dependencies, yielding richly annotated corpora that can be leveraged to improve NMT systems. Integrating explicit syntactic knowledge into NMT models has the potential to enhance translation quality by providing more structured and interpretable representations of language.
Neurosymbolic artificial intelligence (AI) aims to bridge the gap between symbolic reasoning and neural computation, thereby enabling more transparent, interpretable, and robust AI systems. Symbolic reasoning involves using explicit rules and structures to represent and manipulate knowledge, while neural networks excel at learning from large datasets and capturing complex patterns (Besold et al., 2021; Tilwani et al., 2024). Traditional sequential models, such as recurrent neural networks (RNNs) and transformers, although capable of processing and representing sentences, often fail to accurately capture complex syntactic structures and phenomena (Conneau et al., 2018; Egea Gómez et al., 2021; Peng et al., 2021). The advent of graph attention networks (GAT; Veličković et al., 2017) introduced a more explicit representation of syntactic structures and inter-word dependencies through graph topology, promising better readability and interpretability in natural language processing (NLP; Huang et al., 2020; Li et al., 2022).
Inspired by these developments, this study introduces NMT engines improved with syntactic knowledge via graph attention and BERT (SGB), where GAT provides a powerful mechanism for explicitly representing syntactic structures and inter-word dependencies, complementing the implicit knowledge captured by BERT. This approach aligns with the principles of neurosymbolic AI, which seeks to combine the strengths of symbolic reasoning (explicit syntactic graphs) with the robustness and scalability of neural networks (BERT and transformer models). By integrating syntactic data from source sentences with GATs and BERT, we aim to improve transformer-based NMT by incorporating syntax (every sentence yields a syntactic tree structure through the parser) and leveraging the capabilities of the pretrained BERT model. Utilizing multi-head attention mechanisms within the graph structure allows for the explicit exploitation of source-side syntactic dependencies, enhancing both the BERT embeddings on the source side and the effectiveness of the target-side decoder. The study conducts experiments on translation tasks from Chinese, German, and Russian to English to demonstrate the effectiveness of the proposed methodology across three typologically different languages. We also examine the interpretability of the proposed NMT engines in improving translation quality, such as better identification of certain syntactic structures in the source language, and whether GAT can effectively learn syntactic knowledge. This research fills the current gap in understanding how syntactic strategies impact machine translation (MT) quality. The main contributions of this study are summarized as follows:
The proposed SGB engines effectively demonstrate the potential and effectiveness of integrating BERT with syntactic knowledge derived from graph attention mechanisms in MT tasks. These engines can be efficiently fine-tuned to complete the training process without the need for pretraining from scratch. This study evaluates the translation quality of the proposed MT engines, focusing specifically on improvements in quality estimation (QE) scores. The results indicate that the SGB engines achieve enhanced QE scores across three MT directions, and paired t-tests confirm that these improvements are statistically significant.
This study reveals that while GATs possess the capability to learn syntactic knowledge, their sensitivity in the learning process is influenced by the multi-head attention mechanism and the number of model layers. Excessive model layers can even significantly impair a GAT's ability to learn dependency relations. Furthermore, there is a correlation between a GAT's mastery of syntactic dependencies and translation quality: syntactic structures better learned by the GAT enable the MT engine to more accurately recognize source language sentences with those structures, resulting in smoother and more accurate translations.
This study also investigates the interpretability of translation quality improvement through the lens of syntactic knowledge. The experiments demonstrate that a syntactic structure based on GAT enables more nuanced modeling of source language sentences by the lower and middle layers within BERT, thereby enhancing translation quality. While SGB engines enhanced with graph-based syntactic knowledge exhibit improved QE score distributions, the integration of BERT plays a crucial role in forming representations of source sentences. This research underscores the importance of accurate syntactic graphs for maintaining high-quality translations and highlights the limitations of current models in interpreting jumbled sentences. Furthermore, this study assesses the versatility of the proposed approach by integrating XLM-RoBERTa in place of BERT. Despite this substitution, the approach consistently improves translation quality across all evaluated MT directions, underscoring its broad applicability.
Related Studies
Pretrained Language Models
Pretrained models have significantly advanced NLP, particularly with the advent of transformer architectures, marking a paradigm shift in the field’s approach to understanding language (Devlin et al., 2019b; Liu et al., 2019). Among these innovations, BERT stands out by leveraging self-supervised learning on extensive corpora through the masked language model and next sentence prediction tasks. These techniques enable BERT to capture the essence of linguistic knowledge, enriching its understanding of language context and structure (Rogers et al., 2020). The empirical analysis and applications of BERT have also helped humans understand pretrained language models, supporting future improvements. Also, BERT has made significant contributions to MT tasks, where its contextual word embeddings and generic linguistic knowledge learned from pretraining enhance the generalization ability of MT engines, especially in cases with limited bilingual data. Most studies show that incorporating BERT improves the performance of MT engines, as demonstrated by metrics such as the bilingual evaluation understudy (BLEU) score (Imamura & Sumita, 2019; Yang et al., 2020; Zhu et al., 2020).
Syntactic Knowledge in Translation
In the realm of MT, the importance of syntactic dependency cannot be overstated. Syntactic dependency is crucial for the grammatical dissection of sentences, presenting them in easily interpretable tree diagrams. The incorporation of syntactic data into NMT systems provides substantial benefits, notably in clarifying sentence structure, facilitating more accurate context interpretation, and minimizing ambiguity. In recent years, the transformer model has garnered significant attention, and the strategy for incorporating explicit syntactic knowledge has shifted progressively from RNN-based methods to transformer-based ones (Currey & Heafield, 2019; McDonald & Chiang, 2021; Zhang et al., 2020). Within the transformer framework, a prevalent approach involves leveraging the self-attention mechanism to capture and represent syntactic information, enabling focused analysis on particular tokens. However, the efficacy of using the transformer’s attention mechanism as an explanatory tool remains a topic of debate (Jain & Wallace, 2019; Wiegreffe & Pinter, 2019). Efforts have been made to enhance the effectiveness of downstream tasks by fusing explicit syntactic knowledge with BERT (Huang et al., 2020; Wang et al., 2020). However, the applications of such integration in MT have not been thoroughly explored.
Deep Learning for Graphs
In NLP tasks, representing sentences and words as linear sequences might compress or obscure crucial topological information, including tree-like syntactic structures. This loss of structure can present significant challenges for downstream tasks that depend on accurately capturing the nuanced features of source language sentences, such as speech recognition and MT. While there are many approaches for encoding graphs (Chen et al., 2025), graph neural networks offer a solution through a topological graph-based approach, enabling the construction of diverse linguistic graphs. These graphs transform various textual features into a network of nodes, edges, and overall graph structures. This method allows for a more nuanced analysis and inference of linguistic patterns within input sentences, significantly benefiting downstream tasks (Song et al., 2019; Yin et al., 2020). The GAT emerges as a novel solution within this space, adept at processing data in non-Euclidean domains. It utilizes attention mechanisms to dynamically assign importance to nodes, enhancing the model’s capacity to learn from graph-based representations. This capability, when combined with BERT, forms a robust framework for encapsulating linguistic knowledge in downstream NLP tasks (Chen et al., 2021; Huang et al., 2020; Zhou et al., 2022).
Methodology
Construction of the Proposed Engines
This section provides detailed descriptions of the individual layers within the engine. Figure 1 illustrates the comprehensive architecture of the proposed engines.

The architecture of the SGB engines. The encoder with BERT and GAT is on the left and the decoder on the right. Dashed lines indicate the alternative connections.
Given source sentence
The experiments include translations from three source languages into English: Chinese to English (Zh→En), Russian to English (Ru→En), and German to English (De→En).
By capturing the representation of each subword token through BERT, the final embedded sequence is accessible via the last layer of BERT.

The input sentence is parsed and then converted into a graph structure based on the parent-child connections in its syntactic dependencies.
To illustrate the working principle, consider the input sentence: “The new spending is fueled by Clinton’s large bank account.” This sentence is subsequently parsed to provide detailed linguistic information, such as part-of-speech (POS) tags, head node IDs, and syntactic dependency labels (DepRel). Source language sentences in Chinese, Russian, and German also follow the same parsing steps.
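The conversion from such a parse to the graph handed to GAT can be sketched as follows. The token table is an illustrative UD-style parse of the example sentence (the head IDs and DepRel values shown are assumptions for illustration, not reproduced from the paper), and each dependency contributes edges in both directions, matching the bidirectional treatment of parent and child nodes described later.

```python
# Illustrative UD-style parse of the example sentence, as CoNLL-U-like
# (id, form, head, deprel) tuples; head = 0 marks the root.
tokens = [
    (1, "The",      3, "det"),
    (2, "new",      3, "amod"),
    (3, "spending", 5, "nsubj:pass"),
    (4, "is",       5, "aux:pass"),
    (5, "fueled",   0, "root"),
    (6, "by",      11, "case"),
    (7, "Clinton", 11, "nmod:poss"),
    (8, "'s",       7, "case"),
    (9, "large",   11, "amod"),
    (10, "bank",   11, "compound"),
    (11, "account", 5, "obl"),
    (12, ".",       5, "punct"),
]

def to_bidirectional_edges(tokens):
    """Return 0-indexed (child, head) and (head, child) edge pairs."""
    edges = []
    for tid, _form, head, _rel in tokens:
        if head == 0:                      # the root has no incoming edge
            continue
        edges.append((tid - 1, head - 1))  # child -> head
        edges.append((head - 1, tid - 1))  # head -> child
    return edges

edges = to_bidirectional_edges(tokens)
```

Each of the 11 non-root tokens yields two directed edges, so the word-node graph for this sentence has 22 edges.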
Words and adjacency relations in a sentence can be represented as a graph structure, where the words (known as tokens in the model) serve as nodes and the syntactic dependencies between words are regarded as edges connecting nodes. We use GAT (Veličković et al., 2017) as the critical component to fuse the graph-structured information and node features. Each GAT layer takes the node features as input and updates every node by attending over its 1-hop neighbors.
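As a concrete illustration of this mechanism, a minimal single-head GAT layer can be written in NumPy. This is a sketch of the attention of Veličković et al. (2017), not the authors' implementation; the graph and all sizes are arbitrary.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def gat_layer(h, adj, W, a):
    """One single-head GAT layer.
    h: (n, f_in) node features; adj: (n, n) 0/1 adjacency with self-loops;
    W: (f_in, f_out) shared projection; a: (2*f_out,) attention vector."""
    z = h @ W                                           # (n, f_out)
    f = z.shape[1]
    # attention logits e[i, j] = LeakyReLU(a^T [z_i || z_j])
    e = leaky_relu((z @ a[:f])[:, None] + (z @ a[f:])[None, :])
    e = np.where(adj > 0, e, -1e9)                      # only 1-hop neighbors + self
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)          # row-wise softmax
    return att @ z                                      # attention-weighted aggregation

rng = np.random.default_rng(0)
n, f_in, f_out = 4, 8, 5
# path graph 0-1-2-3 plus self-loops
adj = np.eye(n) + np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]])
h = rng.standard_normal((n, f_in))
out = gat_layer(h, adj,
                rng.standard_normal((f_in, f_out)),
                rng.standard_normal(2 * f_out))
```

Multi-head attention repeats this computation with independent `W` and `a` per head and concatenates (or averages) the results.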
Two methodologies for integrating syntactic knowledge into MT engines are introduced. The initial approach, termed syntactic knowledge via graph attention with BERT concatenation (SGBC), merges the syntactic information from graphs with BERT for the encoder’s operation, as detailed in equations (3) and (4).
The second, called syntactic knowledge via graph attention with BERT and decoder (SGBD), applies the syntactic knowledge on the graph not only to the encoder but also to guide the decoder through syntax-decoder attention, as shown in equations (5), (6), and (7).
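A plausible sketch of the SGBC-style concatenation is shown below; the exact equations are not reproduced here, and the dimensions and the single fusion projection are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

# Sketch of SGBC-style fusion (assumed form): per-token BERT states and
# GAT node states are concatenated along the feature axis and projected
# back to the model dimension before entering the encoder stack.
rng = np.random.default_rng(0)
n_tokens, d_bert, d_gat, d_model = 12, 768, 256, 768

h_bert = rng.standard_normal((n_tokens, d_bert))   # last-layer BERT states
h_gat = rng.standard_normal((n_tokens, d_gat))     # GAT node states
W_fuse = rng.standard_normal((d_bert + d_gat, d_model)) * 0.01  # assumed projection

fused = np.concatenate([h_bert, h_gat], axis=-1) @ W_fuse  # (n_tokens, d_model)
```

In the SGBD variant, the graph states would additionally serve as keys and values for an extra syntax-decoder attention sublayer, rather than being consumed only on the encoder side.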
In the domain of MT, there is an active search for accurate and reliable evaluation metrics. Among these metrics, BLEU (Papineni et al., 2001) has become a fundamental tool for evaluating the quality of text translated from one language to another. BLEU functions by comparing machine-generated translations to one or more reference translations, primarily focusing on the precision of matching n-grams between them.
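The modified n-gram precision at the core of BLEU can be made concrete with a compact sentence-level sketch. Production evaluation should use an established implementation such as sacreBLEU; smoothing is omitted here for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-gram tuples in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:          # any empty precision zeroes the geometric mean
        return 0.0
    # brevity penalty discourages translations shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("the new spending is fueled by a large account".split(),
             "the new spending is fueled by the large bank account".split())
```

The clipping step is what makes the precision "modified": a candidate cannot be rewarded for repeating an n-gram more often than it appears in the reference.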
QE offers an innovative approach to translation assessment that does not require reference texts, by building models that directly predict whether the suggested translation is an accurate and fluent translation of the source text. This method is not only innovative but also practical, especially in contexts where reference translations are unavailable. QE engines can be trained to evaluate various aspects including fluency, adequacy, and even the predicted postediting effort, providing a comprehensive view of translation quality.
In this study, the evaluation of MT primarily employs two methods: the widely recognized BLEU score and reference-free QE.
Datasets
The parallel UD (PUD) corpus is a collection of multilingual datasets designed to facilitate cross-linguistic analysis and the development of MT engines. Comprising texts translated into 20 languages, each dataset within the PUD corpus contains 1,000 sentences that are syntactically annotated, ensuring a high level of linguistic consistency and quality across different languages. These sentences are selected from a wide range of sources, including news articles and Wikipedia, providing a diverse mix of genres and topics.
The experiments utilize three typologically different languages translated into English: PUD Chinese, PUD Russian, and PUD German. The choice of these languages is determined by the availability of the UD corpus for training an external syntactic parser and the PUD corpus for evaluating both the syntactic knowledge of BERT and GAT and the performance of the MT engine.
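The PUD treebanks are distributed in CoNLL-U format. A minimal reader for the fields this pipeline needs (token id, form, head id, and dependency label) might look as follows; this is a sketch with a toy excerpt, and the `conllu` library is the more robust choice in practice.

```python
# A tiny CoNLL-U excerpt (10 tab-separated columns per token line;
# "_" marks unused fields). Values are illustrative.
sample = "\n".join([
    "# text = The new spending ...",
    "\t".join(["1", "The", "the", "DET", "_", "_", "3", "det", "_", "_"]),
    "\t".join(["2", "new", "new", "ADJ", "_", "_", "3", "amod", "_", "_"]),
    "\t".join(["3", "spending", "spending", "NOUN", "_", "_", "5", "nsubj:pass", "_", "_"]),
    "",
])

def read_conllu(text):
    """Parse CoNLL-U text into sentences of {id, form, head, deprel} dicts."""
    sents, cur = [], []
    for line in text.splitlines():
        if not line.strip():              # blank line ends a sentence
            if cur:
                sents.append(cur)
                cur = []
            continue
        if line.startswith("#"):          # comment/metadata lines
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword ranges / empty nodes
            continue
        cur.append({"id": int(cols[0]), "form": cols[1],
                    "head": int(cols[6]), "deprel": cols[7]})
    if cur:
        sents.append(cur)
    return sents

sentences = read_conllu(sample)
```

The `head` and `deprel` fields read here are exactly what the graph-construction step consumes.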
What Happens to Translations
Translation Performance with BLEU and QE
The effectiveness of the proposed approach is evaluated by BLEU score on the UNPC corpus.
As shown in Table 2, the proposed engines consistently achieve higher BLEU scores than the baseline engine across all three translation directions, regardless of the training set size. This underscores the effectiveness and generalization capability of the proposed approach. In the table, bold values indicate the highest BLEU scores for each combination of training set size and language direction, while italic values highlight the scores of the baseline model. SGBC consistently outperforms both the baseline and SGBD. This can be attributed to the fact that the output of SGBC more closely aligns with the criteria used in the BLEU score calculation: it is likely to generate translations with a higher degree of n-gram overlap with the reference translations.
The Performance of SGB Engines Compared to Baseline Engines in BLEU Scores Across Three MT Directions With Varying Training Set Sizes. Despite the Reduced Dataset Size, SGB Engines Maintain Competitive BLEU scores.
Note. BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; BLEU = bilingual evaluation understudy; MT = machine translation; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder; Zh→En = Chinese to English; Ru→En = Russian to English; De→En = German to English.
Table 3 demonstrates that when the training set size reaches 1 million, both SGB series engines exhibit higher scores on the BLEU and COMET QE performance metrics. However, SGBC and SGBD exhibit notable differences in their performance across these metrics: SGBC achieves the highest BLEU scores in all three translation directions, while SGBD obtains the highest COMET and TransQuest QE scores. SGBD’s scores are generally at least two points higher than those of the baseline engines. These performance metrics reflect the engines’ proficiency in leveraging syntactic knowledge from graphs and fully utilizing BERT’s potential language capabilities, enabling them to generate more accurate translations. It is important to note that BLEU can be unreliable, and both BLEU and COMET QE depend on reference translations. In real-world translation scenarios, reference translations may not always be available, and the semantic diversity of acceptable output sentences cannot be reliably verified. Therefore, compared to BLEU and COMET QE scores, the TransQuest QE score offers a more nuanced advantage in adapting to reasonable variations in translation: it does not require reference translations, making it a more robust and practical metric for evaluating translation quality in diverse and realistic settings.
Performance Comparison of BLEU, COMET, and TransQuest Scores for Three Translation Directions (Zh→En, Ru→En, and De→En).
Based on the results of the above experiments, BLEU scores still fail to reflect linguistic subtleties and align with human evaluative criteria (Callison-Burch et al., 2006; Novikova et al., 2017). To address these shortcomings, we employ a gold-standard syntactically annotated corpus, the PUD corpus, together with the TransQuest QE model to further investigate changes in translation quality. The PUD corpus includes sentences from various sources, not limited to news and Wikipedia content, thus placing higher demands on the MT engines’ ability to summarize and clarify the structure of input sentences and ensuring a comprehensive evaluation of their handling of diverse linguistic structures and contexts. Its syntactic annotations additionally provide a gold-standard reference for a detailed analysis of how well the engines capture and translate syntactic dependencies. We utilize the PUD corpus (PUD Chinese, PUD Russian, and PUD German) to evaluate the translation quality of the Baseline and SGB engines across three translation directions. The QE model rates each source language sentence and its translation on a scale from 0 to 1, where higher scores indicate better translation quality. Paired t-tests are then applied to assess the statistical significance of the differences in QE scores between engines.
From Table 4, when comparing the Zh Baseline and SGBC engines, the average of the paired differences in QE scores is positive, indicating that the SGBC engine yields measurably better translations.
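The paired t-test procedure applied here can be sketched on per-sentence QE scores: each source sentence is scored once under the baseline and once under an SGB engine, and the test is run on the per-sentence differences. The scores below are made-up illustrative values, not results from the paper.

```python
import math

# Illustrative per-sentence QE scores (0-1 scale) for the same eight
# source sentences under two engines; values are invented for the sketch.
baseline = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.73, 0.52]
sgb      = [0.67, 0.58, 0.71, 0.55, 0.69, 0.64, 0.74, 0.57]

def paired_t(x, y):
    """t statistic of the paired t-test on differences d_i = y_i - x_i."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)                   # compare to t_{n-1}

t_stat = paired_t(baseline, sgb)   # positive: SGB scores higher on average
```

The resulting statistic is compared against the t distribution with n-1 degrees of freedom to obtain a p-value; in practice `scipy.stats.ttest_rel` does both steps.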
The Baseline and the SGB Engines Compare the Translations of the PUD Corpus, Scored by the QE Model and Subjected to Paired t-Tests to Demonstrate the Differences in Translation Quality Scores.
Note. BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; PUD = parallel universal dependencies; QE = quality estimation; Zh = Chinese; Ru = Russian; De = German; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder.
Comparable outcomes are evident for Ru and De: once the proposed methodologies are applied, translation quality diverges significantly from the baseline, as gauged by QE scores. The incorporation of syntactic knowledge via graph representations alongside BERT substantially enhances the translation efficacy of the MT engines. Notably, the SGBD engines consistently achieve elevated QE scores, indicating a robust improvement in translation quality. Conversely, while the SGBC engines are favored by BLEU, achieving higher scores under that metric, the QE scores highlight a different aspect of translation quality, underscoring the more nuanced and comprehensive analysis provided by QE over BLEU. This divergence underscores the complexity of translation quality evaluation, revealing how different metrics prioritize different aspects of translation performance.
Multiple dependency relations signify the structural attributes of a given sentence. To identify which dependency relation in the source language sentence from the PUD corpus contributes most to the enhancement of translation quality through translation engines, we retain and categorize sentences based on their dependency relations. Specifically, both the baseline engine and the two proposed SGB engines translate their own source language sentences from the PUD corpus. The translations are then ranked according to their TransQuest QE scores. The bottom 30% of translations, based on TransQuest QE scores, are considered low-quality translations. Source language sentences corresponding to these low-quality translations and containing the same dependency relation are grouped together. For example, for a given dependency relation, all such sentences containing that relation form one group, and the change in their QE scores under the SGB engines is then examined.
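The grouping procedure just described can be sketched as follows; the QE scores and per-sentence relation sets are illustrative stand-ins, not data from the paper.

```python
# Each record pairs a translation's QE score with the set of dependency
# relations present in its source sentence (toy values).
translations = [
    {"qe": 0.81, "deprels": {"nsubj", "obj", "punct"}},
    {"qe": 0.42, "deprels": {"flat", "nsubj", "punct"}},
    {"qe": 0.74, "deprels": {"amod", "case", "punct"}},
    {"qe": 0.38, "deprels": {"csubj", "mark", "punct"}},
    {"qe": 0.66, "deprels": {"cc", "conj", "punct"}},
    {"qe": 0.90, "deprels": {"det", "root", "punct"}},
    {"qe": 0.29, "deprels": {"orphan", "punct"}},
    {"qe": 0.55, "deprels": {"advmod", "punct"}},
    {"qe": 0.71, "deprels": {"obl", "punct"}},
    {"qe": 0.63, "deprels": {"xcomp", "punct"}},
]

# Rank by QE score and keep the bottom 30% as low-quality translations.
ranked = sorted(translations, key=lambda t: t["qe"])
low = ranked[: int(len(ranked) * 0.3)]

# Bucket the low-quality items by the dependency relations they contain.
by_relation = {}
for t in low:
    for rel in t["deprels"]:
        by_relation.setdefault(rel, []).append(t)
```

Re-scoring each bucket under the SGB engines then yields the per-relation quality changes reported as "Qual" in Table 5.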
Table 5 details how SGB engines outperform the baseline engines in accurately identifying syntactic relations within source language sentences, thereby markedly improving translation quality. It particularly emphasizes the top five syntactic relations that contribute to this improvement. Although both SGBC and SGBD engines incorporate graph-based syntactic knowledge, their approaches to learning dependency relations diverge. For instance, the “flat” (flat structure) relation in Zh is markedly significant in the SGBC engine yet receives less emphasis in the SGBD engine. Although SGBD’s decoder is similarly guided by syntactic knowledge derived from graph representations, it does not uniformly excel across all syntactic relations in achieving a higher QE score than the SGBC engine. Specifically, in Zh, Ru, and De, the SGBC model outperforms SGBD in handling certain syntactic relations, including “discourse:sp” (discourse marker: speech), “orphan” (orphan), and “csubj” (clausal subject). This discrepancy may suggest that an overly focused reliance on syntactic knowledge could lead to knowledge redundancy, detrimentally affecting translation quality in the SGBD engine. Conversely, the importance of some syntactic relations remains consistent across both SGBC and SGBD engines, underscoring that the integration of syntactic knowledge via graph attention alongside BERT enables the MT engine to more precisely address specific common relations. This consistency, irrespective of the methodological differences between the two engines, indicates that leveraging graph-based syntactic knowledge in conjunction with BERT enhances the MT engine’s ability to explicitly navigate certain syntactic structures, thus contributing to the refinement of translation quality.
The Top-5 Dependency Relations Identified by the SGB Engines Are Those That Show the Greatest Improvement in QE Scores. These Relations Highlight Which Syntactic Dependencies Are Most Effectively Detected and Contribute Most Significantly to the Enhancement of Translation Quality in Each Translation Direction. “Qual” Denotes the Percentage Increase in QE Scores for Sentences Containing Such a Syntactic Structure.
Note . BERT = bidirectional encoder representations from transformers; SGB = syntactic knowledge via graph attention and BERT; Zh = Chinese; Ru = Russian; De = German; SGBC = syntactic knowledge via graph attention with BERT concatenation; SGBD = syntactic knowledge via graph attention with BERT and decoder; QE = quality estimation.
Syntactic Knowledge in GAT
GATs have the capability to represent syntactic structures in sentences using graph-based models. However, whether this capability signifies their ability to effectively learn syntactic knowledge remains an open question. To address this, we design a syntactic dependency prediction experiment where GATs are tasked with predicting the relevant syntactic labels in the syntactic structure. For this experiment, we utilize the PUD corpus, which provides gold-standard syntactic annotations, as our foundational dataset. The experimental process involves converting the syntactic annotations and sentence words into syntactic trees, which are subsequently transformed into graph structures for GAT analysis. In these graph structures, each word is represented as a node, and the edges represent the syntactic dependency connections as defined by the PUD corpus. The primary objective of the GAT is to infer the dependency relations for each word by integrating information from both nodes and edges. Unlike traditional syntactic dependency models, which often follow a unidirectional flow from parent to child nodes, this approach treats dependencies as bidirectional graphs. This bidirectional model acknowledges the mutual influence between parent and child nodes, which is crucial for GATs to understand the varying implications of node connections. By considering these bidirectional relationships, GATs can enhance their ability to accurately identify dependency relations among nodes, thereby improving their syntactic learning capabilities.
Similar to the transformer model, GAT utilizes multi-head attention and layers stacked upon each other. The study initially explores how the number of multi-head attention heads and layers influences GATs’ acquisition of syntactic knowledge, examining the advantages these configurations offer for learning syntactic dependencies. In the experiments, the attention head counts (Heads) tested for GATs are 2, 4, 6, and 8, while the layer counts (L) explored are 2, 3, 4, 5, and 6. For each language, datasets are divided into training, validation, and test sets with 800, 100, and 100 sentences, respectively, to tune hyperparameters, monitor model performance during training to prevent overfitting and evaluate the model on unseen data. The model parameters are set with a learning rate of
Table 6 emphasizes the critical importance of judiciously configuring the number of attention heads and layers in GAT, as this configuration significantly influences the model’s sensitivity to accurately learn syntactic knowledge. In the table, bold values indicate the highest performance metrics for each combination of language and number of layers. For example, the Russian language experiment reveals that a GAT setup with two layers and four attention heads outperforms a configuration with eight attention heads in terms of overall prediction efficacy. As the model is expanded to four layers, a higher number of attention heads enhances performance, with the F1-score increasing from 0.44 to 0.57. Conversely, increasing the number of layers tends to degrade the model’s ability to accurately predict dependency relations. Specifically, a configuration with two layers outperforms one with six layers, regardless of the number of attention heads. This decline suggests that an increase in GAT layers might lead to performance degradation, potentially due to nodes losing their specific attributes or incorporating irrelevant information during the aggregation process.
GAT Performance in Syntactic Dependency Prediction for Three Languages With Different Numbers of Attention Heads and Layers. The Number of Attention Heads Increases Incrementally From 2 to 8, and the Number of Model Layers Increases From 2 to 6.
Note . GAT = graph attention network; Zh = Chinese; Ru = Russian; De = German.
When examining the prediction scores for individual dependency relations across the three languages, the results further validate this observation. As shown in Table 7, when the number of layers exceeds 3, the F1-scores for some syntactic relations tend to decrease and even drop to 0 as the number of layers increases. Increasing the number of attention heads does little to mitigate this degradation. Bold values in the table indicate the highest F1-scores for each syntactic relation across different configurations of layers and heads. However, certain syntactic tags remain unaffected by this trend. Regardless of the number of layers, GAT consistently learns and maintains high F1-scores for tags such as “advmod” (adverbial modifier), “case” (case marking), “cc” (coordinating conjunction), “mark” (marker), “nsubj” (nominal subject), and “punct” (punctuation). This indicates that GAT exhibits a high sensitivity and reliable capture of these specific syntactic features.
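The per-relation scores discussed above reduce to label-wise F1 over predicted versus gold dependency labels. A self-contained sketch (with toy label sequences, not the paper's data) makes the computation explicit; `sklearn.metrics.classification_report` provides the same breakdown in practice.

```python
# Toy gold/predicted dependency labels for eight tokens.
gold = ["nsubj", "case", "punct", "obl", "case", "advmod", "punct", "obl"]
pred = ["nsubj", "case", "punct", "obl", "obl",  "advmod", "punct", "case"]

def f1_per_label(gold, pred):
    """Per-label F1 = 2*TP / (2*TP + FP + FN) over aligned sequences."""
    scores = {}
    for lab in set(gold) | set(pred):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores

scores = f1_per_label(gold, pred)   # e.g., "punct" is predicted perfectly
```

A label whose F1 drops to zero, as observed for some relations in deeper GAT configurations, is one with no true positives at all.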
The Prediction of Syntactic Dependencies for Three Languages Is Conducted Using Different Numbers of Attention Heads and Layers. As the Number of Layers Increases, the Performance of the GAT in Predicting Dependency Labels Declines, and It Gradually Loses the Ability to Learn Certain Dependency Labels, Resulting in the F1-Scores Dropping to Zero. However, Some Dependency Relations Remain Unaffected and Continue to Achieve Relatively High Prediction Scores.
We continue to compare the F1-scores of GAT’s dependency relation predictions with the QE scores of the SGB engines when processing prior low-quality translations containing these specific dependency relations (from Section 4.3), as shown in Table 8. It highlights the top-10 dependency relations with the highest prediction scores by GAT across various source language sentences, along with the corresponding changes in translation quality facilitated by different MT engines. The results demonstrate a clear positive correlation between GAT’s syntactic dependency prediction scores and the improvement in translation quality, especially when using the SGBC and SGBD engines. For Zh, dependency relations such as “mark” (marker), “cc” (coordinating conjunction), and “conj” (conjunct) have very high prediction scores by GAT (0.986, 0.984, and 0.970, respectively). These high scores correlate with significant improvements in translation quality, as evidenced by the higher QE scores of the SGBC and SGBD models compared to the baseline. Similarly, for Ru, dependency relations such as “det” (determiner), “root” (root), and “amod” (adjectival modifier) have high prediction scores (0.990, 0.987, and 0.982, respectively), leading to notable improvements in translation quality. For De, dependency relations such as “case” (case marking), “cc” (coordinating conjunction), and “det” (determiner) also exhibit high prediction scores (0.992, 0.987, and 0.987, respectively), resulting in improved translation quality. The positive correlation between GAT’s prediction scores and translation quality is consistent across the three languages, suggesting that GAT’s ability to accurately predict syntactic dependencies is a robust indicator of its potential to enhance translation quality. This underscores the importance of integrating syntactic information into MT systems to achieve more accurate and reliable translations. 
Also, the consistent improvement in translation quality across different languages and MT engines demonstrates the robustness of GAT in learning and applying graph-based syntactic structures.
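As a sanity check on the correlation claim above, the sketch below computes a Pearson coefficient between per-relation F1-scores and QE gains. The helper and the paired numbers are illustrative placeholders, not values taken from Table 8.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative placeholders: per-relation GAT F1 vs. QE gain over the baseline.
f1_scores = [0.986, 0.984, 0.970, 0.912, 0.875]
qe_gains = [0.062, 0.058, 0.051, 0.034, 0.029]
r = pearson(f1_scores, qe_gains)  # strongly positive for these placeholders
```

A coefficient near 1 would reflect the positive trend described above; on real data one would pair each relation's F1-score with the measured QE change per engine.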
Top-10 Dependency Relations With the Highest GAT F1-Score Across Various Source Language Sentences, Alongside Corresponding Changes in Translation Quality as Measured by QE Scores From Different MT Engines.
Representational Similarity Analysis
Representational similarity analysis (RSA) is a technique used to analyze the similarity between different representation spaces of neural networks. Inspired by the work of Merchant et al. (2020), RSA is used here to measure, layer by layer, how much BERT’s sentence representations change between the baseline and SGB models.
Table 9 lists partial results from an RSA comparing baseline BERT and SGB models based on syntactic prediction scores by GAT (full results are provided in Appendix 8). The analysis shows that the lowest RSA scores mainly occur in the lower and middle layers of BERT, regardless of whether the model is used in the SGBC or SGBD engine. Specifically, when GAT achieves high F1-scores for a particular dependency relation, the representations of sentences containing this relation typically undergo significant changes in the lower and middle layers of BERT. These changes are most pronounced in layers 3–5 for Chinese and Russian and in layers 5–8 for German. This suggests that the syntactic structure represented through graphs influences BERT’s reanalysis of input sentences, leading to a syntactic reconstruction of the input sentence. Also, the lower and middle layers of BERT are particularly sensitive to modifications in modeling both shallow and deep syntactic structures. In contrast, layers 9–12 are primarily involved in processing abstract semantic information and are task-oriented. However, the RSA scores in these layers do not consistently reach 0.8 or higher (see detailed results in Appendix 8), indicating that changes in the syntactic representation in the lower layers can also affect the processing of deep linguistic information in the upper layers. These findings further explain why integrating syntactic structures represented through graphs can help BERT reconstruct the structure of input sentences, leading to a more accurate representation of source language sentences and, consequently, improved translation quality.
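For readers who want the mechanics, a minimal RSA computation looks as follows: build a representational dissimilarity matrix (RDM) of pairwise cosine distances per model, then correlate the upper triangles of the two RDMs. Random vectors stand in for real BERT layer activations here; this is a sketch of the general technique, not the paper’s exact pipeline.

```python
import numpy as np

def rdm(reps):
    """Representational dissimilarity matrix: 1 - cosine similarity
    between every pair of sentence representations (n x d)."""
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def rsa_score(reps_a, reps_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(reps_a.shape[0], k=1)
    return np.corrcoef(rdm(reps_a)[iu], rdm(reps_b)[iu])[0, 1]

rng = np.random.default_rng(0)
layer_base = rng.normal(size=(20, 768))                     # stand-in: baseline BERT layer
layer_sgb = layer_base + 0.1 * rng.normal(size=(20, 768))   # stand-in: lightly changed layer
score = rsa_score(layer_base, layer_sgb)  # high score = similar representation geometry
```

A low RSA score for a given layer, as reported for BERT’s lower and middle layers above, indicates that integrating the syntactic graphs substantially reorganized that layer’s representation space.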
Top-5 Syntactic Labels With the Highest F1-Scores for GAT Predictions for Each Language, Along With the BERT Layers Where the Lowest RSA Scores are Observed.
Note . GAT = graph attention network; RSA = representational similarity analysis; BERT = bidirectional encoder representations from transformers; SGBD = syntactic knowledge via graph attention with BERT and decoder.
RSA scores for representations from the baseline and SGBD models for comparison.
The impact of BERT and graph-based syntactic knowledge on enhancing translation quality presents an area for further investigation, particularly concerning the robustness of syntactic knowledge. This raises questions about the relative contributions of BERT versus graph-based syntactic knowledge to translation quality and the potential limitations of the proposed MT engines. To address these questions, the study involves altering the word order in source language sentences from each language in the PUD corpus. For example, the sentence “A B C D E F” is transformed into a randomized sequence like “C B A D F E.” Both the baseline and SGB engines are then tasked with translating these modified sentences. The translations are subsequently reassessed by the TransQuest QE model, which compares the translations of the shuffled sentences against those of the original, orderly sentences. This comparison provides insights into the adaptability and efficacy of syntactic knowledge in translation.
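The shuffling step itself is simple; a sketch follows, where the fixed seed and whitespace tokenization are assumptions for illustration (a real pipeline would shuffle the tokens produced by the corpus tokenizer).

```python
import random

def scramble(sentence, seed=42):
    """Randomly permute the word order of a whitespace-tokenized sentence.

    A fixed seed keeps the perturbation reproducible across engines, so the
    baseline and SGB models translate the same scrambled input.
    """
    words = sentence.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)

original = "A B C D E F"
shuffled = scramble(original)  # same words, randomized order
```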
To further validate the importance of accurate syntactic knowledge in enhancing the performance of the proposed MT engines, we conduct an additional experiment in which we intentionally introduce incorrect syntactic graphs. In this experiment, we replace the parsers for Chinese, Russian, and German with an English parser to extract the syntactic structures of these three source languages. The deliberately incorrect syntactic graphs are then fed to the SGBC and SGBD engines. The goal is to observe how the performance of these models is affected when provided with inaccurate syntactic information.
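However the parse is obtained, the dependency edges reach a GAT as an adjacency matrix over the same tokens, so a mismatched parser simply produces a different matrix for identical input. A minimal construction, with hypothetical head indices standing in for real parser output:

```python
import numpy as np

def dep_adjacency(heads, self_loops=True):
    """Build a symmetric adjacency matrix from 1-based head indices,
    one per token; head 0 marks the root and contributes no edge."""
    n = len(heads)
    adj = np.zeros((n, n), dtype=float)
    for dep, head in enumerate(heads):
        if head > 0:  # skip the root's virtual head
            adj[dep, head - 1] = adj[head - 1, dep] = 1.0
    if self_loops:
        adj += np.eye(n)  # GAT variants commonly add self-loops
    return adj

# Hypothetical heads for a 3-token sentence from a correct parse ...
correct = dep_adjacency([2, 3, 0])
# ... versus heads from a mismatched (e.g., wrong-language) parser.
wrong = dep_adjacency([3, 1, 2])
```

Because attention in a GAT is masked by this matrix, a wrong parse silently redirects which token pairs may attend to each other, which is the mechanism behind the degradation measured below.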
As shown in Figure 3, scrambled word sequences in source sentences cause a significant decrease in translation quality for both the baseline and SGB engines across all MT directions. Integrating GAT into the encoder or providing explicit syntactic knowledge to the decoder does not guarantee a substantial improvement in translation quality: neither intervention can realistically be expected to raise the median QE scores in the box plots from below 0.4 back to around 0.7. This finding suggests that BERT plays the more crucial role in forming representations of source sentences and shaping translation quality in this hybrid approach. The scrambling of input sentence order, which destroys syntactic information, shows that while the SGB engines, aided by graph-based syntactic knowledge, can mitigate some of the negative effects, they still cannot interpret and comprehend the semantics of jumbled sentences as effectively as humans.

The box plot distribution of QE scores for translations in three MT directions, contrasting translations from ordered (above) versus disordered (below) source language sentence arrangements.
Table 10 provides a detailed comparison of QE scores for the SGBC and SGBD models when using correct versus incorrect syntactic graphs. In all translation directions, the introduction of incorrect syntactic graphs results in a significant decrease in QE scores for both the SGBC and SGBD models, with reductions exceeding 15% in all cases. The largest decrease in QE scores is observed for the Zh→En direction.
Comparison of QE Scores With Correct and Incorrect Syntactic Graphs for SGBC and SGBD Engines and the Percentage Decrease in QE Scores.
These findings highlight that accurate syntactic graphs are not only beneficial but essential for maintaining high-quality translations, as inaccuracies in these graphs significantly affect the performance of MT systems. However, the performance degradation is not as severe as when input sentences are randomized. This further suggests that in the SGB models, BERT plays a dominant role, and while incorrect syntactic graphs do harm performance, the impact is more severe when the input errors are so significant that even BERT cannot effectively process them.
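The percentage decreases reported above follow the usual relative-change formula; a small helper with illustrative values (not figures from Table 10):

```python
def qe_drop_pct(correct_qe, incorrect_qe):
    """Percentage decrease in QE score when syntactic graphs are wrong."""
    return 100.0 * (correct_qe - incorrect_qe) / correct_qe

# Illustrative: a QE score falling from 0.60 to 0.48 is a 20% reduction,
# above the >15% threshold reported for all directions.
drop = qe_drop_pct(0.60, 0.48)
```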
The central focus of this investigation is to determine whether the proposed use of syntactic knowledge on graphs continues to benefit alternative pretrained models, thereby further improving translation quality. XLM-Roberta-large (Conneau et al., 2020) replaces BERT in all three MT scenarios. To distinguish them from the earlier versions, MT engines incorporating XLM-Roberta-large are labeled Baseline-X, SGBC-X, and SGBD-X. The same three translation directions (Zh→En, Ru→En, and De→En) are evaluated as before.
Table 11 demonstrates that both SGB engines consistently achieve higher BLEU scores than Baseline-X across various MT directions, with the SGBD-X engine surpassing the SGBC-X engine in every scenario. Bold values indicate the highest BLEU scores for each translation direction. Furthermore, Figure 4 illustrates the QE scores for translations within the PUD corpus for each engine. Baseline-X yields the highest number of translations with QE scores in the 0.2, 0.3, and 0.4 intervals along the horizontal axis, that is, the largest share of low-quality translations among the engines.

Distribution of QE scores for the MT engines after replacing BERT with XLM-Roberta-large.
BLEU Scores in Different MT Directions for the MT Engines That Replaced BERT With XLM-Roberta-Large.
The demonstrated efficacy of our method with XLM-Roberta indicates its applicability beyond a single pretrained model, extending to encoder-based pretrained models in general. This suggests that our approach is not confined to a specific architecture. However, adapting our method to other pretrained models, such as GPT or T5, presents distinct challenges. These models are primarily decoder-based and sequence-to-sequence models, respectively, which differ significantly from the encoder-based architecture of XLM-Roberta. Integrating syntactic knowledge into these models may necessitate alternative strategies, such as modifying the input format or adjusting the attention mechanisms. Despite these challenges, the potential benefits of incorporating syntactic knowledge into a broader range of pretrained models are substantial, as it can lead to more accurate and contextually appropriate translations. Future research will explore these adaptations to further enhance the robustness and applicability of our method.
This study explores the integration of syntactic knowledge into MT, particularly focusing on the evaluation of BERT and GAT. Two SGB engines are introduced for translating from Chinese to English (Zh→En), Russian to English (Ru→En), and German to English (De→En).
