Abstract
Introduction
One of the core ideas of the Semantic Web is to extend the current Web by enabling machines to understand and respond to complex human requests based on the meaning of data objects, rather than shallow representations such as keywords. Central to this vision is structuring data so that meaningful interconnections can be drawn between different data points, such as documents, images, or concepts. Taxonomies, which classify and organize concepts into hierarchical structures, are essential to this process: they provide the backbone for organizing information in a way that is both accessible and meaningful. By categorizing data into well-defined classes and relationships, taxonomies facilitate the creation of ontologies, more complex frameworks that define the relationships between concepts in the Semantic Web. Such ontologies are intended to enable more accurate data retrieval, allowing for richer, more nuanced interactions with web content. In essence, taxonomies serve as the building blocks of the Semantic Web, providing the necessary structure for data. Thus, integrating taxonomies into the Semantic Web framework could enhance its ability to handle complex queries, making it a more powerful tool for knowledge discovery and data management.
More formally, a taxonomy is a directed acyclic graph that organizes concepts through various relationships, with each node representing a specific concept connected to others via IS-A relations. Prime examples of such taxonomies are the Princeton WordNet for the English language (Miller, 1998) 1 (further referred to as WordNet) and the Open English WordNet (McCrae et al., 2019), 2 a project that further develops WordNet in an open-source, collaborative manner. 3 For easy interoperability with the Semantic Web, the latter resource is shared in various formats, including the RDF (Turtle) format. 4 WordNets for other languages are also available, and many of them are maintained under the Global WordNet Association. 5 These semantic graphs connect terms into a hierarchical structure via hyponymy/hypernymy relations, while also featuring other types of relations, such as synonymy, antonymy, metonymy, and troponymy. WordNet not only includes nodes but also provides definitions, multiple lemmas, and unique sense numbers to distinguish between different meanings within the same synset.
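To make the structure concrete, the hypernym chain implied by IS-A edges can be sketched with a toy graph; the synset names below are illustrative stand-ins, not data taken verbatim from WordNet:

```python
# Toy IS-A taxonomy: each synset maps to its hypernym (IS-A parent).
# Illustrative data only -- real WordNet synsets additionally carry
# lemmas, definitions, and sense numbers alongside the graph structure.
IS_A = {
    "tiger.n.02": "big_cat.n.01",
    "big_cat.n.01": "feline.n.01",
    "feline.n.01": "carnivore.n.01",
    "carnivore.n.01": "placental.n.01",
}

def hypernym_chain(synset: str) -> list[str]:
    """Follow IS-A edges upward until a root (no listed parent) is reached."""
    chain = [synset]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain

print(hypernym_chain("tiger.n.02"))
# -> ['tiger.n.02', 'big_cat.n.01', 'feline.n.01', 'carnivore.n.01', 'placental.n.01']
```

Walking such chains is the basic operation behind hypernym discovery and the other tasks discussed below.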
The use of taxonomies is well-justified in various NLP tasks, including entity linking (Corro et al., 2015), named entity recognition (Toral & Muñoz, 2006), and several others (Lenz & Bergmann, 2023; Wang et al., 2023). Yet, despite the widespread adoption of large language models (LLMs), taxonomies continue to be constructed and curated primarily through the manual efforts of language experts, such as linguists and lexicographers, but also of enthusiasts and crowdworkers. Earlier neural approaches to natural language processing struggled to automate taxonomy construction effectively, but this limitation may not apply to the latest generation of LLMs. While some research has shown that Transformer models underperform in this area, these studies were conducted using much less powerful language models than those available today (Hanna & Mareček, 2021; Nikishina et al., 2022a; Radford et al., 2019).
Recent studies of LLMs highlight their impressive capacity to internally store vast amounts of knowledge (Kauf et al., 2023; Sun et al., 2024; Tang et al., 2023). Additionally, as these models scaled up, they demonstrated emergent in-context learning abilities, enabling rapid adaptation to new tasks (Dong et al., 2024). These observations suggest that LLMs could also be effectively leveraged for lexical semantic tasks, such as taxonomy construction. However, despite some previous attempts to apply LLMs in this domain, research remains limited. The few studies that have explored LLMs for lexical semantics have primarily focused on hyponymy and hypernymy relationships, with little attention given to other types of graph relations (Chernomorchenko et al., 2024; Nikishina et al., 2023, 2022b).
Moreover, these studies have generally been limited to hypernym discovery, neglecting the broader range of tasks that taxonomies can support. For instance, research on taxonomy enrichment often uses LLMs only to extract representations that are then fed into Graph Neural Networks (GNNs) (Scarselli et al., 2008) or other simpler graph embedding models, such as node2vec (Grover & Leskovec, 2016), rather than directly employing LLMs for the full range of tasks (Jiang et al., 2022).
In this paper, we aim to fill the gap in existing research by exploring how modern foundation models can learn and apply taxonomy graph relations across multiple lexical semantic tasks. Specifically, we focus on using a single LLM to tackle four distinct tasks simultaneously: taxonomy construction, hypernym discovery, taxonomy enrichment, and lexical entailment. We hypothesize that contemporary LLMs, when pretrained exclusively on the Princeton WordNet, can effectively learn taxonomy relations by leveraging their inherent language knowledge and align it with the established human-labeled structure.
To sum up, the contribution of the paper is as follows:
- We introduce a novel dataset creation method for taxonomy learning with LLMs.
- Using the developed method, we create an instruction-tuning dataset based on the Princeton WordNet.
- Using the dataset, we train the TaxoLLaMA family of models.
- We conduct a comprehensive evaluation across four lexical semantic tasks: taxonomy construction, hypernym discovery, taxonomy enrichment, and lexical entailment.
- Finally, we analyze the models' behavior and errors in detail across subtasks.
We make the data, code, and models produced in this study publicly available. 8
This work is an extended version of research originally presented in two conference papers (Moskvoretskii et al., 2024a, 2024b). The novelty of the present journal article compared to these prior publications lies in the following additional contributions:
- We deliver new TaxoLLaMA models based on more recent LLMs.
- We evaluate several modern instruction-tuned LLMs in zero-shot and few-shot settings.
- We perform an extended error analysis using precision and recall metrics.
- We report results of taxonomy construction on an additional structural metric, the Fowlkes & Mallows index.
- Finally, we perform a study of taxonomy self-refinement strategies.
In particular, we study: how LLMs can resolve graph cycles using extracted relations to improve the quality of taxonomy construction; how LLMs can leverage multiple relations to refine an already constructed graph; and how bidirectional relations can be used to refine constructed taxonomies.
In this section, we provide a brief overview of previous approaches to the lexical semantics tasks that are the focus of our experiments. We explore the development of graph and taxonomy construction methods and discuss the challenges where taxonomic knowledge has been shown to be particularly advantageous.
Taxonomies and Large Language Models
To the best of our knowledge, most existing papers do not consider generative transformers for taxonomy learning; research has mostly focused on encoder-based rather than GPT-style models. Notable examples include using a pre-trained BERT encoder to estimate hypernymy (Chen et al., 2021; Davies et al., 2023; Hanna & Mareček, 2021). Most studies involving LLMs in taxonomy construction have explored the use of models like LM-Scorer (Jain & Espinosa Anke, 2022), which employs BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) among masked LMs, and GPT-2 (Radford et al., 2019) among causal LMs. These studies typically employ zero-shot sentence probing or experiment with prompts for taxonomy learning. However, their results have not surpassed the state-of-the-art GNN models for tasks like TexEval-2. Notably, there is a lack of research comparing these methods to more recent open-source models such as LLaMA-2 (Touvron et al., 2023a) and Mistral (Jiang et al., 2023) on taxonomy-related tasks; this comparison is part of the current paper.
Hypernym Discovery
The task of hypernym discovery involves generating a list of hypernyms for a given hyponym, as illustrated in Figure 1(a). A recent contribution in this area is a taxonomy-adapted, fine-tuned T5 model introduced by Nikishina et al. (2023). Prior to this, several approaches had been explored. The 300-sparsans method (Berend et al., 2018) improves upon the traditional word2vec technique. The Hybrid model (Held & Habash, 2019) combines the k-Nearest Neighbor method with Hearst patterns. CRIM (Bernier-Colborne & Barrière, 2018), recognized as the best performer in the SemEval competition, uses a Multilayer Perceptron (MLP) structure with a contrastive loss function. Lastly, the Recurrent Mapping Model (RMM) (Bai et al., 2021) employs an MLP with residual connections and a contrastive-like loss function.

Examples with input and output for each task are highlighted by color. The rectangle “hypernym” denotes a word generated by the model; a circle denotes a node from the graph. The confidence score determines the existence of a relationship between the two nodes provided in the input. (a) Lexical semantic tasks and (b) generation and ranking pipelines for solving various lexical semantic tasks.
Taxonomy Enrichment
The task of taxonomy enrichment involves determining the most suitable position for a missing node within a taxonomy, addressed in SemEval-2016 Task 14 (Jurgens & Pilehvar, 2016). Over the past few years, various architectures have been developed to tackle this task. TMN (Zhang et al., 2021) uses multiple scoring mechanisms to identify suitable positions for a query concept.
Taxonomy Construction
The task of taxonomy construction focuses on building a domain taxonomy starting from a raw list of terms. Previously, this task was solved with GNNs, such as Graph2Taxo (Shang et al., 2020), or by employing zero-shot language models to score pairs or mask-token probabilities, such as LMScorer and RestrictMLM (Jain & Espinosa Anke, 2022). Other approaches instead focus on Hearst patterns boosted with Poincaré embeddings for refinement.
Lexical Entailment
The task of lexical entailment involves classifying the semantic connections between word pairs. For instance, if we consider the term “tiger” (a hyponym), it inherently suggests the broader category “big cat” (a hypernym).
Recent research in lexical entailment includes various innovative models. SeVeN (Espinosa-Anke & Schockaert, 2018) encodes relationships between words, while Pair2Vec (Joshi et al., 2019) and a modified GloVe approach from Jameel et al. (2018) utilize word co-occurrence vectors along with Pointwise Mutual Information to understand semantic connections. The LEAR model (Vulić & Mrkšić, 2018), on the other hand, fine-tunes Euclidean space to better reflect hyponymy–hypernymy relationships. Among graph-based approaches, the “Global” Entailment Graph (GBL) (Hosseini et al., 2018) employs a GNN focusing on local learning, while its evolution, the “Contextual” Entailment Graph (CTX) (Hosseini et al., 2021), enhances this by integrating contextual link prediction. The CTX model was later improved with an entailment smoothing technique proposed by McKenna et al. (2023), which currently holds the SOTA for this task.
Methodology
In this section, we describe the process of building an instruction-tuning dataset specifically designed for taxonomy learning using LLMs and further fine-tuning.
Dataset Construction Method
The dataset creation process is largely based on the English Princeton WordNet 3.0, chosen for its structured and well-maintained organization. Our focus is mainly on the noun subgraph, not only because it represents the most frequent category in WordNet, but also because recent research (Lazaridou et al., 2021) has identified it as a challenging class for language models to master.
We begin dataset creation by utilizing a directed acyclic graph (DAG) derived from WordNet, which is structured around “IS-A” relationships. Next, we randomly select edges or subgraphs from this graph, dividing them into different subsets while taking into account all possible tree operations. A comprehensive explanation of the dataset construction algorithm can be found in Section 3.1.1.
We posit that a diverse dataset encompassing various taxonomy relations offers two key advantages:
A diverse dataset enhances the model’s ability to generalize, enabling it to understand broader relationships between words across a wide range of subtasks. A diverse dataset also enables the model to develop and apply various strategies for effective taxonomy construction.
To account for the widest possible range of tree operations within the graph, we gather four distinct subsets, with a particular emphasis on hyponym and hypernym prediction. The tasks include the following scenarios (as illustrated in Figure 2):

Examples of IS-A relation structures: (A) hyponym prediction, (B) hypernym prediction, (C) synset mixing, and (D) synset insertion.
We ensure that our test and training datasets are completely distinct, with no overlap between them. Specifically, none of the test nodes is included in any of the subtask scenarios. The statistics for each subset are detailed in Table 1.
Statistics of Taxonomy Subtask Samples Used for Training and Evaluation.
Each column shows the number of samples per subtask type in different training configurations.
More formally, we present the method in Algorithm 1. The inputs of this algorithm are the subsets A, B, C, and D mentioned above, corresponding to the subtasks. These sets are derived from the graph and are represented as collections of the following mini-sets:
Here,
To facilitate comprehensive set intersections, we introduce the concept of “deep intersection,” denoted as
In the next phase, our goal is to generate random training and testing sets while aiming to balance the categories as much as possible, although some relations are less frequently represented. We ensure that the training set primarily consists of hyponym and hypernym predictions, while other types of samples are evenly distributed. This task is challenging due to the potential for significant overlap among different cases and the sequence in which samples are collected. To manage this complexity, we introduce a distribution over subtasks, denoted as
To regulate the likelihood of samples being allocated to the test set, a Bernoulli distribution was considered, denoted as
For
For
During data collection, we utilize the “pop()” operation, which removes and returns the last element from a set.
To manage the complexities associated with dominant word categories, we perform a topological sort on the graph. We then ensure that no vertex in our sets has a level lower than a specified parameter, referred to as “level.” This condition is expressed as:
We also designate a “target” vertex for each element within the subtasks. This enables us to monitor the inclusion of this specific target vertex in the test set, ensuring the integrity of our evaluation process. The definitions of these “target” vertices vary depending on the subtask and can be outlined as follows:
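Putting the pieces above together, a minimal sketch of the sampling procedure might look as follows. The `min_level` and `p_test` parameters and the level computation are our simplified stand-ins for the actual algorithm, not the authors' exact implementation:

```python
import random
from collections import deque

def node_levels(edges):
    """Topological levels for an acyclic graph of (parent, child) edges:
    roots sit at level 0, each child one below its deepest parent."""
    children, parents, nodes = {}, {}, set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        parents.setdefault(child, []).append(parent)
        nodes.update((parent, child))
    level = {n: 0 for n in nodes if n not in parents}  # roots
    queue = deque(level)
    while queue:
        n = queue.popleft()
        for c in children.get(n, []):
            if level.get(c, -1) < level[n] + 1:
                level[c] = level[n] + 1
                queue.append(c)
    return level

def split_edges(edges, min_level=2, p_test=0.1, seed=0):
    """Drop edges whose child lies above `min_level` (too close to the
    root), then assign each remaining edge to the test set with
    probability p_test (Bernoulli trial)."""
    rng = random.Random(seed)
    level = node_levels(edges)
    train, test = [], []
    for parent, child in edges:
        if level[child] < min_level:
            continue  # excluded by the level constraint
        (test if rng.random() < p_test else train).append((parent, child))
    return train, test
```

The real algorithm additionally balances the subtask categories and tracks the designated target vertices; this sketch only illustrates the level constraint and the Bernoulli allocation.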
In this section, we investigate LLM’s ability to learn taxonomic relations.
Models
Using our collected dataset, we train a series of models under the TaxoLLaMA family:
In addition to our fine-tuned models, we evaluate several modern instruction-tuned LLMs in both zero-shot and few-shot settings:
Our inputs include a system prompt that looks as follows:
Then we introduce a technical-style input prompt and the expected output format:
In addition to the data collected through the described algorithm, it is crucial to disambiguate the sense of the input node. We employ three ways of doing this: incorporating the sense number ID from WordNet, lemmas, or definitions, in the following prompt versions:
Since definitions might not be available for certain subtasks during inference, such as lexical entailment or taxonomy enrichment, we also generate definitions using ChatGPT for test sets that lack pre-existing explanations, or source them from Wikidata.
For generating definitions, we used the web interface of ChatGPT 3.5 of February 2024 and the “gpt-3.5-turbo” model from the same period. The prompts used for these requests, along with the statistics of the generated definitions, are detailed in the Appendix A, specifically in Examples 9-10 and Table 16. This step is crucial, as experiments have shown that the absence of definitions can significantly reduce the model’s performance (Moskvoretskii et al., 2024b).
Evaluation
We evaluate the performance of our models using the Mean Reciprocal Rank (MRR), the average of the reciprocal ranks at which the first correct answer appears. We chose MRR over other ranking metrics, such as Precision@k or MAP, because they might impose overly stringent criteria that do not reflect a model’s understanding of the taxonomy. To assess the models, we parse the generated output into a list of candidates separated by commas and then match these candidates against the target words.
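As a concrete illustration of the metric, a minimal MRR implementation over comma-separated candidate lists could look like this (a sketch, not the exact evaluation code used in the paper):

```python
def mrr(predictions, gold):
    """Mean Reciprocal Rank: for each query, add 1/rank of the first
    predicted candidate found in the gold set (0 if none match)."""
    total = 0.0
    for preds, answers in zip(predictions, gold):
        for rank, cand in enumerate(preds, start=1):
            if cand in answers:
                total += 1.0 / rank
                break
    return total / len(predictions)

# Comma-separated model outputs parsed into ranked candidate lists:
preds = [
    "feline, big cat, animal".split(", "),
    "plant, organism".split(", "),
]
gold = [{"big cat"}, {"animal"}]
print(mrr(preds, gold))  # -> 0.25  (1/2 for the first query, 0 for the second)
```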
Fine-Tuning Details
To optimize several models, we applied a 4-bit quantization technique. Subsequently, we fine-tuned them using LoRA (Hu et al., 2022) for one training epoch with a batch size of 64. We used the AdamW optimizer with a learning rate of
Applications
We further hypothesize that such trained models will be effective in solving taxonomy-related out-of-domain tasks. We propose two ways of adapting to tasks:
Results
The results of tested models are summarized in Table 2. In the zero-shot setting, all base models perform poorly, with MRRs generally below 0.2. Notably, Qwen2.5 models outperform GPT-2 and Phi-3-mini, especially on hypernym detection, which shows that it might be easier to predict hypernyms. However, scores remain low overall, highlighting the difficulty of taxonomic inference without supervision.
Mean Reciprocal Rank (MRR) scores across four taxonomy subtask types: Hyponym prediction, Hypernym prediction, Insertion, and Synset Mixing, along with the overall mean score.
The best-performing model on each task is shown in bold , and the second-best is underlined . Fine-tuned variants of our proposed models – TaxoLLaMA and TaxoLLaMA3.1 – demonstrate substantial improvements.
Table 3 presents precision and recall metrics for the hyponym prediction task using the TaxoLLaMA
Evaluation of Precision and Recall metrics for the Hyponym subtask using the TaxoLLaMA
The results show that Precision is higher than Recall at the earliest ranks, indicating that the model’s top predictions are generally accurate. However, as the rank increases, recall improves and eventually surpasses precision, suggesting that the model retrieves more relevant hyponyms but with lower accuracy at deeper levels. This pattern reflects a trade-off between precision and recall, which contributes to the overall modest MAP scores for the hyponym subtask.
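The trade-off described above can be made concrete with a small Precision@k / Recall@k computation over toy data (assuming a single gold set per query):

```python
def precision_recall_at_k(ranked, gold, k):
    """Precision@k and Recall@k for one ranked candidate list
    against a gold set of correct answers."""
    top_k = ranked[:k]
    hits = sum(1 for c in top_k if c in gold)
    return hits / k, hits / len(gold)

ranked = ["cat", "dog", "tiger", "lion", "rock"]
gold = {"cat", "tiger", "lion", "puma"}
print(precision_recall_at_k(ranked, gold, 1))  # -> (1.0, 0.25)
print(precision_recall_at_k(ranked, gold, 5))  # -> (0.6, 0.75)
```

At rank 1, precision is perfect but recall is low; by rank 5, recall has overtaken precision, mirroring the pattern reported for the hyponym subtask.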
The few-shot setting leads to modest gains, particularly for Qwen2.5-7B, which achieves 0.259 average MRR and the highest insertion score. However, Phi-3-mini and LLaMA3.1-8B show only marginal improvements or even regressions, suggesting limited benefit from prompting for these tasks.
In contrast, the fine-tuned models significantly outperform all others. Our proposed latest models—TaxoLLaMA3.1 and TaxoLLaMA3.1-ALL achieve the top two average scores. TaxoLLaMA3.1 ranks first overall, with the highest scores on hyponym (0.329), hypernym (0.517), and strong performance on synset mixing. Interestingly, while TaxoLLaMA3.1-ALL trails slightly in average score, it leads on synset mixing (0.280), indicating its robustness across structural variations.
The evaluation reveals a clear contrast between TaxoLLaMA and TaxoLLaMA3.1. The former performs better on the Insertion subtask but struggles with Hyponym prediction, while the latter shows improved results on Hyponyms but a decline in Insertion performance. As shown in Table 4, TaxoLLaMA3.1 achieves higher Precision and Recall on Insertion but a lower MRR. This suggests that although it retrieves more correct answers overall, they appear later in the ranking; on Hyponyms, by contrast, it identifies correct answers earlier, indicating stronger early precision on that subtask.
Evaluation of Precision and Recall metrics for the Hyponym and Insertion subtasks using the TaxoLLaMA3.1 model.
Among the disambiguation techniques, definitions perform best, particularly on the hypernym and insertion tasks. The Mistral-7B backbone achieves similarly high performance on hypernym and insertion but underperforms on the other tasks.
Interestingly, the top-performing model, TaxoLLaMA3.1, underperforms on the insertion task compared to earlier TaxoLLaMA variants. This may suggest that modern models tend to prioritize easier-to-learn taxonomic patterns during optimization, potentially overlooking more structurally complex relations like insertion.
We believe that model size is the main contributing factor, rather than the amount of pre-training data. Despite using lemmas or definitions for disambiguation, the score does not change drastically in the worst cases, showing that disambiguation is not the key problem. Moreover, the underperformance may be linked to the sequential nature of the LM loss in instruction tuning. With multiple correct answers, it is difficult to apply the loss properly, as different orderings of the correct nodes imply completely different loss values. This problem usually arises with hyponym prediction.
To provide a better understanding of the underlying processes, we split the hyponymy cases into more detailed categories. The narrower cases, illustrated in Figure 3, are as follows:

Examples of hyponym subtasks: Leaves Divided (A), Internal Nodes (B), Only Leaves (C), Single Leaves (D).
The results in Table 5 show that terminal nodes are predicted better than internal ones. We believe this result stems from the ambiguity of internal nodes, as we noted through manual examination. The main issue with predicting internal nodes (3B) is the prediction of more distant nodes (with hop
MRR Scores for the LLaMA-2 Model with Different Hyponym Prediction Subtasks; Column Names Correspond to Figure 3.
To better understand the consistently low average results, we closely examined the model outputs and found that the complexity of the dataset could be a significant factor. Some synsets within the WordNet taxonomy may be overly specialized, which poses a challenge for the model when predicting hyponyms or hypernyms. To investigate this possibility, we categorized our dataset into two distinct groups: commonly known words (classified as the “easy” subset) and overly specialized terms (the “hard” subset).
We revisit the performance metrics for both the “easy” and “hard” subsets and summarize the results in Table 6. Interestingly, models generally performed better on the “hard” nodes, especially when it came to predicting hyponyms. However, when using our best model that incorporates word definitions, the “easy” instances yielded higher scores, particularly in cases that did not involve hyponym predictions. This trend, though, is not consistent across all prompt types; in some cases, “hard” instances were more accurately predicted, even when dealing with hypernyms or internal nodes.
Difference in MRR Scores Between the easy and hard Subsets for Each Taxonomy Learning Subtask, Computed as MRR(easy) − MRR(hard).
Positive Values (Highlighted in Green) Indicate that the Model Performed Better on the easy Subset; Negative Values (Highlighted in Red) Indicate Better Performance on the hard Subset. Results are shown for Three Input Variants: WordNet Numbers, Lemmas, and Definitions.
We believe the results of the ablation study suggest that the model tends to predict less common words more accurately. This could be because the candidate pool for these terms is smaller, allowing the model to focus more directly on the correct answers. Additionally, the model likely encounters these rare words less frequently and typically within consistent, specific contexts, which might enhance its predictive accuracy for such terms.
In this section, we describe the application of TaxoLLaMA to the taxonomy construction task.
We test the TaxoLLaMA versions on the downstream task SemEval-2016 Task 13. We use the Eurovoc taxonomies (“Science”, “Environment”) and the WordNet “Food” taxonomy from SemEval-2016 (Bordea et al., 2016). These datasets are commonly used as a benchmark for testing models’ abilities in taxonomy construction. As mentioned in Section 3.1.1, the test set was deliberately excluded during TaxoLLaMA training.
To create the taxonomy, we use an uncertainty-based ranking approach. This technique involves assessing the hypernymy relation through perplexity calculations, where a lower perplexity score indicates a stronger relationship. We calculate the perplexity for every possible edge and retain only those below an optimal threshold, determined via brute-force search over a predefined grid. We have not used definitions, as it is infeasible to generate them in this setting. We also apply self-refinement based on hypernymy perplexity to resolve self-loops and delete multiple parental edges, which is further discussed in Section 5.2 and Section 5.4.
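A simplified sketch of this uncertainty-based ranking is shown below. The `ppl` function stands in for the LLM's perplexity scorer, and the stub scores are invented for illustration; in practice the threshold is chosen by grid search:

```python
def build_taxonomy(terms, ppl, threshold):
    """Keep every directed edge (child -> parent) whose hypernymy
    perplexity falls below `threshold` (lower = stronger relation)."""
    return [(c, p) for c in terms for p in terms
            if c != p and ppl(c, p) < threshold]

# Stub scores imitating model perplexities for (child, parent) pairs.
SCORES = {
    ("tiger", "feline"): 3.1,
    ("feline", "animal"): 2.8,
    ("tiger", "animal"): 6.5,
    ("animal", "tiger"): 40.0,
}
ppl = lambda c, p: SCORES.get((c, p), 50.0)

print(build_taxonomy(["tiger", "feline", "animal"], ppl, threshold=5.0))
# -> [('tiger', 'feline'), ('feline', 'animal')]
```

Note how the plausible-but-distant edge ("tiger", "animal") is filtered out by the threshold, leaving only the direct parent links.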
Results
In Table 7, we showcase the F1-scores for the Science, Environment, and Food datasets. We evaluate our three model versions against earlier methods.
F1 Scores for the Taxonomy Construction Task on Three SemEval-2016 Domain-Specific Datasets: Science, Environment, and Food.
Our TaxoLLaMA model versions outperform previous approaches on the Environment and Food domains and achieve competitive results on the Science domain. Bold indicates the best result, and underlined values mark the second-best. Notably, our models are trained solely on WordNet and do not rely on domain-specific taxonomies, yet still generalize well across unseen taxonomic structures.
Table 8 presents F1 scores alongside the Fowlkes & Mallows (F&M) index for evaluating the performance of our taxonomy construction approach across the three domains described above. While the F1 score captures the harmonic mean of precision and recall, it does not fully reflect structural alignment in hierarchical tasks. The F&M index, which measures pairwise clustering similarity, is included to provide a complementary perspective on how well the predicted taxonomies preserve the underlying hierarchical structure.
F1 Scores in Comparison to Fowlkes & Mallows Index for Taxonomy Construction Task.
Our results indicate that our method outperforms all existing models on the Environment and Food domains and ranks second on the Science domain. The top-performing approach for the “Science” dataset, Graph2Taxo (Shang et al., 2020), achieves its best score through a GNN-based cross-domain transfer framework, specifically during their ablation study. Interestingly, the framework’s default setup does not produce the highest scores (refer to Shang et al. (2020) (pure) in Table 7). It is also clear that zero-shot LMs performed the worst on average, underscoring the need for task-specific fine-tuning and stronger models (Jain & Espinosa Anke, 2022).
Typically, having multiple parent nodes in taxonomies and ontologies is rare, usually with no more than three parents. We analyzed how our LLM constructs the graph across various thresholds, with the results presented in Table 9. The findings show that assigning multiple parents is common when using non-optimal thresholds, and while less frequent, it still occurs with optimal thresholds.
Distribution of Parent Counts Across Graph Types and Subsets.
We addressed the issue of multiple parent nodes using several techniques:
The results in Table 10 indicate that most of the self-refinement methods for handling multiple parents improve the quality of the graph in comparison to the baseline, but are still worse than the best (Perplexity) method. The simple perplexity rule proves to be the most effective. We believe this is due to the LLM’s stronger ability to encode hypernym relations, while its synset mixing capability is less developed, likely due to limited data during pretraining.
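The perplexity rule for pruning multiple parents can be sketched as follows, again with a mock perplexity function in place of the LLM:

```python
def prune_parents(edges, ppl, max_parents=1):
    """For each child with several parents, keep only the `max_parents`
    edges with the lowest hypernymy perplexity."""
    by_child = {}
    for child, parent in edges:
        by_child.setdefault(child, []).append(parent)
    pruned = []
    for child, parents in by_child.items():
        parents.sort(key=lambda p: ppl(child, p))  # most plausible first
        pruned += [(child, p) for p in parents[:max_parents]]
    return pruned

# Mock scores: "tiger" has two candidate parents; the direct one wins.
edges = [("tiger", "feline"), ("tiger", "animal"), ("feline", "animal")]
ppl = lambda c, p: {("tiger", "feline"): 3.1, ("tiger", "animal"): 6.5,
                    ("feline", "animal"): 2.8}[(c, p)]
print(prune_parents(edges, ppl))
# -> [('tiger', 'feline'), ('feline', 'animal')]
```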
Differences in F1 Scores for Different Methods for Self-Refinement of LLM in Comparison with the Perplexity Method (best). Baseline Refers to the Taxonomy Construction Without any Refinement Strategy.
In this section, we explore how an LLM can validate edges for both hypernymy and hyponymy relations. After constructing the graph using hypernymy, we investigate the impact of removing edges that fall above the hyponymy threshold on the overall quality. The results presented in Figure 4 demonstrate that using hyponymy for graph refinement can be beneficial, though it requires careful calibration. However, it is not as effective as the refinement techniques used for resolving multiple parental nodes.

The graph for hypernym–hyponym validation for Science and Environment.
Cycles are typically rare in taxonomies, and self-loops should not exist at all, as they contradict the fundamental structure of taxonomies. To address self-loops and larger cycles, we primarily use the perplexity rule, similar to the approach described in Section 5.2, by removing the edge with the highest perplexity.
We also considered eliminating cycles involving three or more nodes by leveraging the LLM’s ability to evaluate the insertion of a node followed by the deletion of the least probable connection. However, cycles with three or more nodes are rare when using optimal thresholds and are not included in our analysis, as they consistently result in lower scores compared to the optimal threshold.
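A minimal sketch of this perplexity-based cycle resolution, covering self-loops and two-node cycles, with mock scores in place of the LLM's perplexities:

```python
def break_cycles(edges, ppl):
    """Drop self-loops outright; for each two-node cycle keep only the
    direction with the lower hypernymy perplexity (ties keep both in
    this simplified sketch)."""
    edges = [(c, p) for c, p in edges if c != p]  # self-loops are invalid
    kept = []
    for c, p in edges:
        if (p, c) in edges and ppl(p, c) < ppl(c, p):
            continue  # the reverse edge is more plausible -- drop this one
        kept.append((c, p))
    return kept

# Mock scores: the "tiger IS-A feline" direction is far more plausible.
ppl = lambda c, p: {("tiger", "feline"): 3.1,
                    ("feline", "tiger"): 20.0}.get((c, p), 5.0)
edges = [("tiger", "tiger"), ("tiger", "feline"), ("feline", "tiger")]
print(break_cycles(edges, ppl))  # -> [('tiger', 'feline')]
```

Longer cycles would need an additional search for the highest-perplexity edge along the cycle, as described above for the rare three-node case.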
The results in Table 10 show an overall improvement with this procedure, particularly in the scientific domain, where closely related concepts are more likely to form loops.
Taxonomy Construction Strategies
Hypernymy Vs Hyponymy
The experiment presented in Table 11 shows that predicting hypernyms performs significantly better than predicting hyponyms, which is coherent with the scores for the respective subtasks during the fine-tuning step.
Results for the Downstream TexEval-2 Task Comparing Different Fine-Tuned Models, Methods for Graph Construction, and Templates for Model Inputs. Hyper Approach Stands for Hypernym Prediction and Hypo for Hyponym Prediction.
We explored two techniques for building a taxonomic graph. For both of them, we traverse a predefined grid and select the best threshold based on evaluation metrics. However, the search spaces differ: the
Results in Table 11 show that brute-force search outperformed the DFS-style approach. This could be due to error accumulation during graph traversal: an incorrect decision in the first couple of levels significantly limits the possible edge space.
Prompt
We ablated the prompt by adding lemmas, an empty lemma, or the specific WordNet number with the corresponding models. For prompting with lemmas (as we have no additional lemmas, unlike in WordNet), we tried two approaches (duplicating the lemma in the listing; providing no lemma at all):
Results in Table 11 show that the best result is obtained with either an empty lemma or technical numbers. We believe the model can be distracted when the lemma is repeated, hence the lower scores. Unexpectedly, the model with the WordNet number outperformed on Environment and showed a strong result on Science, possibly due to the more straightforward task format.
Hypernym Discovery
We evaluate TaxoLLaMA on the hypernym discovery task from SemEval-2018 (Camacho-Collados et al., 2018) using a generative approach. This task includes an English test set for general hypernyms, as well as two domain-specific sets for “Music” and “Medical.” Additionally, there are general test sets available for Italian and Spanish. The performance is assessed using the Mean Reciprocal Rank (MRR) metric. We employ a zero-shot approach, where the model is tested without fine-tuning on the training datasets. Notably, the test set is distinct from WordNet and may require multiple hops to reach hypernyms, making it suitable for both general and narrow domains.
Results
The results for the English language, presented in Table 12, show that both the fine-tuned
MRR Performance on Hypernym Discovery. * Refers to the Systems that Rely on the Provided Dataset only, without LLM Pretraining or Additional Data being Used. Zero-shot is Trained on the WordNet Data only, Without Fine-Tuning on the Target Dataset.
In the case of Italian and Spanish, the fine-tuned model exceeds previous SOTA results. This success might be attributed to the model’s inherent multilingual capabilities, given that LLaMA-2 was initially designed to be multilingual, even though fine-tuning was conducted solely on English pairs. However, the zero-shot performance reveals challenges in generating accurate hypernyms for languages other than English. It is important to note that Italian and Spanish data were not part of the instruction tuning dataset.
Zero-shot Performance
To better understand the underperformance in zero-shot scenarios, we analyzed the impact of fine-tuning across different domains and languages, as depicted in Figure 5(a). The analysis shows that, apart from task 2B, the model surpasses previous SOTA results with as few as 50 samples for fine-tuning. Furthermore, the varying scores emphasize the model’s sensitivity to the quality and characteristics of the training data.

Experiments for domain and language adaptation on the hypernym discovery datasets. (a) Fine-tuning and (b) Few-shot learning
We further investigated the few-shot learning approach for Italian and Spanish to evaluate the model’s adaptability in an in-context learning setting, as depicted in Figure 5(b). The model surpassed previous SOTA benchmarks for Italian, showing a near-logarithmic improvement with 30 and 50 shots, but did not perform as well for Spanish. We attribute this suboptimal few-shot performance to the 4-bit quantization and the relatively small model size. Smaller models generally underperform on various benchmarks compared to their larger counterparts, as demonstrated by the example of LLaMA-2 (Touvron et al., 2023b). Moreover, smaller or quantized models have limited capacity compared to larger models, a finding supported by earlier research (Egiazarian et al., 2024; Frantar et al., 2022; Lin et al., 2024; Wang et al., 2022). As has already been observed (Lin et al., 2024), the benefits of few-shot learning are less pronounced in quantized models compared to full-precision models.
Taxonomy Enrichment
In this section, we evaluate TaxoLLaMA on the taxonomy enrichment task. Following the methodology of previous studies (Jiang et al., 2022; Zhang et al., 2021), the task is framed as ranking graph nodes by their probability of being the correct hypernym. The aim is to position the correct hypernyms at the top of the ranking, ensuring the node is accurately placed within the taxonomy. In our approach, we utilize the generative method, as shown in Figure 1(b).
The taxonomy enrichment benchmark includes datasets such as WordNet Noun, WordNet Verb, MAG-PSY, and MAG-CS (Jiang et al., 2022; Shen et al., 2020). To maintain consistency with the TaxoExpan test set (Shen et al., 2020), we selected 1,000 nodes from each dataset. In line with Jiang et al. (2022), we utilize scaled MRR (Ying et al., 2018) as the key evaluation metric. This metric is derived by multiplying MRR by 10 and then averaging it across all correct hypernyms associated with each node.
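Following the description above, the scaled metric can be sketched as below; the exact averaging conventions are our reading of the description, and the ranked lists are illustrative.

```python
def scaled_mrr(ranked_candidates, gold_hypernyms):
    """Scaled MRR: for each query node, average the reciprocal rank across
    all of its correct hypernyms, multiply by 10, then average over query
    nodes. Ranks are 1-based; gold hypernyms absent from the ranking
    contribute 0."""
    per_query = []
    for candidates, gold in zip(ranked_candidates, gold_hypernyms):
        pos = {c: r for r, c in enumerate(candidates, start=1)}
        rrs = [1.0 / pos[g] if g in pos else 0.0 for g in gold]
        per_query.append(10.0 * sum(rrs) / len(rrs))
    return sum(per_query) / len(per_query)

# One query node with two gold hypernyms, found at ranks 1 and 4:
s = scaled_mrr([["science", "x", "y", "field"]], [{"science", "field"}])
# 10 * (1/1 + 1/4) / 2 = 6.25
```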
To improve disambiguation, we created definitions for MAG datasets that lacked predefined explanations, either by generating them with ChatGPT or retrieving them from Wikidata. We utilized the ChatGPT 3.5 web interface and the “gpt-3.5-turbo” model, both from February 2024, for generating these definitions. The prompts used and the statistics related to the generated definitions are provided in Appendix A, specifically in Examples 9-10 and Table 16. This step is essential, as missing definitions can lead to a decrease in model performance, as highlighted in Moskvoretskii et al. (2024b).
Results
The results in Table 13 indicate that our model outperforms all previous approaches on the WordNet Noun and WordNet Verb datasets. However, it falls short of the current SOTA method on the more specialized MAG-CS and MAG-PSY taxonomies, even with fine-tuning. Interestingly,
Scaled MRR Across Tasks for Taxonomy Enrichment. Here, “n/a” Stands for “not Applicable”, as TaxoLLaMA has Already Seen WordNet Data and its Performance Cannot be Considered as Zero-Shot. Zero-shot is Trained on the WordNet Data only, Without Fine-Tuning on the Target Dataset.
In this section, we show the application of TaxoLLaMA to the lexical entailment task. For our evaluation, we rely on the Hyperlex benchmark (Vulić et al., 2017) alongside the ANT entailment subset (Guillou & de Vroe, 2023), which serves as a detailed refinement of the Levy/Holt dataset (Holt, 2019).
ANT
This dataset features sentence pairs that differ by a single argument within their syntactic structure (e.g., “The audience
The ranking method here is enriched with confidence scores. The confidence score is the ratio between the forward and reversed perplexity: the forward perplexity is the regular one, while the reversed perplexity is obtained by swapping the hypernym and hyponym roles.
Based on these confidence scores, entailment relations are assessed as the ratio of the hypernym to hyponym ranking scores, normalized by the L2 norm to estimate the probability of entailment. For example, we compute the perplexity score of “move” as a hypernym of “walk” (
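These two computations can be sketched as follows. The perplexity values are placeholders, and the exact role of each score in the ratio reflects our reading of the description rather than the reference implementation.

```python
import math

def confidence(ppl_forward, ppl_reversed):
    """Ratio of forward to reversed perplexity: values well below 1 indicate
    that the forward direction (hyponym -> hypernym) is the more plausible
    reading of the pair."""
    return ppl_forward / ppl_reversed

def entailment_probability(score_hyper, score_hypo):
    """Ratio of the hypernym score to the pair's L2 norm, used as an
    estimate of the probability of entailment."""
    norm = math.hypot(score_hyper, score_hypo)
    return score_hyper / norm

# e.g., "walk -> move": forward perplexity 4.0, reversed 16.0.
c = confidence(4.0, 16.0)             # 0.25: forward direction preferred
p = entailment_probability(3.0, 4.0)  # 3 / sqrt(3^2 + 4^2) = 0.6
```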
Additionally, we developed TaxoLLaMA
HyperLex
This dataset is designed to assess entailment for both verbs and nouns, using a scale from 0 to 10. A score of 0 signifies no entailment, whereas a score of 10 represents strong entailment. The objective is to maximize correlation with the gold-standard scores. For this dataset, we apply the ranking approach directly, without any additional processing or use of confidence scores.
Earlier approaches typically generate embeddings and then train a basic SVM on the Hyperlex training set. Fine-tuned models, such as RoBERTa, require significant computational resources and are specifically adapted to the Hyperlex dataset. In contrast, our zero-shot model utilizes perplexities directly as predictions, eliminating the need for any additional training. As a result, direct comparisons may not fully account for the distinct methodologies and resource demands, highlighting the importance of evaluating each method within its own specific context.
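Using perplexities directly as predictions amounts to correlating, for instance, negative perplexity with the gold 0-10 scores. A self-contained Spearman sketch (the closed form below assumes no tied values; the data is illustrative):

```python
def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)),
    valid when neither sequence contains ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Lower perplexity should mean stronger entailment, so negate it before
# correlating with the gold scores (values are illustrative).
perplexities = [2.0, 8.0, 4.0]
gold_scores = [9.5, 1.0, 6.0]
rho = spearman([-p for p in perplexities], gold_scores)  # 1.0: same ordering
```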
Results
Results on the ANT Dataset
The results presented in Table 14(a) compare our models with previous SOTA performances on the ANT dataset. A significant observation is the clear disparity in performance between
Results of Experiment for the Lexical Entailment Tasks on the ANT (a) and HyperLex (b) Datasets.
Table 14(a) reveals that
Results on the HyperLex Dataset
Table 14(b) highlights the effectiveness of our model, outperforming the previous SOTA in a zero-shot scenario for the “Lexical” subset and securing second place for the “Random” subset. Interestingly, while most models tend to perform better on the random subset, our approach deviates from this trend, indicating that the larger training size of the random subset may provide greater advantages to other methods. Despite the simplicity of our zero-shot method, it still delivers impressive results. Future research could investigate incorporating this score as a meta-feature in task-specific models, or refining our entire model for better alignment.
In this section, we examine the errors produced by the
Hypernym Discovery and Taxonomy Enrichment
Since we use the same generative approach for both hypernym discovery and taxonomy enrichment, we conduct a combined error analysis. This process is divided into four steps: (i) conducting a manual review to pinpoint the most frequent errors; (ii) performing an automatic error analysis using ChatGPT; (iii) comparing and consolidating the common errors identified; and (iv) classifying these errors with the help of ChatGPT.
We begin by selecting approximately 200 random samples from both the hypernym discovery and taxonomy enrichment datasets and provide explanations for the model’s failure to generate the correct hypernym. Through this process, we identify four categories of errors: (i) predicted hypernyms are excessively broad; (ii) the definition is incorrect or irrelevant; (iii) the model fails to produce relevant candidates within the same semantic domain; (iv) miscellaneous cases that do not fit into the other categories.
We further provide in the Appendix the prompt used to request that ChatGPT generate potential error types (Example 11), the resulting output (Example 12), and Table 17 summarizing the error types identified across multiple runs. Afterward, we combine the error types identified both manually and automatically into the following categories:
To classify incorrectly predicted instances, we used the prompt provided in Appendix A, as shown in Example 13. The outcomes for each task and dataset are detailed in Table 18 and Figure 6(a) in Appendix B, which illustrate the average error distribution. Additionally, Table 19 includes an example corresponding to each type of error. The most prevalent problem, affecting 75% of the cases, is the prediction of overly broad concepts. This issue is likely due to the model’s adaptation to domain-specific datasets that are more expansive than WordNet, such as those in the “Music” and “Medical” domains.

(a) Average percentage of error types across hypernym discovery and taxonomy enrichment datasets. (b) Automatic evaluation of the MAG datasets using the ChatGPT model. The label
In the case of Italian and Spanish, substantial inaccuracies were primarily due to the grammatical complexities inherent in these languages, compounded by dataset limitations, linguistic nuances, and insufficient pre-training data. Likewise, the MAG datasets encountered challenges related to specificity and ambiguity, which resulted in
A manual review of the MAG taxonomies reveals misclassifications, such as “olfactory toxicity in fish” being incorrectly categorized as a hyponym of “neuroscience.” To further evaluate the accuracy of the predicted hypernyms, we leveraged ChatGPT, drawing inspiration from recent research (Rafailov et al., 2023). We provided ChatGPT with the input queries, predicted nodes, and ground truth nodes, asking for a preference. As shown in Figure 6(b), ChatGPT often preferred neither of the options, with ground truth hypernyms being favored only slightly more often than the predicted ones. An example of the input query used is detailed in Appendix A, Example 14.
Our evaluation of the overlap between the MAG datasets and WordNet data reveals that they have little in common. Specifically, only 5% of the nodes in the MAG graph are also found in the WordNet graph. The overlap is even less in terms of edges, with only 2% in the CS domain and 4% in the PSY domain matching WordNet connections. Additionally, 92% of the identified connections lack any corresponding path within the WordNet structure. Among the connections that do overlap, we discovered that 28% in CS and 10% in PSY mistakenly identify nodes as their own hypernyms. These disparities highlight why
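The overlap statistics above can be computed with plain set operations plus a directed reachability check. A sketch with toy graphs (the real comparison uses the MAG and WordNet node and edge sets):

```python
from collections import deque

def overlap_stats(graph_a, graph_b):
    """graph_* are sets of (hyponym, hypernym) edges. Returns the fraction of
    A's nodes found in B, the fraction of A's edges found in B, and the
    fraction of A's edges whose endpoints have no directed path in B."""
    nodes = lambda g: {n for e in g for n in e}
    na = nodes(graph_a)
    node_overlap = len(na & nodes(graph_b)) / len(na)
    edge_overlap = len(graph_a & graph_b) / len(graph_a)

    adj = {}
    for child, parent in graph_b:
        adj.setdefault(child, set()).add(parent)

    def reachable(src, dst):
        seen, queue = {src}, deque([src])
        while queue:
            cur = queue.popleft()
            if cur == dst:
                return True
            for nxt in adj.get(cur, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    no_path = sum(1 for c, p in graph_a if not reachable(c, p)) / len(graph_a)
    return node_overlap, edge_overlap, no_path

# Toy example: one shared edge, one edge B only connects via a two-hop path.
a = {("dog", "animal"), ("dog", "mammal")}
b = {("dog", "mammal"), ("mammal", "animal")}
stats = overlap_stats(a, b)  # (1.0, 0.5, 0.0)
```

Self-loops of the kind reported above (nodes listed as their own hypernyms) would appear as edges `(x, x)` in the edge sets.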
In our final analysis, we visualized the embeddings, which highlighted a clear divergence between the predicted outcomes and the actual ground truth within the MAG subsets—a divergence that was not observed in the WordNet data. Detailed findings from this visualization are discussed in Appendix C.
Our detailed assessment of the predicted graphs across different domain datasets, based on the data in Table 15, reveals consistent trends. In most cases, the gold standard graphs exhibit a higher number of edges, except for the environment domain. Interestingly, the model tends to miss entire clusters of nodes rather than isolated ones: around 30% of the nodes in the
Statistics of Original Graph and the Constructed Graph with Highest F1 Score. The Lower Part of the Table Corresponds to Constructed Graph Statistics.
Statistics on Definitions Generated with ChatGPT for Different Tasks.
12 Error Types Made by
Errors Type Distribution Across Subset Datasets for Hypernym Prediction: Hypernym Discovery and Taxonomy Enrichment.
Examples for each Error Type Made by
Although some paths generated by the model are highly accurate, its overall performance is inconsistent—either perfectly on target or completely off course. Frequently, paths with high perplexity are mistakenly discarded, suggesting the model struggles particularly with concepts that are neither highly specific nor overly broad but fall somewhere in the middle of the taxonomy.
This issue is exacerbated by the use of perplexity as a relative metric, where some edges are excluded because they exceed the defined perplexity threshold. However, adjusting the threshold to be more lenient can lead to the creation of incorrect edges. This challenge highlights the need to explore alternative methods, such as employing LLMs as embedding tools, to improve the model’s performance.
Our review of the ANT dataset revealed that it comprises nearly 3,000 test samples but only 589 distinct verbs. This suggests that errors associated with a single verb could potentially be repeated multiple times throughout the dataset. However, when we looked at the overlap with WordNet, we found that only 7 of these verb forms matched.
After lemmatization, the number of unique verbs increases to 338, yet around 42% still cannot be found in WordNet. Moreover, for the verbs that do exist in WordNet, no corresponding paths were identified, which may have negatively impacted the model’s performance in this task.
Hyperlex offers more favorable statistics, with nearly 50% of the words being unique and 88% included in WordNet. However, only 27% of the word pairs are represented in the taxonomy, and 99% of these pairs are missing a connecting path.
Perplexity-related errors tend to have high values when dealing with polysemous pairs, such as “spade is a type of card,” and low values for synonyms or paraphrases, which indicates semantic closeness without implying a hypernymy relationship. This suggests that the model struggles with lexical diversity and ambiguity, highlighting the necessity of robust disambiguation capabilities in entailment tasks. Additional details are provided in Appendix D.
Conclusion
In this article, we comprehensively explored the use of LLMs for learning taxonomic relations, evaluating their effectiveness, and applying them to various downstream tasks. To facilitate taxonomy learning, we developed a dataset collection method using WordNet. Our fine-tuned models achieved state-of-the-art performance across several lexical semantic tasks, including taxonomy construction, hypernym discovery, taxonomy enrichment, and lexical entailment. Specifically, our models secured the top performance in 11 out of 16 tasks and ranked second in 4 others, demonstrating that LLMs are well-suited for solving taxonomy-related challenges.
Additionally, we conducted an extensive ablation study on our model, focusing on the learning of hyponymy by categorizing it into subtypes and levels of difficulty. Our findings show that hyponymy is generally more challenging to learn than hypernymy, particularly for concepts located in the middle of the graph. Furthermore, our results suggest that some taxonomy relations are easier to learn for specialized terminology than for common concepts. The study also highlighted the potential of LLMs to refine existing taxonomies by utilizing multiple learned taxonomic relations to assess the accuracy of edges, which significantly improved overall performance. For the taxonomy construction task, our experiments showed that hypernymy plays a crucial role, and that basic, straightforward brute-force methods currently yield the best results.
Lastly, we carried out an in-depth analysis of model errors, revealing inconsistencies between WordNet and other taxonomies, and underscoring the need to revisit and possibly revise MAG taxonomies due to numerous misaligned relations.
