Introduction
In recent years, using graph structures to model and store data has garnered increasing attention among practitioners in sectors ranging from academia to government to industry. Indeed, by some measures (Jacob, 2021; solidIT, 2022), graph database management systems have been the fastest growing database type over the past decade. One of the more obvious manifestations of this rise is the recent growth of large-scale public graph databases such as DBpedia (Lehmann et al., 2015), YAGO (Pellissier Tanon et al., 2020), and Wikidata (Vrandečić & Krötzsch, 2014). The last of these, for instance, contains just over 100 million entities as of 2024, a nearly sevenfold increase over its count in 2014. Open access to such amounts of graph data has spurred its use in research related to the Semantic Web, artificial intelligence, and computer science broadly. One field of research which has received considerable attention is that of mathematically modeling the underlying graph structure that emerges when a knowledge base is populated with information. The modeling of this structure—which we refer to as the knowledge graph—proves useful in its application to downstream problems such as link prediction, entity clustering, and hierarchy induction. The last two of these provided the impetus for our work.
Entity clustering refers to the task of grouping together entities in a knowledge graph which share similar properties. The measure by which entities are judged to be similar varies and is one of the key considerations when devising an approach to their clustering. Obtaining an entity clustering allows for the discovery of structures which are implicit in the knowledge graph and provides insight into the number and types of categories which exist in the data. The process operates on unlabeled data and is therefore a type of unsupervised learning. As such, it is one of the first and most useful operations applied to a knowledge graph when performing exploratory analysis. Another important unsupervised learning task is that of hierarchy induction. The clearest example of a knowledge graph hierarchy is the class taxonomy which organizes a knowledge graph's classes through superclass-subclass relations. The task of inducing such a taxonomy amounts to learning how the classes are organized hierarchically in the knowledge graph. Similarly, hierarchical clustering of a knowledge graph's entities extends the clustering task described earlier by imposing a hierarchical organization on the clusters themselves. This makes it possible not only to discover which entities are semantically similar, as per the clustering, but also to learn how entities relate to one another hierarchically. The motivating factors behind learning knowledge graph hierarchies are varied. Perhaps the simplest is that hierarchical structures organize data in a way that is highly intuitive and interpretable to humans. For instance, a hierarchical clustering of knowledge graph entities makes it apparent which entities constitute the broadest concepts in the knowledge graph and how they relate to their descendants. Similarly, a taxonomy of classes reveals implicit relations between entities through its transitive properties.
Put plainly, hierarchies induced from knowledge graphs are useful because they are easy to understand. Indeed, the most widely used knowledge bases—such as the aforementioned DBpedia, YAGO, and Wikidata—are organized by hierarchical structures, namely trees and directed acyclic graphs. That is to say, these knowledge graphs are hierarchical at their core. Furthermore, hierarchies are used as components of larger systems to solve common tasks related to knowledge graphs. For instance, hierarchies are used in learning knowledge graph embeddings, both explicitly as an input feature of the model (Xie et al., 2016) and implicitly as a byproduct of the embedding process (Zhang et al., 2020). As embedding is one of the most common problems in the knowledge graph community, learning accurate hierarchies is therefore desirable.
In this regard, our work proposes a generative model for knowledge graphs which induces a clustering of entities and organizes it hierarchically. Our approach belongs to a class of probabilistic graphical models called stochastic blockmodels. In broad strokes, these models operate by decomposing a knowledge graph into a set of probability distributions which are then sampled from to generate the knowledge graph. As a byproduct of this sampling process, a hierarchical clustering of knowledge graph entities is induced. To the best of our knowledge, our approach is the first to apply stochastic blockmodels to knowledge graphs and one of very few probabilistic graphical models to be used for the purpose of knowledge graph hierarchy induction. To highlight this, we position our work in the context of existing stochastic blockmodels and hierarchy induction methods in Section 2 and provide a gentle introduction for their understanding in Section 3. The formal definition of our model that follows in Section 4 results in a joint distribution which is intractable for exact inference. The parameters of our model must therefore be approximated using collapsed Gibbs sampling. To this end, we provide the full derivation of sampling equations as well as the marginalization of collapsed variables. Additional information to supplement Section 4 may be found in Appendices A through D. Section 5 evaluates our model quantitatively and qualitatively on three datasets. Section 6 concludes the article by summarizing its contributions and providing avenues for future work.
Related Work
Our proposed model lies at the intersection of two areas in artificial intelligence which deal with modeling graph data: stochastic blockmodeling and hierarchy induction. Due to the limited overlap of these fields, we provide separate summaries of related works for each.
Stochastic Blockmodels
Stochastic blockmodels are a class of probabilistic graphical models used for generating random graphs with roots in the fields of social science and mathematics. First proposed in 1983 by Holland et al. (1983) for modeling social networks, they have expanded their utility to fields such as biochemistry (Wang et al., 2023), education (Sweet, 2019), and artificial intelligence (Airoldi et al., 2008; Ho et al., 2011; Zhang et al., 2022) among others. In simplest terms, stochastic blockmodels are a type of Bayesian non-parametric graph partition model in that their approach relies on grouping graph entities together via partitions—often referred to as blocks—which share similar structural properties. The generative process by which this partitioning occurs is realized by sampling from a set of probability distributions, giving rise to the stochasticity of stochastic blockmodels. The learning process is then to infer the parameters of these distributions using a Bayesian inference scheme. We provide a technical introduction to stochastic blockmodels in the subsequent section.
The seminal work in this area is the Stochastic Blockmodel (Nowicki & Snijders, 2001) which partitions entities into a fixed number of communities and models the interactions between them as those of their communities. Community relations are modeled via a community relations matrix which assigns a degree to all pairwise interactions between the communities in the model. This idea was extended to the infinite case, allowing for an a priori unspecified number of communities via the Chinese restaurant process (Aldous, 1985), in the Infinite Relational Model (Kemp et al., 2006) and its recent hierarchical counterpart, the Hierarchical Infinite Relational Model (Saad & Mansinghka, 2021). A variant which relaxes the notion of community membership to allow for entities belonging to multiple communities is the aptly named Mixed Membership Stochastic Blockmodel (Airoldi et al., 2008). By allowing for mixed membership, the model is better able to capture entities whose belonging to a community is not crisp. For instance, the membership of tomatoes in the community of fruits is not perfect since a tomato can be considered a vegetable in certain contexts such as cooking. This idea was generalized to the infinite case in the Dynamic Infinite Mixed Membership Stochastic Blockmodel (Ding et al., 2021a) and the hierarchical case in the Multiscale Community Blockmodel (Ho et al., 2011). The latter of these two is closely related to our model and receives more attention later in the article. All of the aforementioned models, however, operate on graphs wherein entities are related to one another through the same type of edge, making them unsuitable for application to knowledge graphs without modification.
The underlying structure of a knowledge graph is that of a multilayer graph wherein entities interact with one another through different types of relations, represented as different types of edges in the graph. These relations may be thought of as separate layers of graphs which share the same entities. Multilayer graphs have also received considerable attention in stochastic blockmodeling. Perhaps the simplest approach is to aggregate the layers in the multilayer graph into a single layer before applying a conventional blockmodeling approach, as was done in Berlingerio et al. (2011). A closely related approach is to model each layer in the graph independently, as done in Barigozzi et al. (2011), and aggregate the results afterwards. These approaches offer limited success as they do not capture the interlayer dependencies in the multilayer graph and treat each layer as equally valuable in its content during modeling, as pointed out by Paul and Chen (2016). To remedy this, the authors propose a multilayer extension of the aforementioned Stochastic Blockmodel, aptly named the Multilayer Stochastic Blockmodel, which modifies the original community relations matrix into a community relations tensor to account for graph multilayeredness. Analogously, a multilayer extension of the Mixed Membership Stochastic Blockmodel was proposed by De Bacco et al. (2017). Finally, the Multilayer Neural Blockmodel (Pietrasik & Reformat, 2021a) was proposed recently as a way to marry neural networks with the probabilistic approach of stochastic blockmodels for modeling multilayer graphs. A comprehensive review of stochastic blockmodels and their applications is provided by Lee and Wilkinson (2019).
Hierarchy Induction Models
In the context of our work, hierarchy induction refers to the discovery of hierarchical structures which are implicit and otherwise unexpressed in a knowledge graph. One concrete way this task is formulated is as that of learning subsumption axioms for classes in a knowledge graph, thereby discovering a hierarchical organization of a knowledge graph's entities. To this end, Statistical Schema Induction (Völker & Niepert, 2011) uses association rule mining on a knowledge graph's transaction table to generate subsumption axioms with support and confidence values which are then used as the basis for a greedy algorithm for constructing an ontology. SMICT (Pietrasik & Reformat, 2020) transforms a knowledge graph into a tuple structure wherein entities are annotated by tags and applies a greedy algorithm to learn a taxonomy of classes. This method was extended to perform hierarchical clustering using the Jaccard coefficient (Pietrasik & Reformat, 2021b). More recently, the Non-Parametric Path Based Model (Pietrasik et al., 2024) leveraged a tuple structure in conjunction with a discretized prior rooted in the nested Chinese restaurant process to obtain state-of-the-art results. Despite its strong performance, the optimization scheme that was proposed did not scale well to large-scale knowledge graphs. In general, by transforming a knowledge graph into a tuple structure, various methods in the area of tag hierarchy induction (Heymann & Garcia-Molina, 2006; Schmitz, 2006; Wang et al., 2018) can be leveraged. In a related approach, Chen and Reformat (2014) derive a similarity matrix from a knowledge graph's tuple structure which serves as the clustering metric for hierarchical agglomerative clustering. Mohamed (2019) takes a similar approach wherein subjects which are described by the same tag pairs are assigned to the same groups. The similarity between these groups is then calculated to construct a hierarchy.
In a method which bears similarity to our own, Zhang et al. (2022) use a non-parametric Bayesian approach to induce a hierarchy of topic communities. Despite a similar statistical framework and inference scheme, the hierarchy induced by this work differs significantly from our own. For instance, relations between communities are not modeled and entities are never explicitly assigned to communities. Along similar lines is GMMSchema (Bonifati et al., 2022) which uses a Gaussian mixture model to generate a schema graph which can be viewed as a hierarchical abstraction of the original knowledge graph.
Another common approach to learning hierarchies from knowledge graphs is via an intermediate representation which lends itself well to existing hierarchy induction methods. To this end, knowledge graph embedding is oftentimes leveraged. This process involves learning a mapping from the discrete knowledge graph to a continuous vector space. The vector representation may then serve as the input to machine and deep learning methods for hierarchy learning. Translation based methods such as the seminal TransE (Bordes et al., 2013) and its extensions (Ji et al., 2015; Lin et al., 2015; Wang et al., 2014) treat relations in a knowledge graph as translations between entities. Additive in nature, they operate on the intuition that the embeddings of subjects and objects should be proximal when translated by the relation of a valid triple. These embeddings are learned by minimizing an objective function using an optimization method such as stochastic gradient descent. Bilinear methods (Balažević et al., 2019; Kazemi & Poole, 2018; Nickel et al., 2011; Yang et al., 2015) operate on the binary adjacency tensor of the knowledge graph and factorize entities and relations into vectors and matrices. Triples are then modeled as their resulting product. These methods tend to outperform translation based methods on standard benchmarks but suffer from higher training complexity. Deep learning models have also been proposed in the context of knowledge graph embeddings. For instance, the Relational Graph Convolution Network (Schlichtkrull et al., 2018) leverages graph convolutions to learn neighborhood information of entities, thereby explicitly incorporating structural information into its modeling. Another widely used deep approach, ConvE (Dettmers et al., 2018), stacks subject and predicate embeddings as a matrix and convolves over them in two dimensions using a neural framework.
This approach was extended in ConvKB (Nguyen et al., 2018) which incorporates objects into the convolution process and CapsE (Vu et al., 2019) which uses a similar architecture with capsule layers to yield scores for triples. A recent and comprehensive comparative analysis of various embedding methods may be found in Rossi et al. (2021).
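The translation based intuition described earlier, namely that a subject embedding translated by its relation embedding should land near the object embedding for a valid triple, can be sketched in a few lines. This is an illustrative sketch only, not any particular published implementation; the four-dimensional embeddings and the choice of L1 distance are arbitrary demonstration values.

```python
import numpy as np

def transe_score(s, r, o, norm=1):
    """Score a triple under the translation intuition: the subject
    embedding, translated by the relation embedding, should land near
    the object embedding. Lower scores indicate more plausible triples."""
    return np.linalg.norm(s + r - o, ord=norm)

# Hypothetical embeddings, chosen so the valid triple translates exactly.
s = np.array([0.1, 0.4, -0.2, 0.3])
r = np.array([0.2, -0.1, 0.5, 0.0])
o = np.array([0.3, 0.3, 0.3, 0.3])

valid = transe_score(s, r, o)      # s + r lands exactly on o
corrupt = transe_score(s, r, -o)   # corrupted object, larger distance
assert valid < corrupt
```

Training then amounts to minimizing such scores for observed triples while keeping scores of corrupted triples high, typically via a margin-based loss and stochastic gradient descent.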
Having obtained an embedded representation of a knowledge graph, hierarchical clustering methods can be applied to induce a hierarchy. For instance, RESCAL (Nickel et al., 2011), a bilinear embedding method, was used in conjunction with OPTICS (Ankerst et al., 1999), a density based hierarchical clustering algorithm, in Nickel et al. (2012) to obtain a hierarchical clustering of entities. The authors found that such an approach achieves more coherent results for concepts which appear at the top of the hierarchy, largely due to data sparsity for descendant concepts. Along similar lines, TIEmb (Ristoski et al., 2017) generates embeddings using RDF2Vec (Ristoski & Paulheim, 2016), an embedding method based on the skip-gram language model (Mikolov et al., 2013), before learning a hierarchical structure based on the proximities of class centroids in the embedded space. The same embedding approach was used in Martel and Zouaq (2021) wherein the embeddings were then clustered using hierarchical agglomerative clustering and assigned types. This type of clustering was used in the field of cybersecurity in Ding et al. (2021b) wherein a bag-of-words representation of a knowledge graph served as input. Compared with the aforementioned subsumption axiom induction methods, which rely largely on frequencies and co-occurrences between type classes, embedding based approaches typically embed an entire knowledge graph, thus leveraging a larger and much richer body of information. As such, one would expect embedding based methods to be more robust to datasets which are poorly structured in terms of their type information. To the best of our knowledge, there is yet to be an analysis performed which compares the two approaches.
Preliminaries
Before describing the details of our proposed model, we provide a basic overview of several concepts necessary for its understanding. These concepts are described only insofar as to provide readers with the foundation on which the explanation of our model can be built. We encourage readers unfamiliar with knowledge graphs or Bayesian non-parametrics to follow the relevant citations provided in each of the subsequent subsections. To aid in readability we use the following conventions in our notation: lowercase italic Latin letters for iterators and indexers; uppercase italic Latin letters for scalar variables; lowercase boldface Latin letters for vectors; uppercase boldface Latin letters for matrices and tensors; uppercase stylized Latin letters for sets; lowercase Greek letters for hyperparameters; and uppercase Greek letters for functions.
Knowledge Graphs
We refer to Hogan et al. (2021) for their definition of knowledge graphs as

Toy example of a knowledge graph and how it may be modeled by a stochastic blockmodel. Starting from top left quadrant and proceeding clockwise: graphical representation of a knowledge graph with entities
Stochastic blockmodels are a heterogeneous collection of generative models united in their adoption of two characteristics: stochasticity in the generative process and the partitioning of nodes into communities. Describing them by referring to a concrete instance is thus bound to include definitions which do not apply to all members of the class. With this in mind, our introduction to stochastic blockmodels draws on their key characteristics to motivate a toy stochastic blockmodel for generating a knowledge graph. All stochastic blockmodels are defined by a set of probability distributions from which samples are obtained to generate the adjacency tensor of the knowledge graph.
In step 3.1, variables can be initialized by sampling from their prior distributions or specified explicitly if a priori evidence to suggest their true values exists. Step 3.2 depicts the iterative sampling of model variables from their full conditionals. We note that samples obtained early in this process may be drawn from a distribution distant from the desired stationary distribution. As such, it is necessary to discard the samples obtained before this distribution has been reached. This process is commonly referred to as burning in the Gibbs sampler, and the number of discarded iterations as the burn-in iterations. Furthermore, as successive samples in this process are autocorrelated, a lag period may be applied such that samples obtained during the lag period are also discarded. Thus, if our toy example performs 1000 iterations with a burn-in of 900 and a lag of 10, only 9 samples will be obtained as the output of the Gibbs sampler. These 9 samples are then aggregated over to account for the stochasticity in sampling from the posterior and to arrive at a final result. The process by which these samples are aggregated is model specific and may be as simple as taking the mode of the samples. An introduction to Gibbs sampling and related sampling schemes is provided by MacKay (2002), and a thorough discussion of stochastic blockmodels along with concrete examples is provided by Abbe (2017).
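The burn-in and lag bookkeeping described above can be sketched as follows. We adopt the convention that a lag of 10 discards ten iterations between each pair of retained samples, which reproduces the 9 samples of the toy example; other conventions for counting the lag exist.

```python
def kept_iterations(total_iters, burn_in, lag):
    """Return the Gibbs iteration indices whose samples are retained:
    the first `burn_in` iterations are discarded, and `lag` iterations
    are discarded between each pair of retained samples."""
    kept = []
    for i in range(1, total_iters + 1):
        if i <= burn_in:
            continue  # still burning in the sampler
        if (i - burn_in) % (lag + 1) == 0:
            kept.append(i)
    return kept

# The toy example from the text: 1000 iterations, burn-in of 900, lag of 10.
samples = kept_iterations(1000, 900, 10)
assert len(samples) == 9
```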

The Chinese restaurant process (CRP) (Aldous, 1985) is a discrete stochastic process that yields a probability distribution in accordance with the preferential attachment principle. In this view, it is both a Dirichlet process (Ferguson, 1973) as it generates a probability distribution and a preferential attachment process (Barabási & Albert, 1999) as the distribution is generated such that probabilities are proportional to past draws. The process is explained through a metaphor of seating patrons at a Chinese restaurant. Consider this restaurant as containing an infinite number of tables with each table having the capacity to seat an infinite number of patrons. Patrons are seated sequentially, such that the first patron is seated at the first table and every subsequent patron may be seated at an occupied table or the first unoccupied table. The probability of being seated at an occupied table is proportional to the number of patrons already seated at it. This process is illustrated through the toy example in Figure 2 which shows a potential state of the CRP after seating six patrons along with the sample probabilities of seating the seventh. Formally, the probability of seating patron $n+1$ at table $k$ is

$$P(z_{n+1} = k \mid z_1, \ldots, z_n) = \begin{cases} \dfrac{n_k}{n + \gamma} & \text{if table } k \text{ is occupied} \\ \dfrac{\gamma}{n + \gamma} & \text{if } k \text{ is the first unoccupied table} \end{cases}$$

where $n_k$ denotes the number of patrons seated at table $k$ and $\gamma$ is a concentration hyperparameter.

Toy example of a nCRP truncated to a depth of
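The seating metaphor lends itself to a short simulation. The sketch below is illustrative; `gamma` stands in for the CRP concentration hyperparameter and the helper name is our own.

```python
import random

def crp_seating(num_patrons, gamma, seed=0):
    """Simulate the Chinese restaurant process: each patron joins an
    occupied table with probability proportional to its occupancy, or
    the first unoccupied table with probability proportional to gamma."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of patrons seated at table k
    for _ in range(num_patrons):
        weights = tables + [gamma]  # occupancies, plus the new-table mass
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)  # first patron at a new table
        else:
            tables[k] += 1
    return tables

tables = crp_seating(100, gamma=1.0)
assert sum(tables) == 100
```

Because occupied tables attract patrons in proportion to their size, a small number of large tables tends to emerge, which is precisely the rich-get-richer behavior of preferential attachment.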

The nested Chinese restaurant process (nCRP) (Blei et al., 2010; Griffiths et al., 2003) is an extension of the CRP formulated to account for hierarchical relations between the generated communities. The realization of this process is an infinitely deep and infinitely branching tree of communities defined by a set of paths,
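A truncated version of the nCRP can be sketched by running an independent CRP at every node along an entity's path down the tree. The dictionary representation of the tree and all helper names below are illustrative choices, not the notation of our model.

```python
import random

def ncrp_path(tree, depth, gamma, rng):
    """Sample one path through a nested CRP truncated to `depth` levels.
    `tree` maps a path prefix (a tuple of branch indices) to the
    occupancy counts of that node's children; at each level the entity
    runs a CRP over the children of the node it currently occupies."""
    path = ()
    for _ in range(depth):
        children = tree.setdefault(path, [])
        weights = children + [gamma]  # existing branches, plus a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(children):
            children.append(1)  # open a new branch
        else:
            children[k] += 1
        path = path + (k,)
    return path

rng = random.Random(0)
tree = {}
paths = [ncrp_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
assert all(len(p) == 3 for p in paths)
```

Each sampled path identifies a leaf in the (implicitly infinite) tree, and the shared prefixes of the paths define the hierarchy of communities.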
The stick breaking process (Sethuraman, 1994) is—like the CRP and nCRP—a Dirichlet process that draws its name from a metaphor which describes it. The metaphor starts by breaking a stick of unit length into two fragments at a point in the interval from 0 to 1 as drawn from the Beta distribution. One of the two fragments is preserved and the other fragment is broken again, analogously to the initial stick. This process is repeated an infinite number of times to yield an infinite number of fragments whose combined length is that of the initial stick. These fragments may be viewed as a probability distribution over the infinite sequence of discrete time-steps used to generate them. In other words, the stick breaking process is an infinite extension of the Dirichlet distribution insofar as while the Dirichlet distribution yields a probability distribution over

Toy example of the stick breaking process with values
In describing our proposed model, we will adopt the notations used in the previous section to indicate the connection with the ideas discussed in the preliminaries. To aid in understanding, we first provide a summary of the components of our model before defining the generative process. This is followed by a formalization of the Gibbs sampling procedure and derivation of sampling equations.
Model Description
Like all stochastic blockmodels, our model is defined as a set of probability distributions such that when these distributions are sampled from, they generate the adjacency tensor of the knowledge graph. The choice of these distributions makes assumptions about the underlying structure that governs the graph's interactions. In devising our model, we assume a hierarchy of entity communities which is captured in the form of a tree. The entities in these communities interact with one another as a function of their membership to a community. In other words, interactions are modeled at the community level and extended downwards to their constituent entities. Unlike most stochastic blockmodels, these community relations are modeled with respect to a predicate in the knowledge graph. In other words, the interactions between entities are predicate dependent such that the degree of interaction between entities changes depending on the predicate that links them. This allows the model to capture structures extending beyond those implied by mere interaction density. Thus, in order to generate the knowledge graph's adjacency tensor, we need to know its hierarchical community structure, its entities' memberships to communities, and the interactions between its communities. The induction of these components, which may be seen as a byproduct of the generative process, is the objective of our model. We note that the communities' constituent entities do not conform to is-a relationships as would be implied by the hierarchy. This is because the hierarchy is imposed on the communities themselves as opposed to their constituent entities. An example of this is highlighted in Figure 5 where the entity

Toy example depicting a potential hierarchy induced by our model. The table on right side captures the path and level sampled for each entity in the knowledge graph as well as its corresponding community. The left side provides a visualization of this hierarchy.
Entities are assigned to communities through the conjunction of two variables: entity paths and level indicators. Paths define the tree structure over the community hierarchy by sampling from the nCRP as described in the previous section. We thus denote an entity path as
Having sampled entity paths, in order for entities to be assigned to communities, their levels must be obtained. Entity levels are modeled by two variables in our approach: level memberships and level indicators. Level memberships, denoted
Community Relations
Community relations describe the degree to which entities in any two communities are likely to interact with one another through a specific predicate. In other words, they model the probability of observing a value of one in the knowledge graph’s adjacency tensor. These interactions are captured by a
In restricting the values of community relations we take an approach similar to that of the Multiscale Community Blockmodel. Specifically, we borrow the concept of a sibling group which refers to a set of communities that share the same parent in the hierarchy. Only the community relations between communities in the same sibling group are modeled in our approach. Thus, when obtaining the interaction degree of two entities whose communities have the same parent, it is sufficient to merely access the corresponding value in
Community relations are drawn from the Beta distribution parameterized by

The potential community relations induced by our model on the toy example introduced earlier. The hierarchy on the left of the figure has three sibling groups and three predicates:
The generative process of our model refers to the sequential sampling of components which allow for the generation of the target knowledge graph. In other words, the goal is to draw a binary value for each entry of the knowledge graph's adjacency tensor. The process proceeds as follows:

1. For each entity in the knowledge graph: sample its path and level variables.
2. For each sender community in the hierarchy, each receiver community in the hierarchy, and each predicate in the knowledge graph: sample their community relations.
3. For each sender entity in the knowledge graph, each receiver entity in the knowledge graph, and each predicate in the knowledge graph: sample their entity relation.
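A heavily simplified sketch of this generative process follows. For brevity it substitutes uniform draws for the nCRP and stick breaking priors, uses a fixed binary tree of depth two, and models relations between all community pairs rather than only sibling groups; the variable names (`lam0`, `lam1`, `phi`, and so on) are our own illustrative choices, not those of the formal definition.

```python
import random

rng = random.Random(0)
E, P, DEPTH = 6, 2, 2       # entities, predicates, truncation depth
lam0, lam1 = 1.0, 1.0       # assumed Beta hyperparameters on community relations

# 1. Per-entity variables: a path through a (here: fixed, binary) tree
#    and a level indicator that cuts the path to a community.
paths = [tuple(rng.randrange(2) for _ in range(DEPTH)) for _ in range(E)]
levels = [rng.randrange(DEPTH + 1) for _ in range(E)]
community = [paths[e][:levels[e]] for e in range(E)]

# 2. Community relations: one Beta draw per (sender, receiver, predicate).
comms = sorted(set(community))
phi = {(a, b, p): rng.betavariate(lam0, lam1)
       for a in comms for b in comms for p in range(P)}

# 3. Entity relations: Bernoulli draws fill the adjacency tensor, with
#    each entity inheriting the relation degree of its community.
adj = [[[1 if rng.random() < phi[(community[s], community[o], p)] else 0
         for o in range(E)] for s in range(E)] for p in range(P)]

assert all(v in (0, 1) for layer in adj for row in layer for v in row)
```

Inference then runs this logic in reverse: given the observed adjacency tensor, the paths, levels, and community relations are recovered via collapsed Gibbs sampling as described below.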
We note that this process is unsupervised and does not impose any assumptions about the partition of entities to communities or the structure of the hierarchy other than to limit its depth. In fact, the depth is the only constraint imposed on the generative process. The other hyperparameters which must be specified a priori—namely

Plate diagram for our model. Circles indicate random variables whereas diamonds indicate model hyperparameters. Shading indicates observed variables and lack of shading indicates latent variables.
Collapsed Gibbs sampling refers to an extension of Gibbs sampling in which a subset of model variables are marginalized over and therefore do not need to be sampled directly. These variables are said to be collapsed out of the Gibbs sampler. Collapsing of these variables is done analytically via integration and ensures a faster mixing process. This is because the calculation of probability distributions for sampling is generally computationally expensive. Having fewer variables then leads to a faster arrival at the desired stationary distribution. Furthermore, the calculation of probability distributions which have not been collapsed out of the sampling process is generally faster in collapsed Gibbs sampling. This is because in regular Gibbs sampling draws are made from the full conditionals of variables. In collapsed Gibbs sampling, collapsed variables have been integrated out of the process and the remaining variables are conditioned on a lower-dimensional space. Collapsing of variables is usually tractable when they are the conjugate prior of their dependent variables. In our model, community relations and level memberships are both conjugate priors of their dependent variables, namely level indicators and entity relations, respectively. We leverage these conjugacies to marginalize over these two variables in our sampling process. After marginalization, the sampling equations may be derived for the remaining variables.
Marginalizing Community Relations
In order to marginalize out community relations, it is necessary to find a closed form solution which allows for integration during path sampling. To this end, we can leverage the Bernoulli-Beta conjugacy which ensures that given a Bernoulli likelihood and Beta prior, the posterior will also be drawn from the Beta distribution. Employing this conjugacy is possible due to the formulation of our model in which entity relations are drawn from the Bernoulli distribution and community relations assume a Beta prior. We see this explicitly when applying Bayes’ theorem to obtain the posterior as follows:
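Concretely, writing $x_1, \ldots, x_n$ for the Bernoulli observations generated by a community relation $\phi$, and using $\lambda_0, \lambda_1$ as assumed names for the Beta hyperparameters, the conjugacy gives

```latex
p(\phi \mid x_1, \ldots, x_n)
\;\propto\; \underbrace{\prod_{i=1}^{n} \phi^{x_i} (1-\phi)^{1-x_i}}_{\text{Bernoulli likelihood}}
\;\underbrace{\phi^{\lambda_0 - 1} (1-\phi)^{\lambda_1 - 1}}_{\text{Beta prior}}
\;=\; \phi^{\,n_1 + \lambda_0 - 1} (1-\phi)^{\,n_0 + \lambda_1 - 1}
```

where $n_1$ and $n_0$ count the ones and zeros among the observations. The right-hand side is the kernel of a $\mathrm{Beta}(n_1 + \lambda_0,\; n_0 + \lambda_1)$ distribution, which is what permits integrating $\phi$ out in closed form.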
There are two ways in which to approach marginalizing level memberships in our model. Firstly, Sethuraman (1994) showed that the realization of the stick breaking process follows the Dirichlet distribution. We can leverage this because, in practice, the dimensionality of the level memberships is bounded by the depth of the tree,
Entity paths are one of the two variables which remain after collapsing the Gibbs sampler and must therefore be sampled directly. To sample a path for entity
Level indicators are drawn from the multinomial distribution conditioned on level memberships. Recall that level memberships were marginalized over in our inference scheme using the multinomial-stick conjugacy and are thus never sampled directly. Nevertheless, we draw them indirectly when computing the prior for level indicators. As with sampling paths, we obtain the distribution proportional to that of level indicators by Bayes’ rule. In what follows, we provide the derivation for the posterior of
Having marginalized out community relations and level memberships as well as derived the sampling equations for entity paths and level indicators, it is possible to perform collapsed Gibbs sampling by iteratively sampling from the remaining variables’ full conditional distributions. This process has a time complexity of
After the Gibbs sampler has been burned in, it is necessary to obtain final samples to induce a hierarchical clustering. We take multiple samples to account for the spread in the posterior distribution. A consequence of this is that samples may differ and need to be aggregated to produce a final result. In this regard, we take the mode over the final samples to arrive at a final hierarchy. The Gibbs sampling procedure is summarized in Algorithm 1.
Evaluation
The evaluation of our model is split into two parts: quantitative and qualitative. The quantitative evaluation provides objective measures of model performance whereas the qualitative evaluation assesses our model through illustrations and subjective analysis of the results. For both types of evaluation, our model first had to be inferred before final samples could be drawn. In this regard, we trained our model on three datasets with 200 burn-in iterations, choosing hyperparameters by assessing the model's log likelihood. After burn-in, ten final samples were obtained by keeping only every third successive sample to account for autocorrelation between samples. All models were trained to a depth of
Datasets
Our model was evaluated on three datasets: Synthetic Binary Tree, FB15k-237, and Wikidata. What follows is a brief description of each dataset as well as how it was generated.
Synthetic Binary Tree
The Synthetic Binary Tree (SBT) dataset was synthetically generated to test our model's ability to separate communities at the lowest level in the hierarchy. The generative process first constructed a binary tree of depth four, then assigned entities to communities, and finally sampled relations for each entity pair. All entities were assigned uniformly to communities on the lowest level of the hierarchy, resulting in 25 entities per leaf community. The sampling probability for each entity pair was determined by the level of their lowest common ancestor. Specifically, sampling probabilities of 0, 0.1, 0.4, and 0.6 were used for levels 0, 1, 2, and 3, respectively. Two entities belonging to the same leaf community have a sampling probability of 1 and are thus always related. The dataset was generated for two predicates which shared the aforementioned sampling probabilities. We note that even though these probabilities are identical, they do not result in a dataset in which entity relations are identical across predicates, as relations are sampled independently for each predicate. The generative process yielded a dataset of 55880 triples, 400 entities, and 2 predicates.
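A sketch of this generative process is given below. Function and variable names are ours, and the published dataset was produced by our own implementation with its own random state, so the exact sampled triples will differ:

```python
import itertools
import random

def lca_level(a, b, depth=4):
    """Level of the lowest common ancestor of two leaf indices in a
    complete binary tree of the given depth (root = level 0)."""
    level = depth
    while a != b:
        a //= 2
        b //= 2
        level -= 1
    return level

def generate_sbt(n_leaves=16, per_leaf=25, n_predicates=2,
                 level_probs=(0.0, 0.1, 0.4, 0.6), seed=0):
    """Sample (subject, predicate, object) triples where each ordered
    entity pair is related with a probability set by the level of the
    lowest common ancestor of their leaf communities."""
    rng = random.Random(seed)
    n_entities = n_leaves * per_leaf
    leaf_of = {e: e // per_leaf for e in range(n_entities)}
    triples = []
    for p in range(n_predicates):
        # relations are sampled independently per predicate
        for s, o in itertools.permutations(range(n_entities), 2):
            ls, lo = leaf_of[s], leaf_of[o]
            # same leaf community: always related; otherwise use LCA level
            prob = 1.0 if ls == lo else level_probs[lca_level(ls, lo)]
            if rng.random() < prob:
                triples.append((s, p, o))
    return triples
```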
FB15k-237
The FB15k-237 dataset (Toutanova & Chen, 2015) is a subset of the FB15k dataset (Bordes et al., 2013), created by removing redundant and inverse triples. The original FB15k dataset is in turn a subset of a 2013 version of Freebase, from which triples were queried. The FB15k-237 dataset comprises 272115 triples, 14541 entities, and 237 predicates, thus presenting a computational challenge to our model if modeled in whole. To address this issue, we generated a subset of the data and derived ground truth community labels in an approach inspired by Jain et al. (2021). Specifically, entities were mapped to the WordNet taxonomy (Miller, 1995) through the
Wikidata
The Wikidata dataset was generated by querying Wikidata for triples relating to people and locations. Specifically, artists and footballers corresponding to Wikidata identifiers
Quantitative Evaluation
In our quantitative evaluation, we first analyzed the quality of our learned hierarchical clustering by calculating two clustering quality metrics at each level of the hierarchy: the adjusted Rand index (ARI) (Hubert & Arabie, 1985) and normalized mutual information (NMI) (Shannon, 1948). This type of evaluation jointly assesses the quality of the learned community hierarchy as well as the membership of entities to communities. The ARI is an adjustment of the commonly used Rand index (RI) (Rand, 1971), corrected to account for chance. Chance is factored in by calculating the expected RI given a random clustering and measuring the obtained clustering's deviation from it. Formally, given an obtained entity clustering
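For concreteness, the ARI can be computed from the contingency table of two labelings; the sketch below follows the standard Hubert and Arabie formulation, with names of our own choosing:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (RI - E[RI]) / (max(RI) - E[RI]), computed from the
    contingency table of the two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))   # contingency table
    rows = Counter(labels_true)                      # row marginals
    cols = Counter(labels_pred)                      # column marginals
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. a single cluster
        return 1.0
    return (index - expected) / (max_index - expected)
```

Identical clusterings score 1 regardless of how the cluster labels are permuted, and a random clustering scores close to 0 in expectation.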
ARI and NMI Scores (Mean
In general, the results indicate that our model is capable of learning a coherent community hierarchy on each of the three datasets tested. Perhaps unsurprisingly, communities at higher levels in the hierarchy are judged to be of higher quality by the two evaluation metrics. This is because clustering entities at higher levels is a simpler task, as the communities are less fine-grained. For instance, on the FB15k-237 dataset, clustering at level 1 requires the distinction between
ARI and NMI Scores (Mean

Plots of average log likelihood of our model across burn in samples on three datasets.

Excerpt of our induced hierarchy on the FB15k-237 dataset. Entities in communities

Excerpt of our induced hierarchy on the Wikidata dataset. Entities in communities

Plots of learned community relations for selected outgoing predicates for the FB15k-237 and Wikidata datasets. Specifically, we showcased community
We can also analyze the complete log likelihood as a function of the number of Gibbs samples taken in the inference process. While this does not provide information about the quality of the obtained results, it does verify the inference process itself. Specifically, we expect the log likelihood of our model to rise as more burn-in samples are drawn, indicating that the likelihood of generating the knowledge graph given the current state of the sampler is increasing and that learning is taking place. We can see this rise in Figure 8, which plots the complete log likelihoods of our model across Gibbs samples for the three datasets. We note a dip in log likelihoods on the SBT and Wikidata datasets. This is likely due to the sampler being temporarily stuck in a low-likelihood region of the sample space before escaping it. The underperformance on the SBT dataset compared to the baseline approaches is largely attributable to our method's stochastic nature. Recall that after burn-in, final samples are drawn; unless the log likelihood is sufficiently high, these samples will contain suboptimal allocations. Such samples are inherent to Gibbs sampling but result in poorer quantitative performance. We hypothesize that drawing more burn-in samples would stabilize the final samples and produce better results. This is supported by the log likelihoods in Figure 8 but, to maintain consistency between the three datasets, was not explored further.
In our qualitative evaluation, we leverage the qualitative attributes possessed by a useful taxonomy as identified by Nickerson et al. (2013). Although these attributes were proposed in the context of taxonomy development, we make use of them in our work as the task of hierarchical clustering shares many of the underlying evaluation principles. Indeed, a taxonomy is implicitly induced using our method, although never explicitly labeled. The proposed taxonomy attributes are as follows: concise, robust, comprehensive, extendable, and explanatory. For each of these attributes, we provide a brief description extended to hierarchical clustering and use it to evaluate our model. A concise clustering is limited in both the number of obtained clusters and the semantic diversity of the entities that constitute each cluster. In our method, this is largely regulated by the hyperparameters

Robustness in clustering refers to "maximizing both within-group homogeneity and between-group heterogeneity, [making] groups that are as distinct (nonoverlapping) as possible, with all members within a group being as alike as possible" (Bailey, 1994). We see robustness as an issue when examining the non-footballer professions in Figure 9. Pianists, journalists, politicians, and scientists are not sufficiently semantically homogeneous to warrant their inclusion in community

A taxonomy is comprehensive if it can classify all entities in the domain. Clusterings, including the hierarchical clustering obtained by our method, are induced empirically from the data and thus necessarily classify all entities into communities. As such, our generative model is comprehensive. An extendable clustering is one that allows for the dynamic inclusion of additional information and, in the hierarchical case, for structural adaptation to incoming information. In both respects, our method is highly extendable.
Indeed, the Gibbs sampling process itself requires the removal of entities from the hierarchy before they are resampled. Each resampling not only has the potential to change the community assignment of the entity but also to change the structure of the hierarchy itself. Due to the infinite formulation of the nCRP and stick breaking process, there is no prior constraint on the structure. The final qualitative attribute identified in a useful taxonomy is that it is explanatory. In this regard, the taxonomy should provide useful explanations about the objects it is organizing. In the context of hierarchical clustering, these explanations could, for instance, take the form of community labels. Although our model does not assign labels to communities, they can be ascertained by examining the type information of their constituent entities. For instance, in Figure 9, the FB15k-237 communities
In this article, we demonstrated the use of stochastic blockmodels for learning hierarchies from knowledge graphs in an approach that is, to the best of our knowledge, the first to marry these two fields in an academic setting. To this end, we proposed a model which leverages the nCRP and stick breaking process to generate a community hierarchy composed of a knowledge graph's constituent entities. The model is defined non-parametrically and thus makes no assumptions about its structure, allowing it to adapt to the knowledge graph and potentially induce an infinite number of communities on an infinite number of levels. In addition to the model itself, a Markov chain Monte Carlo scheme leveraging collapsed Gibbs sampling was devised for posterior inference of the model's parameters. The model was evaluated on three datasets using quantitative and qualitative analysis to demonstrate its effectiveness in learning coherent community hierarchies on both synthetic and real-world data. The qualitative analysis made use of attributes commonly employed in taxonomy evaluation, presenting a novel and principled approach for qualitative analysis of hierarchical clusterings. Future work will investigate the scalability of applying our model, and stochastic blockmodels more generally, to knowledge graphs. As discussed earlier, the time complexities of the inference schemes for these models usually do not allow for scaling to the types of large knowledge bases encountered in real-world applications. The inference scheme proposed in this work is one such instance; however, several methods more scalable than collapsed Gibbs sampling exist in the literature. An example is variational inference, a scheme that uses the evidence lower bound to guide the inference process and obtain the posterior distribution. This method is generally faster than Gibbs sampling and, although not asymptotically exact, produces similar results (Salimans et al., 2015).
The challenge with this approach is that its optimization equations are more difficult to derive than those of Markov chain Monte Carlo methods. Despite this, several works have successfully applied variational inference to probabilistic graphical models. Indeed, the original inference scheme for the Mixed Membership Stochastic Blockmodel leveraged variational inference. Furthermore, Blei and Jordan (2006) provided a variational inference scheme for Dirichlet processes, and a variational inference scheme for the nCRP was proposed in Wang and Blei (2009). Departing from the variational approach, Chen et al. (2017) proposed an evolution of the Gibbs sampling algorithm for the nCRP based on partially collapsed Gibbs sampling. This approach improved runtime by over two orders of magnitude compared to the classic Gibbs sampling approach. Another approach to increasing the scalability of stochastic blockmodels is to devise a model which does not require sampling all
