Introduction
In recent years, using graph structures to model and store data has garnered increasing attention among practitioners in sectors ranging from academia to government to industry. Indeed, by some measures (Jacob, 2021; solidIT, 2022), graph database management systems have been the fastest growing database type over the past decade. One of the more obvious manifestations of this rise is the recent growth of large-scale public graph databases such as DBpedia (Lehmann et al., 2015), YAGO (Pellissier Tanon et al., 2020), and Wikidata (Vrandečić & Krötzsch, 2014). The last of these, for instance, contains just over 100 million entities as of 2024, a nearly sevenfold increase over its count in 2014. Open access to such amounts of graph data has spurred its use in research related to the Semantic Web, artificial intelligence, and computer science broadly. One field of research which has received considerable attention is that of mathematically modeling the underlying graph structure that emerges when a knowledge base is populated with information. The modeling of this structure—which we refer to as the knowledge graph—proves useful in its application to downstream problems such as link prediction, entity clustering, and hierarchy induction. The last two of these provided the impetus for our work.
Entity clustering refers to the task of grouping together entities in a knowledge graph which share similar properties. The measure by which entities are judged to be similar varies and is one of the key considerations when devising an approach to their clustering. Obtaining an entity clustering allows for the discovery of structures which are implicit in the knowledge graph and provides insight into the number and types of categories which exist in the data. The process operates on unlabeled data and is therefore a type of unsupervised learning. As such, it is one of the first and most useful operations applied to a knowledge graph when performing exploratory analysis. Another important unsupervised learning task is that of hierarchy induction. The clearest example of a knowledge graph hierarchy is the class taxonomy which organizes a knowledge graph's classes through superclass-subclass relations. The task of inducing such a taxonomy amounts to learning how the classes are organized hierarchically in the knowledge graph. Similarly, hierarchical clustering of a knowledge graph's entities extends the clustering task described earlier by imposing a hierarchical organization on the clusters themselves. This makes it possible not only to discover which entities are semantically similar, as per the clustering, but also to learn how entities relate to one another hierarchically. The motivating factors behind learning knowledge graph hierarchies are varied. Perhaps the simplest is that hierarchical structures organize data in a way that is highly intuitive and interpretable to humans. For instance, a hierarchical clustering of knowledge graph entities makes it apparent which entities constitute the broadest concepts in the knowledge graph and how they relate to their descendants. Similarly, a taxonomy of classes reveals implicit relations between entities through its transitive properties.
Put plainly, hierarchies induced from knowledge graphs are useful because they are easy to understand. Indeed, the most widely used knowledge bases—such as the aforementioned DBpedia, YAGO, and Wikidata—are organized by hierarchical structures, namely trees and directed acyclic graphs. That is to say, these knowledge graphs are hierarchical at their core. Furthermore, hierarchies are used as components of larger systems to solve common tasks related to knowledge graphs. For instance, hierarchies are used in learning knowledge graph embeddings, both explicitly as an input feature of the model (Xie et al., 2016) and implicitly as a byproduct of the embedding process (Zhang et al., 2020). As embedding is one of the most common problems in the knowledge graph community, learning accurate hierarchies is therefore desirable.
In this regard, our work proposes a generative model for knowledge graphs which induces a clustering of entities and organizes it hierarchically. Our approach belongs to a class of probabilistic graphical models called stochastic blockmodels. In broad strokes, these models operate by decomposing a knowledge graph into a set of probability distributions which are then sampled from to generate the knowledge graph. As a byproduct of this sampling process, a hierarchical clustering of knowledge graph entities is induced. To the best of our knowledge, our approach is the first to apply stochastic blockmodels to knowledge graphs and one of very few probabilistic graphical models to be used for the purpose of knowledge graph hierarchy induction. To highlight this, we position our work in the context of existing stochastic blockmodels and hierarchy induction methods in Section 2 and provide a gentle introduction for their understanding in Section 3. The formal definition of our model that follows in Section 4 results in a joint distribution which is intractable for exact inference. The parameters of our model must therefore be approximated using collapsed Gibbs sampling. To this end, we provide the full derivation of sampling equations as well as the marginalization of collapsed variables. Additional information to supplement Section 4 may be found in Appendices A through D. Section 5 evaluates our model quantitatively and qualitatively on three datasets. Section 6 concludes the article by summarizing its contributions and providing avenues for future work.
Related Work
Our proposed model lies at the intersection of two areas in artificial intelligence which deal with modeling graph data: stochastic blockmodeling and hierarchy induction. Due to the limited overlap of these fields, we provide separate summaries of related works for each.
Stochastic Blockmodels
Stochastic blockmodels are a class of probabilistic graphical models used for generating random graphs with roots in the fields of social science and mathematics. First proposed in 1983 by Holland et al. (1983) for modeling social networks, they have expanded their utility to fields such as biochemistry (Wang et al., 2023), education (Sweet, 2019), and artificial intelligence (Airoldi et al., 2008; Ho et al., 2011; Zhang et al., 2022) among others. In simplest terms, stochastic blockmodels are a type of Bayesian non-parametric graph partition model in that their approach relies on grouping graph entities together via partitions—often referred to as blocks—which share similar structural properties. The generative process by which this partitioning occurs is realized by sampling from a set of probability distributions, giving rise to the stochasticity of stochastic blockmodels. The learning process is then to infer the parameters of these distributions using a Bayesian inference scheme. We provide a technical introduction to stochastic blockmodels in the subsequent section.
The seminal work in this area is the Stochastic Blockmodel (Nowicki & Snijders, 2001) which partitions entities into a fixed number of communities and models the interactions between them as those of their communities. Community relations are modeled via a community relations matrix which assigns a degree to all pairwise interactions between the communities in the model. This idea was extended to the infinite case, allowing for an a priori unspecified number of communities via the Chinese restaurant process (Aldous, 1985), in the Infinite Relational Model (Kemp et al., 2006) and its recent hierarchical counterpart, the Hierarchical Infinite Relational Model (Saad & Mansinghka, 2021). A variant which relaxes the notion of community membership to allow for entities belonging to multiple communities is the aptly named Mixed Membership Stochastic Blockmodel (Airoldi et al., 2008). By allowing for mixed membership, the model is better able to capture entities whose belonging to a community is not crisp. For instance, the membership of tomatoes in the community of fruits is not perfect since a tomato can be considered a vegetable in certain contexts such as cooking. This idea was generalized to the infinite case in the Dynamic Infinite Mixed Membership Stochastic Blockmodel (Ding et al., 2021a) and the hierarchical case in the Multiscale Community Blockmodel (Ho et al., 2011). The latter of these two is closely related to our model and receives more attention later in the article. All of the aforementioned models, however, operate on graphs wherein entities are related to one another through the same type of edge, making them unsuitable for application to knowledge graphs without modification.
The underlying structure of a knowledge graph is that of a multilayer graph wherein entities interact with one another through different types of relations, represented as different types of edges in the graph. These relations may be thought of as separate layers of graphs which share the same entities. Multilayer graphs have also received considerable attention in stochastic blockmodeling. Perhaps the simplest approach is to aggregate the layers in the multilayer graph into a single layer before applying a conventional blockmodeling approach, as was done in Berlingerio et al. (2011). A closely related approach is to model each layer in the graph independently, as done in Barigozzi et al. (2011), and aggregate the results afterwards. These approaches offer limited success as they do not capture the interlayer dependencies in the multilayer graph and treat each layer as equally valuable in its content during modeling, as pointed out by Paul and Chen (2016). To remedy this, the authors propose a multilayer extension of the aforementioned Stochastic Blockmodel, aptly named the Multilayer Stochastic Blockmodel, which modifies the original community relations matrix into a community relations tensor to account for graph multilayeredness. Analogously, a multilayer extension of the Mixed Membership Stochastic Blockmodel was proposed by De Bacco et al. (2017). Finally, the Multilayer Neural Blockmodel (Pietrasik & Reformat, 2021a) was proposed recently as a way to marry neural networks with the probabilistic approach of stochastic blockmodels for modeling multilayer graphs. A comprehensive review of stochastic blockmodels and their applications is provided by Lee and Wilkinson (2019).
Hierarchy Induction Models
In the context of our work, hierarchy induction refers to the discovery of hierarchical structures which are implicit and otherwise unexpressed in a knowledge graph. One concrete way this task is formulated is as that of learning subsumption axioms for classes in a knowledge graph, thereby discovering a hierarchical organization of a knowledge graph's entities. To this end, Statistical Schema Induction (Völker & Niepert, 2011) uses association rule mining on a knowledge graph's transaction table to generate subsumption axioms with support and confidence values which are then used as the basis for a greedy algorithm for constructing an ontology. SMICT (Pietrasik & Reformat, 2020) transforms a knowledge graph into a tuple structure wherein entities are annotated by tags and applies a greedy algorithm to learn a taxonomy of classes. This method was extended to perform hierarchical clustering using the Jaccard coefficient (Pietrasik & Reformat, 2021b). More recently, the Non-Parametric Path Based Model (Pietrasik et al., 2024) leveraged a tuple structure in conjunction with a discretized prior rooted in the nested Chinese restaurant process to obtain state-of-the-art results. Despite its strong performance, the optimization scheme that was proposed did not scale well to large-scale knowledge graphs. In general, by transforming a knowledge graph into a tuple structure, various methods in the area of tag hierarchy induction (Heymann & Garcia-Molina, 2006; Schmitz, 2006; Wang et al., 2018) can be leveraged. In a related approach, Chen and Reformat (2014) derive a similarity matrix from a knowledge graph's tuple structure which serves as the clustering metric for hierarchical agglomerative clustering. Mohamed (2019) takes a similar approach wherein subjects which are described by the same tag pairs are assigned to the same groups. The similarity between these groups is then calculated to construct a hierarchy.
In a method which bears similarity to our own, Zhang et al. (2022) use a non-parametric Bayesian approach to induce a hierarchy of topic communities. Despite a similar statistical framework and inference scheme, the hierarchy induced by this work differs significantly from our own. For instance, relations between communities are not modeled and entities are never explicitly assigned to communities. Along similar lines is GMMSchema (Bonifati et al., 2022) which uses a Gaussian mixture model to generate a schema graph which can be viewed as a hierarchical abstraction of the original knowledge graph.
Another common approach to learning hierarchies from knowledge graphs is via an intermediate representation which lends itself well to existing hierarchy induction methods. To this end, knowledge graph embedding is oftentimes leveraged. This process involves learning a mapping from the discrete knowledge graph to a continuous vector space. The vector representation may then serve as the input to machine and deep learning methods for hierarchy learning. Translation based methods such as the seminal TransE (Bordes et al., 2013) and its extensions (Ji et al., 2015; Lin et al., 2015; Wang et al., 2014) treat relations in a knowledge graph as translations between entities. Additive in nature, they operate on the intuition that the embeddings of subjects and objects should be proximal when translated by the relation of a valid triple. These embeddings are learned by minimizing an objective function using an optimization method such as stochastic gradient descent. Bilinear methods (Balažević et al., 2019; Kazemi & Poole, 2018; Nickel et al., 2011; Yang et al., 2015) operate on the binary adjacency tensor of the knowledge graph and factorize entities and relations into vectors and matrices. Triples are then modeled as their resulting product. These methods tend to outperform translation based methods on standard benchmarks but suffer from higher training complexity. Deep learning models have also been proposed in the context of knowledge graph embeddings. For instance, the Relational Graph Convolution Network (Schlichtkrull et al., 2018) leverages graph convolutions to learn neighborhood information of entities, thereby explicitly incorporating structural information into its modeling. Another widely used deep approach, ConvE (Dettmers et al., 2018), stacks subject and predicate embeddings as a matrix and convolves over them in two dimensions using a neural framework.
This approach was extended in ConvKB (Nguyen et al., 2018) which incorporates objects into the convolution process and CapsE (Vu et al., 2019) which uses a similar architecture with capsule layers to yield scores for triples. A recent and comprehensive comparative analysis of various embedding methods may be found in Rossi et al. (2021).
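The translation based intuition described earlier, namely that a subject embedding translated by its relation embedding should land near the object embedding for a valid triple, can be sketched in a few lines. This is an illustrative sketch only, not any particular published implementation; the four-dimensional embeddings and the choice of L1 distance are arbitrary demonstration values.

```python
import numpy as np

def transe_score(s, r, o, norm=1):
    """Score a triple under the translation intuition: the subject
    embedding, translated by the relation embedding, should land near
    the object embedding. Lower scores indicate more plausible triples."""
    return np.linalg.norm(s + r - o, ord=norm)

# Hypothetical embeddings, chosen so the valid triple translates exactly.
s = np.array([0.1, 0.4, -0.2, 0.3])
r = np.array([0.2, -0.1, 0.5, 0.0])
o = np.array([0.3, 0.3, 0.3, 0.3])

valid = transe_score(s, r, o)      # s + r lands exactly on o
corrupt = transe_score(s, r, -o)   # corrupted object, larger distance
assert valid < corrupt
```

Training then amounts to minimizing such scores for observed triples while keeping scores of corrupted triples high, typically via a margin-based loss and stochastic gradient descent.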
Having obtained an embedded representation of a knowledge graph, hierarchical clustering methods can be applied to induce a hierarchy. For instance, RESCAL (Nickel et al., 2011), a bilinear embedding method, was used in conjunction with OPTICS (Ankerst et al., 1999), a density based hierarchical clustering algorithm, in Nickel et al. (2012) to obtain a hierarchical clustering of entities. The authors found that such an approach achieves more coherent results for concepts which appear at the top of the hierarchy, largely due to data sparsity for descendant concepts. Along similar lines, TIEmb (Ristoski et al., 2017) generates embeddings using RDF2Vec (Ristoski & Paulheim, 2016), an embedding method based on the skip-gram language model (Mikolov et al., 2013), before learning a hierarchical structure based on the proximities of class centroids in the embedded space. The same embedding approach was used in Martel and Zouaq (2021) wherein the embeddings were then clustered using hierarchical agglomerative clustering and assigned types. This type of clustering was used in the field of cybersecurity in Ding et al. (2021b) wherein a bag-of-words representation of a knowledge graph served as input. Compared with the aforementioned subsumption axiom induction methods, which rely largely on frequencies and co-occurrences between type classes, embedding based approaches typically embed an entire knowledge graph, thus leveraging a larger and much richer body of information. As such, one would expect embedding based methods to be more robust to datasets which are poorly structured in terms of their type information. To the best of our knowledge, there is yet to be an analysis performed which compares the two approaches.
Preliminaries
Before describing the details of our proposed model, we provide a basic overview of several concepts necessary for its understanding. These concepts are described only insofar as to provide readers with the foundation on which the explanation of our model can be built. We encourage readers unfamiliar with knowledge graphs or Bayesian non-parametrics to follow the relevant citations provided in each of the subsequent subsections. To aid in readability we use the following conventions in our notation: lowercase italic Latin letters for iterators and indexers; uppercase italic Latin letters for scalar variables; lowercase boldface Latin letters for vectors; uppercase boldface Latin letters for matrices and tensors; uppercase stylized Latin letters for sets; lowercase Greek letters for hyperparameters; and uppercase Greek letters for functions.
Knowledge Graphs
We refer to Hogan et al. (2021) for their definition of knowledge graphs as

Toy example of a knowledge graph and how it may be modeled by a stochastic blockmodel. Starting from top left quadrant and proceeding clockwise: graphical representation of a knowledge graph with entities
Stochastic blockmodels are a heterogeneous collection of generative models united in their adoption of two characteristics: stochasticity in the generative process and the partitioning of nodes into communities. Describing them by referring to a concrete instance is thus bound to include definitions which do not apply to all members of the class. With this in mind, our introduction to stochastic blockmodels draws on their key characteristics to motivate a toy stochastic blockmodel for generating a knowledge graph. All stochastic blockmodels are defined by a set of probability distributions from which samples are obtained to generate the adjacency tensor of the knowledge graph.
In step 3.1, variables can be initialized by sampling from their prior distributions or specified explicitly if a priori evidence to suggest their true values exists. Step 3.2 depicts the iterative sampling of model variables from their full conditionals. We note that samples obtained early in this process may be drawn from a distribution distant from the desired stationary distribution. As such, it is necessary to discard the samples obtained before this distribution has been reached. This process is commonly referred to as burning in the Gibbs sampler, and the number of discarded iterations as the burn-in iterations. Furthermore, as successive samples in this process are autocorrelated, a lag period may be applied such that samples obtained during the lag period are also discarded. Thus, if our toy example performs 1000 iterations with a burn-in of 900 and a lag of 10, only 9 samples will be obtained as the output of the Gibbs sampler. These 9 samples are then aggregated over to account for the stochasticity in sampling from the posterior and to arrive at a final result. The process by which these samples are aggregated is model specific and may be as simple as taking the mode of the samples. An introduction to Gibbs sampling and related sampling schemes is provided by MacKay (2002), and a thorough discussion of stochastic blockmodels along with concrete examples is provided by Abbe (2017).
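The burn-in and lag bookkeeping described above can be sketched as follows. We adopt the convention that a lag of 10 discards ten iterations between each pair of retained samples, which reproduces the 9 samples of the toy example; other conventions for counting the lag exist.

```python
def kept_iterations(total_iters, burn_in, lag):
    """Return the Gibbs iteration indices whose samples are retained:
    the first `burn_in` iterations are discarded, and `lag` iterations
    are discarded between each pair of retained samples."""
    kept = []
    for i in range(1, total_iters + 1):
        if i <= burn_in:
            continue  # still burning in the sampler
        if (i - burn_in) % (lag + 1) == 0:
            kept.append(i)
    return kept

# The toy example from the text: 1000 iterations, burn-in of 900, lag of 10.
samples = kept_iterations(1000, 900, 10)
assert len(samples) == 9
```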

The Chinese restaurant process (CRP) (Aldous, 1985) is a discrete stochastic process that yields a probability distribution in accordance with the preferential attachment principle. In this view, it is both a Dirichlet process (Ferguson, 1973) as it generates a probability distribution and a preferential attachment process (Barabási & Albert, 1999) as the distribution is generated such that probabilities are proportional to past draws. The process is explained through a metaphor of seating patrons at a Chinese restaurant. Consider this restaurant as containing an infinite number of tables with each table having the capacity to seat an infinite number of patrons. Patrons are seated sequentially, such that the first patron is seated at the first table and every subsequent patron may be seated at an occupied table or the first unoccupied table. The probability of being seated at an occupied table is proportional to the number of patrons already seated at it. This process is illustrated through the toy example in Figure 2 which shows a potential state of the CRP after seating six patrons along with the sample probabilities of seating the seventh. Formally, the probability of seating patron $n+1$ at table $k$ is

$$P(z_{n+1} = k \mid z_1, \ldots, z_n) = \begin{cases} \dfrac{n_k}{n + \gamma} & \text{if table } k \text{ is occupied} \\ \dfrac{\gamma}{n + \gamma} & \text{if } k \text{ is the first unoccupied table} \end{cases}$$

where $n_k$ denotes the number of patrons seated at table $k$ and $\gamma$ is a concentration hyperparameter.

Toy example of a nCRP truncated to a depth of
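The seating metaphor lends itself to a short simulation. The sketch below is illustrative; `gamma` stands in for the CRP concentration hyperparameter and the helper name is our own.

```python
import random

def crp_seating(num_patrons, gamma, seed=0):
    """Simulate the Chinese restaurant process: each patron joins an
    occupied table with probability proportional to its occupancy, or
    the first unoccupied table with probability proportional to gamma."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of patrons seated at table k
    for _ in range(num_patrons):
        weights = tables + [gamma]  # occupancies, plus the new-table mass
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)  # first patron at a new table
        else:
            tables[k] += 1
    return tables

tables = crp_seating(100, gamma=1.0)
assert sum(tables) == 100
```

Because occupied tables attract patrons in proportion to their size, a small number of large tables tends to emerge, which is precisely the rich-get-richer behavior of preferential attachment.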

The nested Chinese restaurant process (nCRP) (Blei et al., 2010; Griffiths et al., 2003) is an extension of the CRP formulated to account for hierarchical relations between the generated communities. The realization of this process is an infinitely deep and infinitely branching tree of communities defined by a set of paths,
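A truncated version of the nCRP can be sketched by running an independent CRP at every node along an entity's path down the tree. The dictionary representation of the tree and all helper names below are illustrative choices, not the notation of our model.

```python
import random

def ncrp_path(tree, depth, gamma, rng):
    """Sample one path through a nested CRP truncated to `depth` levels.
    `tree` maps a path prefix (a tuple of branch indices) to the
    occupancy counts of that node's children; at each level the entity
    runs a CRP over the children of the node it currently occupies."""
    path = ()
    for _ in range(depth):
        children = tree.setdefault(path, [])
        weights = children + [gamma]  # existing branches, plus a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(children):
            children.append(1)  # open a new branch
        else:
            children[k] += 1
        path = path + (k,)
    return path

rng = random.Random(0)
tree = {}
paths = [ncrp_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
assert all(len(p) == 3 for p in paths)
```

Each sampled path identifies a leaf in the (implicitly infinite) tree, and the shared prefixes of the paths define the hierarchy of communities.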
The stick breaking process (Sethuraman, 1994) is—like the CRP and nCRP—a Dirichlet process that draws its name from a metaphor which describes it. The metaphor starts by breaking a stick of unit length into two fragments at a point in the interval from 0 to 1 as drawn from the Beta distribution. One of the two fragments is preserved and the other fragment is broken again, analogously to the initial stick. This process is repeated an infinite number of times to yield an infinite number of fragments whose combined length is that of the initial stick. These fragments may be viewed as a probability distribution over the infinite sequence of discrete time-steps used to generate them. In other words, the stick breaking process is an infinite extension of the Dirichlet distribution insofar as while the Dirichlet distribution yields a probability distribution over

Toy example of the stick breaking process with values
In describing our proposed model, we will adopt the notations used in the previous section to indicate the connection with the ideas discussed in the preliminaries. To aid in understanding, we first provide a summary of the components of our model before defining the generative process. This is followed by a formalization of the Gibbs sampling procedure and derivation of sampling equations.
Model Description
Like all stochastic blockmodels, our model is defined as a set of probability distributions such that when these distributions are sampled from, they generate the adjacency tensor of the knowledge graph. The choice of these distributions makes assumptions about the underlying structure that governs the graph's interactions. In devising our model, we assume a hierarchy of entity communities which is captured in the form of a tree. The entities in these communities interact with one another as a function of their membership to a community. In other words, interactions are modeled at the community level and extended downwards to their constituent entities. Unlike most stochastic blockmodels, these community relations are modeled with respect to a predicate in the knowledge graph. In other words, the interactions between entities are predicate dependent such that the degree of interaction between entities changes depending on the predicate that links them. This allows the model to capture structures extending beyond those implied by mere interaction density. Thus, in order to generate the knowledge graph's adjacency tensor, we need to know its hierarchical community structure, its entities' memberships to communities, and the interactions between its communities. The induction of these components, which may be seen as a byproduct of the generative process, is the objective of our model. We note that the communities' constituent entities do not conform to is-a relationships as would be implied by the hierarchy. This is because the hierarchy is imposed on the communities themselves as opposed to their constituent entities. An example of this is highlighted in Figure 5 where the entity

Toy example depicting a potential hierarchy induced by our model. The table on right side captures the path and level sampled for each entity in the knowledge graph as well as its corresponding community. The left side provides a visualization of this hierarchy.
Entities are assigned to communities through the conjunction of two variables: entity paths and level indicators. Paths define the tree structure over the community hierarchy by sampling from the nCRP as described in the previous section. We thus denote an entity path as
Having sampled entity paths, in order for entities to be assigned to communities, their levels must be obtained. Entity levels are modeled by two variables in our approach: level memberships and level indicators. Level memberships, denoted
Community Relations
Community relations describe the degree to which entities in any two communities are likely to interact with one another through a specific predicate. In other words, they model the probability of observing a value of one in the knowledge graph’s adjacency tensor. These interactions are captured by a
In restricting the values of community relations we take an approach similar to that of the Multiscale Community Blockmodel. Specifically, we borrow the concept of a sibling group which refers to a set of communities that share the same parent in the hierarchy. Only the community relations between communities in the same sibling group are modeled in our approach. Thus, when obtaining the interaction degree of two entities whose communities have the same parent, it is sufficient to merely access the corresponding value in
Community relations are drawn from the Beta distribution parameterized by

The potential community relations induced by our model on the toy example introduced earlier. The hierarchy on the left of the figure has three sibling groups and three predicates:
The generative process of our model refers to the sequential sampling of components which allow for the generation of the target knowledge graph. In other words, the goal is to draw a binary value for each entry of the knowledge graph's adjacency tensor. The process proceeds as follows:

1. For each entity in the knowledge graph: sample its path and level variables.
2. For each sender community in the hierarchy, each receiver community in the hierarchy, and each predicate in the knowledge graph: sample their community relations.
3. For each sender entity in the knowledge graph, each receiver entity in the knowledge graph, and each predicate in the knowledge graph: sample their entity relation.
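A heavily simplified sketch of this generative process follows. For brevity it substitutes uniform draws for the nCRP and stick breaking priors, uses a fixed binary tree of depth two, and models relations between all community pairs rather than only sibling groups; the variable names (`lam0`, `lam1`, `phi`, and so on) are our own illustrative choices, not those of the formal definition.

```python
import random

rng = random.Random(0)
E, P, DEPTH = 6, 2, 2       # entities, predicates, truncation depth
lam0, lam1 = 1.0, 1.0       # assumed Beta hyperparameters on community relations

# 1. Per-entity variables: a path through a (here: fixed, binary) tree
#    and a level indicator that cuts the path to a community.
paths = [tuple(rng.randrange(2) for _ in range(DEPTH)) for _ in range(E)]
levels = [rng.randrange(DEPTH + 1) for _ in range(E)]
community = [paths[e][:levels[e]] for e in range(E)]

# 2. Community relations: one Beta draw per (sender, receiver, predicate).
comms = sorted(set(community))
phi = {(a, b, p): rng.betavariate(lam0, lam1)
       for a in comms for b in comms for p in range(P)}

# 3. Entity relations: Bernoulli draws fill the adjacency tensor, with
#    each entity inheriting the relation degree of its community.
adj = [[[1 if rng.random() < phi[(community[s], community[o], p)] else 0
         for o in range(E)] for s in range(E)] for p in range(P)]

assert all(v in (0, 1) for layer in adj for row in layer for v in row)
```

Inference then runs this logic in reverse: given the observed adjacency tensor, the paths, levels, and community relations are recovered via collapsed Gibbs sampling as described below.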
We note that this process is unsupervised and does not impose any assumptions about the partition of entities to communities or the structure of the hierarchy other than to limit its depth. In fact, the depth is the only constraint imposed on the generative process. The other hyperparameters which must be specified a priori—namely

Plate diagram for our model. Circles indicate random variables whereas diamonds indicate model hyperparameters. Shading indicates observed variables and lack of shading indicates latent variables.
Collapsed Gibbs sampling refers to an extension of Gibbs sampling in which a subset of model variables are marginalized over and therefore do not need to be sampled directly. These variables are said to be collapsed out of the Gibbs sampler. Collapsing of these variables is done analytically via integration and ensures a faster mixing process. This is because the calculation of probability distributions for sampling is generally computationally expensive. Having fewer variables then leads to a faster arrival at the desired stationary distribution. Furthermore, the calculation of probability distributions which have not been collapsed out of the sampling process is generally faster in collapsed Gibbs sampling. This is because in regular Gibbs sampling draws are made from the full conditionals of variables. In collapsed Gibbs sampling, collapsed variables have been integrated out of the process and the remaining variables are conditioned on a lower-dimensional space. Collapsing of variables is usually tractable when they are the conjugate prior of their dependent variables. In our model, community relations and level memberships are both conjugate priors of their dependent variables, namely level indicators and entity relations, respectively. We leverage these conjugacies to marginalize over these two variables in our sampling process. After marginalization, the sampling equations may be derived for the remaining variables.
Marginalizing Community Relations
In order to marginalize out community relations, it is necessary to find a closed form solution which allows for integration during path sampling. To this end, we can leverage the Bernoulli-Beta conjugacy which ensures that given a Bernoulli likelihood and Beta prior, the posterior will also be drawn from the Beta distribution. Employing this conjugacy is possible due to the formulation of our model in which entity relations are drawn from the Bernoulli distribution and community relations assume a Beta prior. We see this explicitly when applying Bayes’ theorem to obtain the posterior as follows:
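Concretely, writing $x_1, \ldots, x_n$ for the Bernoulli observations generated by a community relation $\phi$, and using $\lambda_0, \lambda_1$ as assumed names for the Beta hyperparameters, the conjugacy gives

```latex
p(\phi \mid x_1, \ldots, x_n)
\;\propto\; \underbrace{\prod_{i=1}^{n} \phi^{x_i} (1-\phi)^{1-x_i}}_{\text{Bernoulli likelihood}}
\;\underbrace{\phi^{\lambda_0 - 1} (1-\phi)^{\lambda_1 - 1}}_{\text{Beta prior}}
\;=\; \phi^{\,n_1 + \lambda_0 - 1} (1-\phi)^{\,n_0 + \lambda_1 - 1}
```

where $n_1$ and $n_0$ count the ones and zeros among the observations. The right-hand side is the kernel of a $\mathrm{Beta}(n_1 + \lambda_0,\; n_0 + \lambda_1)$ distribution, which is what permits integrating $\phi$ out in closed form.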
There are two ways in which to approach marginalizing level memberships in our model. Firstly, Sethuraman (1994) showed that the realization of the stick breaking process follows the Dirichlet distribution. We can leverage this because, in practice, the dimensionality of the level memberships is bounded by the depth of the tree,
Entity paths are one of the two variables which remain after collapsing the Gibbs sampler and must therefore be sampled directly. To sample a path for entity
Level indicators are drawn from the multinomial distribution conditioned on level memberships. Recall that level memberships were marginalized over in our inference scheme using the multinomial-stick conjugacy and are thus never sampled directly. Nevertheless, we draw them indirectly when computing the prior for level indicators. As with sampling paths, we obtain the distribution proportional to that of level indicators by Bayes’ rule. In what follows, we provide the derivation for the posterior of
Having marginalized out community relations and level memberships as well as derived the sampling equations for entity paths and level indicators, it is possible to perform collapsed Gibbs sampling by iteratively sampling from the remaining variables’ full conditional distributions. This process has a time complexity of
After the Gibbs sampler has been burned in, it is necessary to obtain final samples to induce a hierarchical clustering. We take multiple samples to account for the spread in the posterior distribution. A consequence of this is that samples may differ and need to be aggregated to produce a final result. In this regard, we take the mode over the final samples to arrive at a final hierarchy. The Gibbs sampling procedure is summarized in Algorithm 1.
Evaluation
The evaluation of our model is split into two parts: quantitative and qualitative. The quantitative evaluation provides objective measures of model performance whereas the qualitative evaluation assesses our model through illustrations and subjective analysis of the results. For both types of evaluation, our model first had to be inferred before final samples could be drawn. In this regard, we trained our model on three datasets with 200 burn-in iterations, choosing hyperparameters by assessing the model's log likelihood. After burn-in, ten final samples were obtained by keeping only every third successive sample to account for autocorrelation between samples. All models were trained to a depth of
Datasets
Our model was evaluated on three datasets: Synthetic Binary Tree, FB15k-237, and Wikidata. What follows is a brief description of each dataset as well as how it was generated.
Synthetic Binary Tree
The Synthetic Binary Tree (SBT) dataset was synthetically generated to test our model's ability to separate communities at the lowest level in the hierarchy. The generative process first constructed a binary tree of depth four, then assigned entities to communities, and finally sampled relations for each entity pair. All entities were assigned uniformly to communities on the lowest level of the hierarchy, resulting in 25 entities per leaf community. The sampling probability for each entity pair was determined by the level of their lowest common ancestor. Specifically, sampling probabilities of 0, 0.1, 0.4, and 0.6 were used for levels 0, 1, 2, and 3, respectively. Two entities belonging to the same leaf community have a sampling probability of 1 and are thus always related. The dataset was generated for two predicates which shared the aforementioned sampling probabilities. We note that even though these probabilities are identical, they do not result in a dataset in which entity relations are identical across predicates, as relations are sampled independently for each predicate. The generative process yielded a dataset of 55880 triples, 400 entities, and 2 predicates.
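A sketch of this generative process is given below. Function and variable names are ours, and the published dataset was produced by our own implementation with its own random state, so the exact sampled triples will differ:

```python
import itertools
import random

def lca_level(a, b, depth=4):
    """Level of the lowest common ancestor of two leaf indices in a
    complete binary tree of the given depth (root = level 0)."""
    level = depth
    while a != b:
        a //= 2
        b //= 2
        level -= 1
    return level

def generate_sbt(n_leaves=16, per_leaf=25, n_predicates=2,
                 level_probs=(0.0, 0.1, 0.4, 0.6), seed=0):
    """Sample (subject, predicate, object) triples where each ordered
    entity pair is related with a probability set by the level of the
    lowest common ancestor of their leaf communities."""
    rng = random.Random(seed)
    n_entities = n_leaves * per_leaf
    leaf_of = {e: e // per_leaf for e in range(n_entities)}
    triples = []
    for p in range(n_predicates):
        # relations are sampled independently per predicate
        for s, o in itertools.permutations(range(n_entities), 2):
            ls, lo = leaf_of[s], leaf_of[o]
            # same leaf community: always related; otherwise use LCA level
            prob = 1.0 if ls == lo else level_probs[lca_level(ls, lo)]
            if rng.random() < prob:
                triples.append((s, p, o))
    return triples
```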
FB15k-237
The FB15k-237 dataset (Toutanova & Chen, 2015) is a subset of the FB15k dataset (Bordes et al., 2013), created by removing redundant and inverse triples. The original FB15k dataset is in turn a subset of a 2013 version of Freebase, from which triples were queried. The FB15k-237 dataset comprises 272115 triples, 14541 entities, and 237 predicates, thus presenting a computational challenge to our model if modeled in whole. To address this issue, we generated a subset of the data and derived ground truth community labels in an approach inspired by Jain et al. (2021). Specifically, entities were mapped to the WordNet taxonomy (Miller, 1995) through the
Wikidata
The Wikidata dataset was generated by querying Wikidata for triples relating to people and locations. Specifically, artists and footballers corresponding to Wikidata identifiers
Quantitative Evaluation
In our quantitative evaluation, we first analyzed the quality of our learned hierarchical clustering by calculating two clustering quality metrics at each level of the hierarchy: the adjusted Rand index (ARI) (Hubert & Arabie, 1985) and normalized mutual information (NMI) (Shannon, 1948). This type of evaluation jointly assesses the quality of the learned community hierarchy as well as the membership of entities to communities. The ARI is an adjustment of the commonly used Rand index (RI) (Rand, 1971), corrected to account for chance. Chance is factored in by calculating the expected RI given a random clustering and measuring the obtained clustering's deviation from it. Formally, given an obtained entity clustering
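For concreteness, the ARI can be computed from the contingency table of two labelings; the sketch below follows the standard Hubert and Arabie formulation, with names of our own choosing:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (RI - E[RI]) / (max(RI) - E[RI]), computed from the
    contingency table of the two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))   # contingency table
    rows = Counter(labels_true)                      # row marginals
    cols = Counter(labels_pred)                      # column marginals
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:   # degenerate case, e.g. a single cluster
        return 1.0
    return (index - expected) / (max_index - expected)
```

Identical clusterings score 1 regardless of how the cluster labels are permuted, and a random clustering scores close to 0 in expectation.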
ARI and NMI Scores (Mean
In general, the results indicate that our model is capable of learning a coherent community hierarchy on each of the three datasets tested. Perhaps unsurprisingly, communities at higher levels in the hierarchy are judged to be of higher quality by the two evaluation metrics. This is because clustering entities at higher levels is a simpler task, as the communities are less fine-grained. For instance, on the FB15k-237 dataset, clustering at level 1 requires the distinction between
ARI and NMI Scores (Mean

Plots of average log likelihood of our model across burn in samples on three datasets.

Excerpt of our induced hierarchy on the FB15k-237 dataset. Entities in communities

Excerpt of our induced hierarchy on the Wikidata dataset. Entities in communities

Plots of learned community relations for selected outgoing predicates for the FB15k-237 and Wikidata datasets. Specifically, we showcased community
We can also analyze the complete log likelihood as a function of the number of Gibbs samples taken in the inference process. While this does not provide information about the quality of the obtained results, it does verify the inference process itself. Specifically, we expect the log likelihood of our model to rise as more burn-in samples are drawn, indicating that the likelihood of generating the knowledge graph given the current state of the sampler is increasing and that learning is taking place. We can see this rise in Figure 8, which plots the complete log likelihoods of our model across Gibbs samples for the three datasets. We note a dip in log likelihoods on the SBT and Wikidata datasets. This is likely due to the sampler being temporarily stuck in a low-likelihood region of the sample space before escaping it. The underperformance on the SBT dataset compared to the baseline approaches is largely attributable to our method's stochastic nature. Recall that after burn-in, final samples are drawn; unless the log likelihood is sufficiently high, these samples will contain suboptimal allocations. Such samples are inherent to Gibbs sampling but result in poorer quantitative performance. We hypothesize that drawing more burn-in samples would stabilize the final samples and produce better results. This is supported by the log likelihoods in Figure 8 but, to maintain consistency between the three datasets, was not explored further.
In our qualitative evaluation, we leverage the qualitative attributes possessed by a useful taxonomy as identified by Nickerson et al. (2013). Although these attributes were proposed in the context of taxonomy development, we make use of them in our work as the task of hierarchical clustering shares many of the underlying evaluation principles. Indeed, a taxonomy is implicitly induced using our method, although never explicitly labeled. The proposed taxonomy attributes are as follows: concise, robust, comprehensive, extendable, and explanatory. For each of these attributes, we provide a brief description extended to hierarchical clustering and use it to evaluate our model. A concise clustering is limited in both the number of obtained clusters and the semantic diversity of the entities that constitute each cluster. In our method, this is largely regulated by the hyperparameters

Robustness in clustering refers to "maximizing both within-group homogeneity and between-group heterogeneity, [making] groups that are as distinct (nonoverlapping) as possible, with all members within a group being as alike as possible" (Bailey, 1994). We see robustness as an issue when examining the non-footballer professions in Figure 9. Pianists, journalists, politicians, and scientists are not sufficiently semantically homogeneous to warrant their inclusion in community

A taxonomy is comprehensive if it can classify all entities in the domain. Clusterings, including the hierarchical clustering obtained by our method, are induced empirically from the data and thus necessarily classify all entities into communities. As such, our generative model is comprehensive. An extendable clustering is one that allows for the dynamic inclusion of additional information and, in the hierarchical case, for structural adaptation to incoming information. In both respects, our method is highly extendable.
Indeed, the Gibbs sampling process itself requires the removal of entities from the hierarchy before they are resampled. Each resampling not only has the potential to change the community assignment of the entity but also to change the structure of the hierarchy itself. Due to the infinite formulation of the nCRP and stick breaking process, there is no prior constraint on the structure. The final qualitative attribute identified in a useful taxonomy is that it is explanatory. In this regard, the taxonomy should provide useful explanations about the objects it is organizing. In the context of hierarchical clustering, these explanations could, for instance, take the form of community labels. Although our model does not assign labels to communities, they can be ascertained by examining the type information of their constituent entities. For instance, in Figure 9, the FB15k-237 communities
In this article, we demonstrated the use of stochastic blockmodels for learning hierarchies from knowledge graphs in an approach that is, to the best of our knowledge, the first to marry these two fields in an academic setting. To this end, we proposed a model which leverages the nCRP and stick breaking process to generate a community hierarchy composed of a knowledge graph's constituent entities. The model is defined non-parametrically and thus makes no assumptions about its structure, allowing it to adapt to the knowledge graph and potentially induce an infinite number of communities on an infinite number of levels. In addition to the model itself, a Markov chain Monte Carlo scheme leveraging collapsed Gibbs sampling was devised for posterior inference of the model's parameters. The model was evaluated on three datasets using quantitative and qualitative analysis to demonstrate its effectiveness in learning coherent community hierarchies on both synthetic and real-world data. The qualitative analysis made use of attributes commonly employed in taxonomy evaluation, presenting a novel and principled approach for qualitative analysis of hierarchical clusterings. Future work will investigate the scalability of applying our model, and stochastic blockmodels more generally, to knowledge graphs. As discussed earlier, the time complexities of the inference schemes for these models usually do not allow for scaling to the types of large knowledge bases encountered in real-world applications. The inference scheme proposed in this work is one such instance; however, several methods more scalable than collapsed Gibbs sampling exist in the literature. An example is variational inference, a scheme that uses the evidence lower bound to guide the inference process and obtain the posterior distribution. This method is generally faster than Gibbs sampling and, although not asymptotically exact, produces similar results (Salimans et al., 2015).
The challenge with this approach is that its optimization equations are more difficult to derive than those of Markov chain Monte Carlo methods. Despite this, several works have successfully applied variational inference to probabilistic graphical models. Indeed, the original inference scheme for the Mixed Membership Stochastic Blockmodel leveraged variational inference. Furthermore, Blei and Jordan (2006) provided a variational inference scheme for Dirichlet processes, and a variational inference scheme for the nCRP was proposed in Wang and Blei (2009). Departing from the variational approach, Chen et al. (2017) proposed an evolution of the Gibbs sampling algorithm for the nCRP based on partially collapsed Gibbs sampling. This approach improved runtime by over two orders of magnitude compared to the classic Gibbs sampling approach. Another approach to increasing the scalability of stochastic blockmodels is to devise a model which does not require sampling all
