Abstract
Introduction
Biographical dictionaries are scholarly resources used by the public and by the academic community alike. Most national biographical dictionaries follow the traditional form of combining a lengthy non-structured text, often written with authorial individuality and personal insight, with a structured synopsis of basic biographical facts, such as family relations, education, works, career events, and so on. Biographies are an invaluable information source for researchers across various disciplines with an interest in the past [30]. A well-known example of a biographical dictionary is the Oxford Dictionary of National Biography (ODNB)1
In this paper, we use the BiographySampo portal and its data, based on the National Biography of Finland (NBF), to study and analyze biographees, their lives, and the source material with two goals in mind. Firstly, we argue and show that publishing biographies as Linked Data opens up new possibilities for studying them by distant reading [41,42]. Secondly, the analyses present novel insights into the nature and contents of the NBF. Here, our focus is on the historiographical analysis of biographies. We anticipate that comparable results can be obtained if the methodology and tools introduced are applied to similar national biographical dictionaries. Our approach can also be applied to other domains of Cultural Heritage data, such as museum collections, library catalogs, manuscripts in archives, archaeological finds, etc., as demonstrated by the Sampo series of semantic portals8
In Finland, the National Biography collection and several other collections of biographical and prosopographical data have been compiled and are maintained by the Finnish Literature Society (SKS)9
The kernel of the collection is the National Biography of Finland (
The NBF contains 6500 lives and goes back a thousand years in history. The National Biography of Finland was one of the largest projects ever carried out in the field of history in Finland: it involved twenty historians serving in the three editorial boards (Swedish era, Russian era, and Independence era) and over 900 other scholars who wrote the biographies. The writing of the articles began in 1993 and the first articles were published online in 1997 when Finland celebrated her 80 years of independence. The majority of the biographies were written before the year 2000. Some 6000 articles were published in print in 2003–2007 (Suomen kansallisbiografia 1–10 [31]) by the Finnish Literature Society.
Early on in the project, half of the 6000 lives to be commissioned were allocated to the period of independence from 1917 onward. The Swedish era, from the earliest decades to 1809, and the Russian era, from 1809 to 1917, were each given 25 percent of the entries.
Contrary to most national biographical dictionaries, the NBF includes people who are still alive, although most of them are already past the peak of their career and activity. The reason was the emphasis on the period of independence in the work of the editorial board. Had only deceased Finns been included, the big picture of the independence era created by the lives would have been incomplete and distorted.
In addition to the NBF, the Finnish Literature Society has also published other biographical collections, e.g., the Finnish Clergy 1554–1721 and 1800–1920, the Finnish Generals and Admirals in the Russian armed forces 1809–1917, and the Finnish Business Leaders, totaling today over 13100 biographies. The biographies have been made available also as a web service.10
BiographySampo is available online at www.biografiasampo.fi. Prosopography is a method for studying groups of people through their biographical data. Its goal is to find connections, trends, and patterns in these groups.
BiographySampo is based on the Sampo model [17] that formulates the idea of aggregating and publishing distributed, heterogeneous local data sources in a global linked data service. In this way, the data of all data providers can be enriched with each other’s content, by reasoning based on Semantic Web standards, and the global data can be used easily across original local data silo boundaries. This arguably creates a sustainable “business model” where every data provider wins through collaboration, and of course the end users in particular. Data alignment and linking in this approach is based on a shared global data model and a set of shared domain ontologies (places, people, etc.) that are used for describing the contents of the different data sources for semantic interoperability.
The data is searched, explored, and analyzed in a standardized way, as follows. Firstly, the landing page of the portal provides the user with multiple “perspectives” for searching and exploring the underlying data. In our case, biographical data can be accessed from seven search perspectives [21]: Persons, Places, Lives on maps, Statistics, Networks, Relations, and Linguistics. Secondly, each perspective provides the end-user with a semantic faceted search engine, where the results can be filtered and found flexibly by making selections using a set of orthogonal facets (e.g., place, time, person, etc.). Thirdly, after filtering down a target set of entities of interest, the set can be analyzed and visualized using a variety of ready-to-use data-analytic tools. For example, various map- and network-based visualizations and statistics are available. Furthermore, the SPARQL endpoint of the underlying Linked Open Data service can be used for querying, analyzing, and visualizing the data in flexible ways using tools, such as Yasgui [44] for SPARQL, or Jupyter13
Biographical collections can be used to study the underlying historical world. However, the texts, the language used, and the biographical collection as a whole can also be studied from a different, historiographical perspective as an artifact reflecting its own time, the editorial values and biases in selecting the biographees, and the authors’ perspectives, and also from a linguistic point of view. Such analyses have already been made for some national dictionaries of biography, e.g., for the ODNB [55] and the Irish Ainm [2].
Christopher N. Warren claims [55] that national dictionaries of biography, such as the ODNB, speak with a double voice: they give us information about things as they happened, but are at the same time a testimony about how a key piece of historiographical infrastructure was made. He sees the ODNB as data and, at the same time, as a historical artifact. There are also related studies using, e.g., Wikipedia articles as the data source [29,39]. This paper presents, in the same vein, a study of the National Biography of Finland. The methods and tools created in our work for the analysis are generic and can be re-used for similar tasks based on Linked Data standards. The data and SPARQL endpoint used are available at the Linked Data Finland platform15
Aside from publishing biographical dictionaries in print and on the Web, representing and analyzing biographical data has grown into a new research and application field. In 2015, the first Biographical Data in Digital World workshop BD2015 was held, presenting several works on studying and analyzing biographies as data [51], and the proceedings of BD2017 contain more similar works [7]. In [34], analytic visualizations were created based on U.S. Legislator registry data. The idea of biographical network analysis is related to the Six Degrees of Francis Bacon system16
Extracting Linked Data from texts has been studied in several works, cf. e.g. [8,43]. In [6] language technology was applied for extracting entities and relations in RDF using Dutch biographies in the BiographyNet.17
This paper is structured as follows. First, an overview of the NBF data and its transformation into Linked Open Data is given. After this, various data analyses are presented and discussed using the tools of the portal as well as Google Colab scripting. Finally, issues related to data quality and interpretation of the analyses are discussed, and directions for further research are outlined.
This section explains the contents of the NBF data used in our analyses, and how the source data was transformed into Linked Data and published in a SPARQL endpoint on the Semantic Web.

Number of biographies by biographee’s birth decade; screenshot from the BiographySampo portal.
BiographySampo contains some 13100 biographies including the core NBF and four supplement datasets: Finnish Clergy 1554–1721, Finnish Clergy 1800–1920, Finnish Generals and Admirals 1809–1917, and Business Leaders. The NBF alone contains 6478 entries, 5268 men, 929 women, 11 couples, and 268 families [22]. In the NBF dataset, there were also two individual biographees whose gender is missing in the data. The earliest biographee is a saint approximately from the year 200, whereas there are also many biographies about living persons in the collection, such as Jenni Haukio, the current First Lady of Finland. The distribution of the biographical texts by decade can be seen in Fig. 1. In this paper, only men and women in the core NBF dataset are considered; the couples and the families are left out as well as the other four supplement datasets mentioned above.
A biography text in the NBF consists of two major parts. First, there is a narrative text on the life of the biographee, including a lead section. This text is written in ordinary natural Finnish. The text is used in the online version of the NBF and includes hand-coded HTML links to related biographies in the collection; this is the only semantic markup in the text. After the free text section, a summary of the person’s life is presented, including basic data about the biographee (name, birth, death, etc.) and information about family relations, life events, and career achievements [56]. In the NBF, the summary is unstructured text, too, but written in a semi-formal language using different section headings and notations for separating, e.g., information about family relations from career achievements. The sentences in the semi-formal part are shortened, use specific shorthand notations, and do not, e.g., have predicates.
In addition to the biographical text, the NBF data includes structured metadata about the biographies and the biographees available as a spreadsheet in CSV format. The metadata contains the basic biographical information of the biographee, i.e., person names with possible variations like maiden or altered names, places and times of birth and death, vocational/occupational group of the person (Politics, Economics, Science, etc.), and a link to the photo of the person. The metadata is used as the basis for searching biographies in the online version of the NBF. In addition to biographical metadata, the dataset included information about the authors of the biographies, their gender and birth year.
In addition to the biographies, BiographySampo also makes use of several external data sources for enriching the data. For example, the biographees are linked with
Transformation into Linked Data
In BiographySampo, the metadata CSV as well as the textual biographies were analyzed and transformed automatically into linked data, and links to external data sources were established. The modeling choices, transformation, and enriching of the data have been described in various articles throughout the project [22,24,35,48,49]. The result was published as a SPARQL endpoint that was used as the basis for the semantic portal and the analyses presented in this paper. The data in the service can be divided into the following conceptual categories:
Family relations are modelled using the Bio CRM model [52], an extension of the CIDOC CRM standard. The method and process of extracting the family relations is described and the results are evaluated in [35].
The quality of the data in these categories in terms of uncertainty, incompleteness, and errors is different depending on the data source and the knowledge extraction process used. This matter will be discussed later in chapter 3 when presenting and interpreting the analyses made using these data.

Amounts of extracted biographical and linguistic data.
The final outcome of the knowledge extraction process is illustrated in Fig. 2. The linked data is divided into mutually related biographical and linguistic knowledge graphs. The size of the knowledge graphs is documented in terms of the number of instances in different classes, except for the values of LOD cloud links and Morphological data, which are numbers of triples. For example, the biographees were involved in altogether 117000 events during their lives, and the free text parts contain nearly 7 million words.
Finally, the transformed knowledge graphs were published openly (under the CC BY 4.0 license,27
See the dataset home page at
In this chapter, we present analyses based on the NBF data service. In BiographySampo there are ready-to-use tools [35,36,49] for general statistics and more conceptual categories such as linguistic analysis, network analysis, and map visualizations. This chapter starts with general statistics. After this more detailed analyses based on the conceptual categories of data are presented and interpreted. Some analyses can be tested online in BiographySampo as part of the tool set available there. For others, the SPARQL endpoint has been used with Google Colab, and a variety of Python data analysis and visualization tools such as Matplotlib.31
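As a rough illustration of the scripting workflow mentioned above, the following minimal sketch flattens a response in the standard SPARQL 1.1 JSON results format into plain Python dictionaries that can then be analyzed further. The function name `bindings_to_rows`, the variable names, and the sample response are assumptions made for illustration; they are not taken from the BiographySampo data service.

```python
import json


def bindings_to_rows(result_json):
    """Flatten a SPARQL 1.1 JSON result document into a list of dicts."""
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Hand-written response with made-up data (not real NBF content).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["person", "birthYear"]},
    "results": {"bindings": [
        {"person": {"type": "uri", "value": "http://example.org/p1"},
         "birthYear": {"type": "literal", "value": "1867"}},
        {"person": {"type": "uri", "value": "http://example.org/p2"}},
    ]},
})

rows = bindings_to_rows(SAMPLE_RESPONSE)
```

Rows with a missing birth year simply lack that key, which makes incomplete data visible to later analysis steps instead of hiding it behind a default value.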
The general statistics of the NBF can be created and visualized in BiographySampo with versatile options. The statistics tell about the demographic nature of the people included in the dataset. The statistical tools are available online through a “Statistics” application perspective,32
In Fig. 1, the number of biographies has been plotted by decade. The plot is taken from the BiographySampo portal’s statistical analysis page. The decade is selected based on the birth year of the biographee. The distribution peaks for people born between the end of the 19th century and the beginning of the 20th century, who were active when the Finnish identity as a sovereign nation was established. There are also a few peaks in earlier periods that are generally less well known in Finnish history. In some cases, the data is not accurate enough and the birth year of a biographee is not known. In these cases it has been set to the beginning of a century, which explains the earlier peaks at the beginning of each century.
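The decade binning behind such a plot can be sketched as follows. This is a minimal example on made-up birth years, not the portal’s implementation; it also shows how placeholder years set to the beginning of a century inflate the first decade of that century.

```python
from collections import Counter


def decade_of(year):
    """Return the first year of the decade that contains `year`."""
    return (year // 10) * 10


def biographies_per_decade(birth_years):
    """Count biographies by birth decade, skipping unknown (None) years."""
    return Counter(decade_of(y) for y in birth_years if y is not None)


# Made-up data: the two 1800 entries stand for imprecise birth years
# that were set to the beginning of the century.
example_years = [1867, 1865, 1800, 1800, 1804, None]
per_decade = biographies_per_decade(example_years)
```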

Number of male and female biographees alive on a timeline.
Similarly to [55], we have plotted the distribution of people alive on a timeline based on the biographees’ birth and death data (Fig. 3). Because total population figures for Finland are lacking before the 1900s, we cannot compare the biographees with the general population; instead, we contrast women with all biographees. The blue curve is the total number, the dashed red curve the number of females, and the dotted line the proportion of females. The curves indicate that the largest number of biographees lived during the first half of the 20th century. The total curve appears smooth and does not show sudden changes due to historical events, e.g., the Second World War. The female percentage reaches a local maximum during the late 19th century and has grown steadily since 1950.
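A timeline of this kind can be computed from birth and death years alone. The sketch below is an illustrative reconstruction on made-up records, not the code used for Fig. 3; it assumes that a missing death year (`None`) means the person is still alive.

```python
def alive_counts(people, years):
    """Return {year: (total_alive, females_alive, female_share)}.

    `people` is an iterable of (gender, birth_year, death_year) tuples;
    a death year of None is treated as "still alive" (an assumption).
    """
    result = {}
    for year in years:
        total = 0
        female = 0
        for gender, birth, death in people:
            if birth <= year and (death is None or year <= death):
                total += 1
                if gender == "female":
                    female += 1
        share = female / total if total else 0.0
        result[year] = (total, female, share)
    return result


# Made-up records for illustration only.
example_people = [
    ("male", 1867, 1951),
    ("female", 1880, 1940),
    ("male", 1900, 1986),
    ("female", 1948, None),
]
timeline = alive_counts(example_people, [1890, 1950, 2000])
```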

Average lifespan of the biographees; screenshot from the BiographySampo portal.
The BiographySampo portal also allows one to look at the properties of the biographees, such as their average lifespan depicted in Fig. 4. The average lifespan for all biographees is 70.2 years. When comparing the male and female biographees, women live on average to the age of 72.2 years and men to 69.8 years. Most biographees have died during their adulthood, but there are a few exceptions. For example, Sigfrid Jusélius (1887–1898),33

Average age of marriage; screenshot from the BiographySampo portal.
The statistics application perspective of BiographySampo also gives insight into the life events of the biographees, such as getting married or having children. For example, Fig. 5 shows that the biographees got married on average at the age of 29, but there are also a few teen marriages and some older couples. A comparison of male and female biographees shows that women marry younger, on average at the age of 26, than men, who marry on average at the age of 30. Men also marry more often after the age of 60.

Average number of spouses for female and male biographees; screenshots from the BiographySampo portal.

Average number of children for female and male biographees; screenshots from the BiographySampo portal.
There are also statistics about the number of children and spouses in the portal. Figure 6 shows the number of spouses for women and men, and Fig. 7 the number of children. These plots are taken from BiographySampo’s statistics comparison view. Women’s statistics are on the left-hand side and men’s statistics on the right-hand side. According to the statistics, most women are married but have no children, whereas most men are married to one partner and likewise have no children. On average, men have more children than women. Based on further data analysis using SPARQL queries,37 Query amount of unmarried and childless men and women: Query most common jobs for unmarried and childless persons:

Sankey diagram depicting the correlations between the vocations of husbands and wives; screenshot from the BiographySampo portal with English translations in red text.
The BiographySampo portal allows users to generate statistical visualizations of correlations between, e.g., vocations or places of birth or death between biographees and their relatives. The Sankey diagram in Fig. 8 visualizes correlations between the vocations of spouses so that husbands’ vocations are on the left and their wives’ on the right. The visualization suggests, for example, that men having a vocation related to theater often have an actress (
The NBF dataset also contains the vocations of each biographee except for 116 people. In this article the terms vocation and vocational group are used instead of terms occupation and occupational group. The vocation term is used because the person data contains in addition to occupational titles also, for example, honorary titles, academic degrees, and ranks of the peerage.
The biographees were distributed into vocational groups already at the stage when the collection was being mapped out by the editorial board. They chose to use a fairly standardized vocational classification previously used by other research projects in the 1980’s, which was slightly modified to include all vocational groups in the NBF.
The use of vocational groups has a dual goal. On one hand they gave the editorial board a means to compose a diverse collection of biographies, and on the other hand they give the reader one more possibility to search the biographies. The vocational groups made it possible to take into account the different sectors and periods of Finnish history in selecting the biographees. The vocational groups are also useful as a search feature since they categorize the different titles (e.g., prime minister) to domains (e.g., politics).
Table 1 lists the 10 most common vocations for all, female, and male biographees. The number in parentheses after the vocation indicates the number of occurrences. The lists of the most common vocations for all biographees and for men are similar, although the order of the titles may differ. The most common of these vocations appear for both female and male biographees. However, there are vocations that are mostly related to only one gender, like Lutheran minister and merchant for males, or actress and queen for females. Queen appears among the female vocations because the dataset contains all the historical rulers of Finland with their spouses.
Most common vocations by gender
In addition to vocations, there are also vocational groups for each biographee in the data. The vocational groups categorize the different titles, such as director, to different domains. Figure 9 depicts the distribution of the most common vocational groups in the NBF. In this figure, the vocational domains have been grouped based on the vocational grouping in the data. For example, musicians, authors, and artists are considered to be in the group

Most common vocational groups in the NBF.

Correlations of the most common vocational groups.

The most common vocations ranked on a timeline.

The most common vocations on a timeline.
As mentioned earlier, a biographee can belong to more than one vocational group. Figure 10 depicts the most common intersecting vocational groups for biographees who have more than one vocational group. For example, Field Marshal, president Gustaf Mannerheim (1867–1951)39
The most common vocations also vary over time, as depicted in Figs. 11 and 12. Figure 11 shows the ranking of 12 of the most common vocations and Fig. 12 the total number of people with these vocations. The figures show that some vocations, e.g., director, professor, or author, have a constantly high rank throughout the timeline. On the other hand, vocations like minister or reporter start gaining a higher rank during the late 19th century. Actor gains its highest rank in the years 1930–50, and naturally there are no movie actors before the cinema was invented and brought to Finland. Furthermore, some vocations, such as merchant or Lutheran minister, descend in rank in the 19th century.
The biographies have 5410 mentions of a father and 5310 mentions of a mother. In 619 cases the father also has a biographical entry, while 94 of the mothers have biographies. Especially for earlier biographees, it is common that the vocation of the mother is not mentioned. There are approx. 5850 mothers whose vocation remains unknown, while 1130 fathers are missing this information. For example, there are 340 cases where the father is a farmer and 256 cases where he is a Lutheran minister. In cases where the father is a farmer, one could assume that the mother has been a farmer’s wife, although this is not mentioned in the data entries.
Table 2 shows the 10 most common vocations of the biographees’ parents. Six different columns were chosen similarly as in [55]. In the table, teacher, farmer’s wife, and nurse appear as the most common vocations of a mother, and farmer, director, and merchant as the most common vocations of a father. On the other hand, some vocations of the biographees (Table 1), like minister, painter, or scholar, do not appear in the parent data at all. Baroness and queen appear in the list of men’s mothers, indicating that among nobility, the mother often has a biography entry in the dataset in her own right. The bottom row shows the number of cases where the information about a parent’s vocation was not available.
Most common vocations of parents by gender

Correlations between the vocational groups of parents and children.
Figure 13 depicts the correlation between the vocational groups of a child and his/her parents. The rows correspond to the groups of a child and the columns to the groups of a parent. The number of biographees in each group is given in parentheses after the group label. The values in the cells are normalized so that the values in each column sum up to one; that is, each cell indicates the conditional probability of the child’s group given the parent’s group. The dominant values on the diagonal of the matrix indicate an obvious correlation between the groups of a parent and of a child. The strongest correlations are found in the groups of
Events include the births and deaths converted from the structured CSV data, supplemented with the lifetime events extracted from the semi-formal descriptions. An event usually contains a timespan and a possible reference to a place; we have extracted these mentions so that the event data can be illustrated on maps and timelines. The birth information was available for 6210 and the death information for 5800 out of the total of 6230 people. The semi-formal chapter of lifetime events was split into paragraphs describing the career, achievements (works, acknowledgments, etc.), and a list of references. 5080 biographies contained a description of career and 3450 of achievements. Many of the people without a career description were historical figures for whom records of education or vocations are not available. The data extraction generated 69400 events of career, 29900 events of achievement, and 18000 mentions of honor.

Timeline with the number of events.
The timeline in Fig. 14 depicts the number of events by year, e.g., births, deaths, and events related to a person’s career. In general, the curve follows the distribution of people alive shown in Fig. 3. It reaches its highest count around 1918, the time of the Russian revolution, the beginning of Finland’s independence, and the Finnish Civil War. On the other hand, the curve shows a dip in 1942, during the Second World War. This decrease is explained by missing events in people’s civil careers, although there are military personnel in the data. Furthermore, before the 1850s the data is so sparse that major events of the time, e.g., wars or plague pandemics, do not form distinct peaks in the figure.
Similarly to [55] we have ranked the ten most often mentioned places on a timeline in Fig. 15 but the illustration also contains names of towns and cities. The data was binned to intervals of 20 years. Helsinki became the capital of Finland in 1812 and has a constant highest ranking from the 1840’s onward. The chart also shows a strong connection to Sweden with even more events than with the former capital Turku. Paris has had a high ranking during the latter half of the 19th century when it was a popular location for, e.g., university studies. The United States started to gain attention in the early 20th century. This attraction peaked during the decades 1940–1960. The old Finnish city of Vyborg lost its significance after the Second World War when it was annexed by the Soviet Union.
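The binning and ranking step can be sketched as follows. The example is a minimal illustration on made-up events; the alignment of the 20-year bins to multiples of 20 and the alphabetical tie-breaking are assumptions, not details given for Fig. 15.

```python
from collections import Counter, defaultdict


def rank_places(events, bin_size=20, top_n=10):
    """Rank places by number of events within fixed-width time bins.

    `events` is an iterable of (year, place) pairs. Returns a dict that
    maps the start year of each bin to its top places, most frequent first.
    """
    counts = defaultdict(Counter)
    for year, place in events:
        counts[(year // bin_size) * bin_size][place] += 1
    ranking = {}
    for start in sorted(counts):
        # Sort by descending count, then by name, so ties are deterministic.
        ordered = sorted(counts[start].items(), key=lambda kv: (-kv[1], kv[0]))
        ranking[start] = [place for place, _ in ordered[:top_n]]
    return ranking


# Made-up events for illustration only.
example_events = [
    (1805, "Turku"), (1810, "Turku"), (1815, "Helsinki"),
    (1845, "Helsinki"), (1850, "Helsinki"), (1855, "Paris"),
]
place_ranking = rank_places(example_events)
```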

Top 10 places on a timeline.
Figure 16 depicts a simplified illustration showing the referenced countries or continents. Generally, biographees have had close connections to Sweden and Germany, and historically also to Russia, although its significance has decreased during the 20th century. The Baltic countries have increased their ranking after gaining independence from the Soviet Union. The third position of the United States after the 1940s is explained by, e.g., international studies. Africa has gained an increasing rank after the 1960s due to, e.g., development aid activities organized by the United Nations.

Top 15 countries on a timeline.

Comparing life maps of male (left) and female (right) biographees in the NBF in the BiographySampo portal.
BiographySampo also provides the user with a map search view40
There is also a Life Maps application perspective in the portal. This perspective contains two kinds of prosopographical tools: (1)

Extract from the reference network.
Based on the person data and extracted person references, the BiographySampo portal also contains network visualizations of people and how they are referenced in biographies. The networks enable the study of egocentric and socio-centric networks. In addition to using the BiographySampo portal, it is also possible to study the networks by using SPARQL queries to get the data. As an example, Fig. 18 depicts an extract around the vocational categories culture (marked with red) and politics (marked with blue), with black for other groups. The network is generated using the HTML links because of their coverage; currently, person references are extracted only for people born in the 1900s. The HTML links refer to people in different datasets of SKS and were made only for the first occurrence of a biographee’s name. The graph shows that the politicians form one solid cluster, while the people in the culture vocational group are divided into three smaller clusters. When the corresponding biographies are analyzed by close reading, these clusters turn out to represent literature, classical music, and popular culture.
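How a reference network can be derived from hand-coded HTML links is sketched below using only the Python standard library. The identifiers `p1`, `p2`, and `p3` and the link format are hypothetical; the actual NBF markup and the extraction pipeline may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect distinct href targets of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Keep each target once, mirroring first-occurrence-only linking.
            if href and href not in self.targets:
                self.targets.append(href)


def reference_edges(biographies):
    """Turn {source_id: html_text} into a list of (source, target) edges."""
    edges = []
    for source, html_text in biographies.items():
        parser = LinkCollector()
        parser.feed(html_text)
        edges.extend((source, t) for t in parser.targets if t != source)
    return edges


# Made-up biographies with hypothetical identifiers as link targets.
example_biographies = {
    "p1": '<p>Worked with <a href="p2">B</a> and <a href="p3">C</a>, '
          'and later again with <a href="p2">B</a>.</p>',
    "p2": "<p>No links here.</p>",
}
example_edges = reference_edges(example_biographies)
```

The resulting edge list is a directed graph in its simplest form and can be passed to any network analysis tool.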

Sentences that reference people.
In addition to enabling browsing of the data via networks, the tools in BiographySampo also enable link analysis currently only for biographies with HTML links. For each person, there is a view44

Plotting number of references by decade using the BiographySampo portal.
BiographySampo also contains a chart for each biography, where the links from the source biography to other target biographies are calculated based on the birth decade of the target. This is illustrated in Fig. 20, where the references of a source biographee and people referenced in the source’s biography are plotted by their decade of birth. These plots show (a) the influence of the source biographee by decade48 I.e. by the birth year of the person whose biography references the source biographee. By their decade of birth.
The BiographySampo portal has no ready-to-use tools for counting references between biographies. In such situations, one can use the SPARQL API of the data service directly to find out, for example, who the most often referenced or “important” biographees are, based on the HTML links. Table 3 lists the top 10 people most commonly referenced in the biographies of women, whereas Table 4 is based on counting the references from the biographies of men. In addition to the reference counts, the tables contain corresponding listings in the right column based on the PageRank centrality measure of the reference network. The PageRank measure and algorithm [3,4] were developed at Google to sort search results in order of relevance: the idea is to calculate a web page’s importance recursively based on the number of times the page is referred to and the PageRank of the referencing pages, which emphasizes the value of references from highly ranked pages. Using PageRank leads to quite different rankings from those based on counting.
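The difference between counting references and PageRank can be illustrated with a small self-contained sketch. It uses a plain power iteration written for this example rather than the library implementation used for the tables; the damping factor 0.85, the iteration count, and the toy graph are assumptions.

```python
from collections import Counter


def reference_counts(edges):
    """Count incoming references per target node."""
    return Counter(target for _, target in edges)


def pagerank(edges, damping=0.85, iterations=100):
    """Compute PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping + damping * dangling) / n for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Toy graph: "x" is referenced three times, "y" only once, but by "x".
toy_edges = [("a", "x"), ("b", "x"), ("c", "x"), ("x", "y")]
toy_counts = reference_counts(toy_edges)
toy_rank = pagerank(toy_edges)
```

In this toy graph, `x` has the most incoming references, but `y` obtains the highest PageRank because its single reference comes from the highly ranked `x`.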
Top 10 referenced people in female biographies
The PageRank measures have been calculated using the NetworkX Python library50
Top 10 referenced people in male biographies
Table 5 lists the people with the highest centrality measures during chosen periods in the history of Finland. The data was generated by first constructing the entire graph, then filtering the people related to each period, and finally picking the ten people with the highest PageRank measures. The first column covers the years (–1809) when Finland was a part of Sweden. The first row under the header gives the number of people in each period. Most of the people in the first column are monarchs of Russia or Sweden, with Peter the Great, Emperor of Russia, in first place and Empress Elizabeth in second. During the period of the second column (1809–1917), the Grand Duchy of Finland was an autonomous part of the Russian Empire. In contrast to the first column, the highly ranked people are not monarchs but prominent figures in Finnish culture and politics, such as the politician J. V. Snellman and the poets and writers J. L. Runeberg and Z. Topelius. The third column, covering the early years of Finnish independence (1918–1944), contains mostly presidents and significant politicians of the era, as does the fourth column, covering the years 1945–1994 between the Second World War and Finland joining the European Union. One can, e.g., notice that presidents Paasikivi and Kekkonen as well as Field Marshal, president Mannerheim are present in both columns. In general, all the columns covering Finnish independence (1918–) are dominated by politicians.
People with highest PageRank values during five historical periods
Of the references from male biographies, 93.3% point to a male biography and only 6.7% to a female biography. By contrast, 28.2% of the references from female biographies point to a female biography. The average number of links in a biography is 4.18, and there is no significant difference between the genders.
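Shares of this kind can be derived from a list of links and a gender lookup, as in the following minimal sketch. The data and the function name `target_gender_shares` are hypothetical.

```python
from collections import Counter

def target_gender_shares(links, gender):
    """For each source gender, return the share of links per target gender."""
    counts = Counter((gender[s], gender[t]) for s, t in links)
    totals = Counter()
    for (source_gender, _), n in counts.items():
        totals[source_gender] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}

# Hypothetical biographees and reference links (source, target).
gender = {"a": "female", "b": "female", "c": "male", "d": "male"}
links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("c", "d"), ("d", "c"), ("c", "a"), ("d", "c")]

shares = target_gender_shares(links, gender)
```

The result maps each pair of source and target gender to a fraction, normalized by the number of links leaving biographies of the source gender.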
The age difference between linked biographees was also studied: on average, the mentioned person is 6.18 years older than the biographee. However, for females the average is 8.93 years, while for males it is 5.73. A histogram of the age differences is depicted in Fig. 21, where negative values refer to an older person. The histogram shows that the modes of the female and male distributions are both around zero, indicating that all people have plenty of links to people of nearly the same age. On the other hand, females have more links to people who are 20–75 years older, while men have more links to people who are 10–50 years older than themselves. This observation may be partly explained by the more frequent mentions of relatives in female biographies. These statistics were calculated from random samples of the same size from both genders in order to avoid the bias caused by the dominance of men in the data.
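The balanced sampling step can be sketched as below. The sketch assumes that ages are compared through birth years and that the sample size equals the size of the smaller group. Neither detail is stated in the text, and all names and data are hypothetical.

```python
import random
from statistics import mean

def balanced_mean_age_gap(links, birth_year, gender, seed=0):
    """Mean birth-year gap per gender, computed on equal-size random samples."""
    gaps = {}
    for source, target in links:
        # Negative gap: the referenced person was born earlier, i.e. is older.
        gap = birth_year[target] - birth_year[source]
        gaps.setdefault(gender[source], []).append(gap)
    size = min(len(values) for values in gaps.values())
    rng = random.Random(seed)
    return {g: mean(rng.sample(values, size)) for g, values in gaps.items()}

# Hypothetical data: one link from a female and three from male biographies.
birth_year = {"f1": 1900, "m1": 1900, "m2": 1900, "old": 1880}
gender = {"f1": "female", "m1": "male", "m2": "male"}
links = [("f1", "old"), ("m1", "old"), ("m1", "m2"), ("m2", "m1")]

result = balanced_mean_age_gap(links, birth_year, gender)
```

Drawing the same number of links from each group keeps the larger male group from dominating any statistic that is later pooled or compared across genders.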

Histogram of differences in age of linked biographees.
Percentages of references to relatives by gender
Table 6 shows the percentage of references between a biographee and a relative who is also a biographee. The studied relations are parents, spouses, children, siblings, and other relatives, e.g., cousins, grandparents and grandchildren, or in-laws. The table clearly indicates that females in general have more relatives in the dataset. On average, females have 2.11% of relatives mentioned in their biographies, while the corresponding value for men is 1.17%. In particular, the spouse is mentioned in 0.74% of female biographies but in only 0.11% of male biographies.
Figure 22 depicts the correlation between the vocational groups of two linked biographees. The numeric values of rows, columns, and cells follow the same principle as in Fig. 13. The strongest correlations are found in the groups of

Correlations between the vocational groups of linked biographees.
The data has been enriched by linking mentions of people in the biographies, complementing the existing HTML links in the source data. The F-score of the HTML links in the source dataset is 97.3%. The result was calculated for 181 links from 35 biographies sampled randomly from the dataset. In a few cases, biographies had not linked people who had a biography (mainly because they were written before the linking could be made), and in a couple of cases the links pointed to the wrong people. Some biographies had no links to other biographies. Typically, the biographies of athletes had no links because they only mentioned people such as teammates or coaches, and biographies are rarely written about coaches or lesser-known athletes. Links were found in 75.5% of the biographies of athletes, while in the other vocational groups over 81% of biographies had links. By gender, 88.2% of female and 89.8% of male biographees had links. The automatically extracted links add missing relations between biographees in addition to mentions of people who do not have biographies in the dataset. These automatically created links are used alongside the HTML links in the BiographySampo portal in a contextual reader application for the biographies and in reference networks.51
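For reference, an F-score of this kind combines precision and recall, as in the minimal sketch below. The sketch assumes the balanced F1 variant. Links pointing to the wrong people count as false positives, and missing links to people who have a biography count as false negatives. The counts are hypothetical and are not the evaluation data behind the 97.3% figure.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) for a link evaluation."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only.
precision, recall, f1 = f_score(
    true_positives=90,   # correct links
    false_positives=5,   # links pointing to the wrong person
    false_negatives=10,  # missing links to people who have a biography
)
```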
Table 7 contains general metrics of the four networks: (1) the manually linked HTML network, (2) the automatically linked network, (3) the network linked both manually and automatically, and (4) the genealogical network. The table first gives the numbers of nodes and edges in each network. Average degree indicates the average number of links for a single node, and highest degree (HD) is the highest node degree in the network. Max clique size is the size of the largest clique; e.g., a value of 8 indicates that there exists a subgroup of 8 people who are all linked to one another. The table also shows the number of separate components in the network and the size of the largest connected component. Note that the genealogical network is scattered into numerous separate components, while the three reference networks are all more connected, with giant components connecting most of the data points. The Diameter is the number of edges along the longest shortest path between any two nodes in the network. Alpha
Comparison between the four networks in the BiographySampo data using standard network metrics
When comparing the results shown in Table 7, one has to remember that the automatic references complement the graph of HTML links, which is clearly shown by the node and edge counts, the average and highest degrees, and the giant component size. The last network, the genealogical network, is completely different in nature, as its people are linked by family relations.
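Several of the metrics in Table 7 can be computed with plain breadth-first search, as in the following sketch. It treats the network as undirected, measures the diameter inside the largest component, and omits the maximum clique size. These are assumptions of the sketch rather than a description of how the table was produced, and the edge list is hypothetical.

```python
from collections import deque

def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def bfs_distances(adjacency, start):
    """Shortest-path lengths (in edges) from start to every reachable node."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances

def network_metrics(edges):
    adjacency = build_adjacency(edges)
    degrees = {n: len(neigh) for n, neigh in adjacency.items()}
    components, seen = [], set()
    for node in adjacency:
        if node not in seen:
            component = set(bfs_distances(adjacency, node))
            seen |= component
            components.append(component)
    giant = max(components, key=len)
    # Diameter: the longest shortest path inside the largest component.
    diameter = max(max(bfs_distances(adjacency, n).values()) for n in giant)
    return {
        "nodes": len(adjacency),
        "edges": sum(degrees.values()) // 2,
        "average_degree": sum(degrees.values()) / len(adjacency),
        "highest_degree": max(degrees.values()),
        "components": len(components),
        "giant_component_size": len(giant),
        "diameter": diameter,
    }

# Hypothetical toy network with two separate components.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d"), ("e", "f")]
metrics = network_metrics(edges)
```

Running one breadth-first search per node is quadratic in the worst case, which is acceptable for a sketch but may be slow on a network of many thousands of nodes.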
Comparison between five example networks and reference networks of BiographySampo
Hashmi et al. [11] used a random sampling strategy for calculating the network measures in their study of the structural similarity of social, communication, and collaboration networks. The example networks in their study are the Twitter Friendship Network, the Epinions Social Network, the Wikipedia Vote Network, the EU Email Communication Network, and the Author Network. Their strategy was to sample subgraphs of 500 nodes with a breadth-first search and then to calculate each value as the average over ten such samples. Table 8 shows our reference networks in comparison with the five example networks analysed by Hashmi et al.; we used the same strategy to calculate the metrics. Comparing the values to their results shows that, e.g., the number of edges, and therefore also the densities, in our reference networks are in the same range as in the Email and Author networks. Also the values indicating a small world or scale free behavior, e.g., CCG and
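The sampling strategy described above can be sketched as follows. The choice of a uniformly random start node, the use of density as the example metric, and all function names are assumptions of the sketch. The toy ring network stands in for a real reference network.

```python
import random
from collections import deque
from statistics import mean

def bfs_sample(adjacency, size, rng):
    """Collect up to `size` nodes by breadth-first search from a random node."""
    start = rng.choice(sorted(adjacency))
    visited, queue = {start}, deque([start])
    while queue and len(visited) < size:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour not in visited and len(visited) < size:
                visited.add(neighbour)
                queue.append(neighbour)
    return visited

def density(adjacency, nodes):
    """Density of the undirected subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edge_count = sum(len(adjacency[u] & nodes) for u in nodes) / 2
    return edge_count / (n * (n - 1) / 2)

def averaged_metric(adjacency, metric, size=500, samples=10, seed=0):
    """Average a subgraph metric over several breadth-first samples."""
    rng = random.Random(seed)
    return mean(metric(adjacency, bfs_sample(adjacency, size, rng))
                for _ in range(samples))

# Hypothetical toy network: a ring of 20 nodes, sampled 5 nodes at a time.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
ring_density = averaged_metric(ring, density, size=5, samples=10)
```

Every 5-node breadth-first sample of the ring is a path with 4 edges, so each sample has density 4/10, and so does the average.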

Number of words in biographies by decade; screenshot from the BiographySampo portal.
The biographies in BiographySampo can also be studied from a linguistic perspective in the Language Analysis view52
In addition to the general statistics about the word count by decade, the user can get a list of the biographies with the highest and lowest word counts. Table 9 lists the ten longest and ten shortest biographies based on their word counts. In Table 9(a), the list of the longest biographies consists mainly of politicians, presidents, and regents of Finland, with one exception, Mikael Agricola. In Table 9(b), the list of the shortest biographies contains people with different vocations, such as a local government official, two artists, a lesser-known ruler, an athlete, and a priest. Most of the people in the list of the longest biographies were in power or active during and after World War II, such as President Urho Kekkonen. The list of the shortest biographies contains people who were active in the Middle Ages or in the 18th and early 19th centuries.
Longest and shortest biographies
Table 10 lists the ten vocations with the highest and lowest average word counts in biographies, together with the number of biographies in each group. In Table 10(a), the list of vocations with the highest average word count consists mainly of vocations that also dominated the list of the longest biographies. The first group in the list has only 7 biographies, written by different authors, and covers the lovers, muses, and favorites of politicians, artists, nobility, and military personnel who lived before Finnish independence. The other groups contain more biographies and have lower average word counts. In contrast, Table 10(b) lists the vocations with the shortest biographies (the lowest average word count). These include vocations such as artisans, athletes, families, clergy, and government administrative officials. Some of these were also found in the list of the shortest biographies. The vocational group with the shortest biographies is athletes, followed by artisans and judicial authorities.
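An average of this kind is only meaningful together with the group size, which the following minimal sketch returns alongside it. Whitespace tokenization and the toy texts are assumptions, and the portal may count words differently.

```python
from collections import defaultdict

def word_count(text):
    """Count whitespace-separated tokens in a text."""
    return len(text.split())

def average_length_by_group(biographies):
    """Map each vocational group to (average word count, number of biographies)."""
    lengths = defaultdict(list)
    for text, groups in biographies:
        for group in groups:
            lengths[group].append(word_count(text))
    return {g: (sum(v) / len(v), len(v)) for g, v in lengths.items()}

# Hypothetical biographies: (text, list of vocational groups).
biographies = [
    ("long text about a head of state " * 3, ["politician"]),
    ("short note", ["athlete"]),
    ("another short athlete note", ["athlete"]),
]

averages = average_length_by_group(biographies)
```

Reporting the count next to the average makes small groups, such as one with only a handful of biographies, easy to spot before drawing conclusions from their averages.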
Top 10 longest and shortest texts by vocation
In addition to word counts, the actual words and their frequencies can be listed for a filtered set of biographies. Table 11 lists the most common words (nouns, adjectives, and proper nouns) and the most common keywords for the whole NBF. The list of adjectives (Table 11(c)) contains common adjectives such as Finnish, new, first, and great. These lists become more descriptive once the most common stop words are ignored. Table 11(a) lists the most common keywords of the biographies and the number of different biographies in which they appear (column Count). The keywords have been extracted from the nouns in the biographies using the basic TF-IDF method. As can be seen from the table, this method typically picks up titles and other attributes related to the people described in the biographical texts, such as professors, kings, or women. In comparison, Table 11(b) lists the most common nouns in the biographies, which are similar to the words in the keyword listing but in singular form (e.g., university and professor). However, each of these nouns constitutes roughly 0.6% or less of the nouns and 0.2% or less of all the words in the dataset. All the keywords in the top 10 list can also be found in the list of the top 50 nouns.
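A basic TF-IDF keyword extraction can be sketched as below. The exact weighting used for Table 11 is not specified in the text, so the sketch assumes relative term frequency multiplied by the natural logarithm of the inverse document frequency. The toy documents stand in for lists of lemmatized nouns.

```python
import math
from collections import Counter

def tfidf_keywords(documents, top_n=3):
    """Return the top_n TF-IDF terms for each tokenized document."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    n_docs = len(documents)
    keywords = []
    for doc in documents:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_n])
    return keywords

# Hypothetical documents, each a list of lemmatized nouns.
docs = [
    ["professor", "university", "professor", "finland"],
    ["king", "sweden", "finland", "king"],
    ["writer", "poet", "finland"],
]

top_keywords = tfidf_keywords(docs, top_n=1)
```

A term that occurs in every document, such as `finland` here, receives a score of zero, which is why the method favours titles and attributes that distinguish one biography from the others.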
Top 10 words and keywords in BiographySampo
Top ten words used in the biographies of female politicians
As mentioned earlier, the user can use facets to select any subset of the data for inspection. As an example, we have selected the most common words used in the biographies of male and female politicians (e.g., MPs, presidents, ministers, rulers, and other political influencers in Finnish history). Tables 12 and 13 list the top ten nouns and adjectives for female and male politicians in BiographySampo. Each table contains a list of words for each group and the count of each word. Both lists have been created by querying the biographical texts for the top words of each part-of-speech group and filtering out the most common words using a Finnish stop word list.53
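The counting step can be sketched as follows, assuming the texts have already been lemmatized and part-of-speech tagged. The tag names, the tiny stop word set, and the English tokens are hypothetical stand-ins for the Finnish data.

```python
from collections import Counter

def top_words(tagged_tokens, pos, stop_words, n=10):
    """Most common lemmas with the given part-of-speech tag, minus stop words."""
    counts = Counter(
        lemma.lower()
        for lemma, tag in tagged_tokens
        if tag == pos and lemma.lower() not in stop_words
    )
    return counts.most_common(n)

# Hypothetical (lemma, part-of-speech) pairs from a filtered set of biographies.
tagged = [
    ("minister", "NOUN"), ("new", "ADJ"), ("year", "NOUN"),
    ("minister", "NOUN"), ("parliament", "NOUN"), ("new", "ADJ"),
    ("first", "ADJ"),
]
stop_words = {"year"}

top_nouns = top_words(tagged, "NOUN", stop_words, n=2)
top_adjectives = top_words(tagged, "ADJ", stop_words, n=2)
```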
Top ten words used in the biographies of male politicians
The BiographySampo dataset contains data not only about the biographees and their relatives but also about the authors of the biographical texts and the publishing dates of the texts. In this section, statistics about the articles and their authors are presented based on SPARQL queries to the data service.
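A query of this kind could be issued from Python along the following lines. The endpoint URL, the `ex:` namespace, and the `ex:author` property are placeholders for illustration and do not describe the actual BiographySampo data model. Only the request format (SPARQL 1.1 Protocol over HTTP GET) and the JSON results format are standard.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and vocabulary: both are assumptions.
ENDPOINT = "https://example.org/sparql"
QUERY = """
PREFIX ex: <https://example.org/schema/>
SELECT ?author (COUNT(?article) AS ?articles)
WHERE { ?article ex:author ?author . }
GROUP BY ?author
ORDER BY DESC(?articles)
"""

def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

def rows(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]

# A hand-written response in the standard SPARQL JSON results format.
sample = json.dumps({
    "head": {"vars": ["author", "articles"]},
    "results": {"bindings": [{
        "author": {"type": "uri", "value": "https://example.org/a1"},
        "articles": {"type": "literal", "value": "12"},
    }]},
})

request = build_request(ENDPOINT, QUERY)
parsed = rows(sample)
```

Against a real endpoint, the request would be sent with `urllib.request.urlopen(request)` and the response body passed to `rows`. The sketch stops short of the network call so that it runs offline.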
The authors were chosen by the editorial board based on their expertise and previous research. Precedence was given to researchers who had recently published on the person in question or who had a deep knowledge of a specific field or period of history. The whole group of authors, more than 900 Finnish scholars, is so large and varied that it is very difficult to scrutinize, especially because the authors come from so many fields of research. In addition to historians, they include specialists in various fields, e.g., art studies, jurisprudence, and medicine. The majority had a doctoral degree and a university affiliation. The group cannot be easily analyzed, since the information in the editorial database includes only the title and date of birth of each author but not the affiliation or the field of study.
The authors had to undertake to follow the guidelines and goals of the NBF, set by the editorial board. All articles were peer reviewed before being accepted for publication.

Number of articles written yearly in total.
Since the publication of the NBF in print from 2003 to 2007, only 400 new biographies have been published. These newer articles were written thematically and include biographies of people in different minorities, politicians, authors, actors and actresses, movie makers, theater directors, music educators, circus performers, and cartoonists.
The distribution of the number of articles published yearly can be seen in Fig. 24. The figure shows how the articles have been published from 1997 onward until 2016 (the most recent articles are not included in the BiographySampo). The figure has peaks before 2008 (the end of the publishing in print) and afterwards a minor peak in 2010 when a collection of new articles called the Multifaceted Finland was published online. Figure 25 depicts the distribution of how old the authors were when publishing biographies. The distribution also shows the difference between male and female authors.

Author age distribution.
Statistics about the male and female authors of the biographies can be seen in Table 14, which also indicates the gender of the biographees they write about. Female writers make up 32% of all writers in the dataset, so male writers (68%) dominate. There are three authors whose gender is unclear in the data, but they have written only 90 articles (approximately 1% of the articles). A closer inspection of whom the authors write about shows that men write mainly about men (94%), while women write about both genders. Of the female authors, 41% have so far written only about men and 26% only about women, while 5.7% of the male authors have written only about women.
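The shares of authors who have written about only one gender can be computed per author, as in the following sketch. The function name `exclusive_author_shares` and the toy data are hypothetical.

```python
from collections import defaultdict

def exclusive_author_shares(articles, author_gender):
    """Share of authors, per author gender, who wrote about only one gender."""
    subjects = defaultdict(set)
    for author, biographee_gender in articles:
        subjects[author].add(biographee_gender)
    totals, exclusive = defaultdict(int), defaultdict(int)
    for author, genders in subjects.items():
        g = author_gender[author]
        totals[g] += 1
        if len(genders) == 1:
            exclusive[(g, next(iter(genders)))] += 1
    return {key: n / totals[key[0]] for key, n in exclusive.items()}

# Hypothetical articles: (author, gender of the biographee).
articles = [
    ("w1", "male"), ("w1", "female"), ("w2", "male"),
    ("m1", "male"), ("m2", "male"), ("m2", "male"),
    ("m3", "female"), ("m4", "male"),
]
author_gender = {"w1": "female", "w2": "female",
                 "m1": "male", "m2": "male", "m3": "male", "m4": "male"}

author_shares = exclusive_author_shares(articles, author_gender)
```

Note that these shares are computed over authors, not over articles, so a prolific author counts once.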
Breakdown of articles written by men and women
Table 15 indicates that the female authors have written more often about people who are known influencers of culture, rewarded individuals, or people active in charitable or non-governmental organizations. In contrast, the male writers have mainly written about prominent politicians, scientists, or economic influencers. According to the editorial policies of the NBF, the authors did not choose their target biographees freely but were asked by the editors to write about particular people. The authors were selected based on what was known to be their areas of expertise.
Most popular vocational groups of biographees for female and male authors
Through its portal, BiographySampo offers historians and the public data analytic tools that can be used for biographical and prosopographical research without experience in computer science. With a little experience in formulating SPARQL queries and/or Python programming, the underlying SPARQL endpoint can be used for custom-made complex data analyses. In this paper, both approaches were used for creating historiographical analyses of the core part of the BiographySampo data, the National Biography of Finland. In addition, we have evaluated our methods to estimate the reliability of our results. Our approach gives scholars novel biographical and prosopographical tools for analyzing individual persons and their groups. The tools combine the quantitative approach and distant reading methods [28] with the qualitative approach, often based on close reading, that is typical of biographical research. The portal contains numerous views that enable the users to study the lives of the biographees as well as prosopographical groups in terms of statistics, maps, language usage, and networks based on references made in the biographies or on the family relations extracted from the biographical descriptions.
The key findings of this paper give insight to the editors of the National Biography as well as to researchers in biography, prosopography, and historiography. They also highlight the possibilities and issues in modeling historical data related to, e.g., editorial choices, modeling uncertainty, serendipitous knowledge discovery, and data literacy.
Using automatically structured linked data in research requires a new kind of data literacy from the end user. As discussed above, some parts (subgraphs) of the NBF dataset in BiographySampo are based on reliable hand-coded metadata, while others were created by the machine. In big datasets like this, it is not possible to check and correct the generated data manually, so more errors are to be expected than in manually curated datasets. Furthermore, the linked data approach is based on using explicit classifications and ontologies, about which different opinions may arise. In many cases, the underlying real world is too complex to be modelled fully in practice. For example, the historical place ontology underlying BiographySampo covers centuries of places that in reality change over time: Finland was part of Sweden until 1809, then part of Russia until becoming independent in 1917, and after that some parts of her were annexed by the Soviet Union, which was later succeeded by modern Russia.
The gaps in describing the lives of historical figures also caused challenges for analytics and data modeling. There are irregularities in the descriptions of biographees, their relatives, and vocations due to the lack of reliable historical sources. This makes knowledge extraction challenging at times and increases the possibility of errors, as the algorithms may misinterpret the original data and skip or mislabel data, resulting in, for example, mislabeled family relations and anomalies in statistical or network visualizations. For example, similarly to what is mentioned by [28], the exact birth and death years of some people who lived in the early days of history are not known, and heavily rounded inexact dates, such as 1100, appear in the data. The source data does not tell whether a year such as 1100 is rounded or is actually a precise value. Without better knowledge, the system now assumes that all dates are accurate, resulting, e.g., in a peak of 100-year-old people in statistical visualizations. This phenomenon indicates how source criticism and an understanding of the underlying data are needed when interpreting quantitative results. A mechanism for representing uncertainty in a machine-understandable way would be needed to address the problem, but it remains a topic for future research.
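The effect can be illustrated with a simple heuristic that flags years divisible by 100 as possibly rounded. This is only a sketch of the problem, not a mechanism used in BiographySampo, and all names and data are hypothetical.

```python
def possibly_rounded(year, granularity=100):
    """Flag years that are multiples of the granularity as possibly rounded."""
    return year % granularity == 0

def lifespans(people):
    """Yield (name, lifespan, suspect) for (name, birth year, death year) tuples."""
    for name, birth, death in people:
        suspect = possibly_rounded(birth) or possibly_rounded(death)
        yield name, death - birth, suspect

# Hypothetical records: p1 has rounded medieval dates, p3 a precise year 1900.
people = [("p1", 1100, 1200), ("p2", 1805, 1878), ("p3", 1900, 1986)]

flagged = [name for name, _, suspect in lifespans(people) if suspect]
apparent_centenarians = [name for name, span, _ in lifespans(people)
                         if span == 100]
```

The heuristic flags the precisely dated `p3` together with `p1`. Without additional information in the source data, a rounded year cannot be told apart from a precise one.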
In our work, the data was transformed from the CSV format to RDF and used as an input for further enrichment and transformation. Modelling the person and document metadata as RDF facilitated creating the visualizations and performing the analyses depicted in this article. The transformation, extraction, and linking of the data were performed with satisfactory results (cf. Section 2.2). This data was used to enable distant reading by building data analytical applications and visualizations into BiographySampo. Unlike in [2,54,55], the data is stored in RDF format as knowledge graphs.
The Linked Data infrastructure created for BiographySampo also enables serendipitous knowledge discovery. The user can not only learn about the demographics through the statistical lens but also the connections between individual biographees through the network visualizations and reference analysis tools. The transformed knowledge graphs are published openly and can be queried with SPARQL to learn more about the data and the demographics.
Based on the analytics presented in this paper, we have shown how to use Linked Data and SPARQL to create statistical, linguistic, and network analytics and visualizations for studying a biographical data collection and its demographic features. These applications are related to the analytics presented in [2,54,55] but extend them to describe the NBF dataset and also consider how the data has been created and used [37]. The data quality is affected not only by the modeling and transformation process but also by the biases and the historical uncertainty that sometimes exist in the source data. In comparison to the Ainm [2], the NBF is also biased towards the period from the mid 19th century onward, whereas the ODNB [55] covers a wider span of time between the 16th century and current times.
Similarly to the Ainm and the ODNB, the visualizations tell the history of both the nation and the collection itself. The place visualizations in this paper conform mainly to Finnish historical narratives that are tied to Finland's neighbouring countries and to other European countries. Similar themes are present in the visualizations regarding relatives and vocations. Social structures differ between countries, so the results cannot easily be used for transnational comparisons. As in the Ainm and the ODNB, the demographic of our dataset consists mainly of men, while women are a minority. Furthermore, the networks are also influenced by the authors' decisions, as each reference to another person is based on a choice. This has also become evident through the language analysis, as the lists of the most common words in the biographies of women contain more words describing families than those in the biographies of men. However, the language usage requires closer inspection to sort out the influence of the authors, and this remains future work.
The Linked Data approach presented in this paper helps one to describe and analyze a biography collection with its strengths and weaknesses for further research, and to find out points of interest for close reading. The methods, results, and insights presented for the NBF can be utilized in DH research for other similar collections to learn more about the demographics of the collection itself, the underlying history, and to evaluate the reliability of the results.
