Abstract
Parliamentary plenary speeches as FAIR data for problem solving
A foundation of democracy in any constitutional state based on rule of law is openness and transparency of political decision making. An important requirement of this is provision of open access to parliamentary data for the voters, media, parliamentarians, and researchers of politics. The minutes of plenary sessions in parliaments in particular provide lots of information about the democratic decisions made, political life, language, and culture [4,15].
This paper concerns the problem of publishing and using data about the plenary session speeches of parliaments and the parliamentarians involved in the discussions. As a case study, the Parliament of Finland (PoF) is considered. An infrastructure and system called
The minutes of the plenary sessions of the PoF have been available openly as printed books at the Library of Parliament and Archive of Parliament, and later also through the PoF’s open data service as scanned PDF documents, HTML pages, or as XML documents, depending on which parliamentary sessions are in question.2 However, they have not been published as data in accordance with the modern FAIR principles in a Findable, Accessible, Interoperable and Re-usable form for searching, browsing, and data analytic applications.3 If the user knows during which parliament a speech was given, they can download, e.g., a scanned minutes book, which can be over a thousand pages long, and search for the speech and other information in the document. But if one wants, for example, to find out the answers to the following questions, this kind of online service and research method based on downloading and close-reading documents is not a viable solution:
The answers to questions of this kind, for example, can be determined computationally with the help of the
For example, to answer question (1) all speeches mentioning “NATO” can be first filtered using the text search facet. The results can then be sorted by the time of the speech or visualized on a timeline. For question (2) the faceted search hit counts available on the speaker facet and on the party facet after filtering speeches mentioning “finlandization” give the answer; the distributions of speeches along facets can also be visualized using, e.g., a pie chart on the facet. For answering question (3), the speech type facet is first set for filtering regular speeches. After this the speaker facet hit counts show that Mr. Vennamo has given the most speeches. By selecting him next on the speaker facet the result set of his regular speeches can be visualized on a timeline. Answering question (4) is based on the fact that interruptions were marked in the original textual transcripts of the primary minutes’ data and were extracted as RDF data when transforming texts into linked data. Google Colab with Python scripting based on SPARQL querying the underlying triplestore was used to calculate an “interruption matrix” telling who has interrupted whom and how often.
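The “interruption matrix” mentioned above can be illustrated with a minimal Python sketch. It assumes that the query result has already been reduced to (interrupter, speaker) pairs; this pair layout, the function names, and the sample rows are illustrative assumptions, not the actual query output of the system.

```python
from collections import Counter


def interruption_matrix(events):
    """Count how often each person interrupted each speaker.

    `events` is an iterable of (interrupter, speaker) pairs; this pair
    layout is an assumption made for the sketch.
    """
    return Counter((interrupter, speaker) for interrupter, speaker in events)


def top_interrupters(matrix, n=3):
    """Sum the matrix over speakers to rank the most frequent interrupters."""
    totals = Counter()
    for (interrupter, _speaker), count in matrix.items():
        totals[interrupter] += count
    return totals.most_common(n)


# Hypothetical sample rows; real rows would come from the triplestore.
sample = [("MP A", "MP B"), ("MP A", "MP B"), ("MP C", "MP B"), ("MP B", "MP A")]
matrix = interruption_matrix(sample)
```

A sparse mapping keyed by pairs is used here instead of a dense table because most pairs of speakers never interrupt each other.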
This paper presents the data publishing infrastructure
This paper is a substantially extended version of our earlier workshop paper [21],4 extending and aggregating results from our other papers about
In the following, related research on parliamentary data is first reviewed (Section 2). After this the data-driven creation of the underlying ontology of the PoF is discussed in Section 3 and the data production pipeline of the mostly textual speech data and its different outputs are explained (Section 4). Examples of using the
Related work on parliamentary speech data
Parliaments and cultural heritage organizations in different countries have created parliamentary speech corpora and digital parliamentary datasets of both historical and contemporary parliaments [12,37]. The goal of this work has been to improve the findability, accessibility, interoperability, and re-usability of these key documents of democratic societies for the public, researchers, and other users. The digitization has also allowed researchers to engage in novel and interdisciplinary research using the new parliamentary data. As part of the digitization and research initiatives, web user interfaces and data services have been developed that allow users to browse, study, and download the digitised materials. An example of this is the Lipad project and the Canadian Hansard6 [3].
The projects on parliamentary data have focused on the curation, annotation, and harmonization of the national parliamentary corpora. Semantic web technologies have also been applied for linking and enriching the parliamentary data with other datasets. In the pioneering project Linked Data of the European Parliament (LinkedEP), the debates of the European Parliament and the political affiliation information were connected as linked data to other datasets, such as DBpedia and the EuroVoc thesaurus [76]. The LinkedEP data was made available through a SPARQL endpoint and an online user interface. The Open Data Portal of the European Parliament today provides many datasets as LOD and in CSV format.7 Other examples of linked data parliament initiatives are the LinkedSaeima for the Latvian parliament [7], the Italian Parliament data,8 and the historical Imperial Diet of Regensburg of 1576 project [6]. An EU level initiative for harmonization and annotation of national parliamentary corpora is the ParlaMint project as part of the CLARIN infrastructure.9 The ParlaMint project applies the TEI-based Parla-CLARIN scheme,10 and has created uniformly annotated multilingual parliamentary corpora with its partners. The ParlaMint II corpus involves 27 national parliamentary corpora [54] (see also [55]).
In Finland, the minutes of the PoF have been digitized by the Parliament itself, but are challenging to use, as they have been produced separately for different periods, stored in different data formats, vary in quality, and lack descriptive metadata [36,67]. Subsets of Finnish parliamentary debates have been published before:
The FIN-CLARIN’s Language Bank11 [41] contains the speech corpus 2008–2016 of linguistically annotated plenary debates and also links to the session videos [48]. The Voices of Democracy project has produced a research corpus that includes grammatically annotated plenary minutes in 1980–2018 as well as interviews of veteran MPs conducted by the PoF after 1988 [2]. The International Harvard ParlSpeech Corpus [62] contains speeches of the Finnish parliamentarians 1991–2015 but has gaps in the coverage.
Parliamentary data is used in many fields, such as linguistics, political science, legal studies, media studies, economics, and history. Parliamentary debates data combined with the political affiliation information of the speakers makes it possible to study, e.g., (political) language and its use, legislative processes, political decision-making, and the debated societal issues (see for example [12,37]). Metadata and annotations make it possible to structure and differentiate the speeches, for example, between parties, gender, government-opposition role, or by professional groups, and to filter and analyse the speeches based on the annotated features. Parliamentary data also allows long-term studies as the data often extends over several decades or even a century [13].
Parliamentary debates have also been used in thematic and conceptual analyses (e.g., [10,13,26,29,31,60]) and to study the language and the opinions of the parties or MPs (e.g., [1,2,5,44,49]). The speech data have been used in translation studies using, for example, the EuroParl Corpus12 of the European Parliament debates.
Several linguistic and social science studies have been conducted using debates of the PoF. La Mela [36], also Kettunen and La Mela [31], have studied the history of Nordic right of public access to nature, and examined the quality of the previous PoF open data. The digitized minutes have been utilized in the development of language technology methods [31]. Andrushschenko et al. [2] have used their grammatically structured corpus for selected digital humanities research cases. Simola [65] has explored the differences in political speech between parties in the long term (1907–2018), and Makkonen and Loukasmäki [47] have used topic modeling to study the plenary debates of PoF in 1999–2014. The FIN-CLARIN’s Parliamentary Corpus has been used, for example, by Lillqvist et al. [43] in their study on debates about public debt.
Previous applications of the Finnish parliamentary data cover only a small part of the entire time series of the Finnish parliamentary speeches. Data analysis tools to examine the results are few, such as the concordance analysis of the Language Bank Korp, where the words found are visualized in their textual contexts with statistics about word occurrences.
An ontology of the Parliament of Finland and its MPs
The data in
The data transformation pipeline of
This section describes the Ontology of PoF, i.e., the data model, and how it was populated with data instances in order to create the P-KG. The P-KG is used as a basis for the S-KG described later in Section 4.
How the Parliament of Finland works
The organization and activities of the PoF are documented in [15]. Legislation procedures in PoF can be initiated today by a
The organizational structure of the PoF has evolved in time. Creating an overarching ontology over different times is a challenge due to the dynamic nature of the PoF: lots of parties, groups, and other organizational units have been established, restructured, and vanished since 1907. Reassembling the history of the PoF from the documents available was deemed infeasible. Furthermore, it turned out that explicit descriptions of how the parliaments have worked in history were not readily available. We therefore created the PoF ontology in a data-driven fashion based on the data available concerning the MPs and other speakers in the plenary sessions and governments making the proposals [21].

Pipeline of transforming data about the MPs and other speakers of the plenary sessions into the ontology of the PoF.
The data transformation pipeline for the P-KG is depicted in Figure 1. The most important primary data source for creating the PoF ontology was the database of MPs provided by the PoF with some additional information from the Finnish Government web pages (on the left in the figure). It contains in a custom XML format biographical data about all MPs, such as date and place of birth, periods of time as an MP, electoral districts, memberships in parties, committees, other groups, and organizations, and publications of the MPs. From this data it was possible by using XML structures and Named Entity Recognition (NER) and Linking (NEL) to extract ontological classes for the PoF Ontology, such as electoral districts, parties, and committees, and at the same time populate the ontology with instances of people, committees, and other classes. Regular expressions worked well for NER and NEL was performed using custom Python scripting.
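The combination of regular-expression-based entity recognition and script-based linking described above can be sketched as follows. The membership string format, the pattern, the lookup table, and the example.org URIs are all hypothetical; the custom XML format of the MP database is not described here, so this is only an illustration of the general technique.

```python
import re

# Hypothetical pattern for strings such as "Centre Party 1970-1975".
MEMBERSHIP = re.compile(
    r"(?P<group>[^\d,;]+?)\s+(?P<start>\d{4})\s*-\s*(?P<end>\d{4})"
)

# Hypothetical lookup table from normalized labels to ontology URIs.
GROUP_URIS = {
    "centre party": "http://example.org/groups/centre-party",
}


def extract_memberships(text):
    """Recognize group memberships in text and link them to URIs if known."""
    results = []
    for match in MEMBERSHIP.finditer(text):
        label = match.group("group").strip()
        results.append(
            {
                "label": label,
                "start": int(match.group("start")),
                "end": int(match.group("end")),
                # Linking step: None marks a label that could not be linked.
                "uri": GROUP_URIS.get(label.lower()),
            }
        )
    return results


memberships = extract_memberships(
    "Centre Party 1970-1975; Finance Committee 1972-1975"
)
```

Labels that cannot be linked are kept with an empty URI in this sketch, so that they could later be reviewed or harvested as new instances.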
The XML data was first transformed into a CSV table

PoF ontology data model [42] based on Bio CRM.
The data model of the PoF Ontology extracted from the data is presented in Figure 2. It is based on the Bio CRM [74] ontology, an extension of CIDOC CRM17 for representing biographical information based on role-centric modeling. Bio CRM makes a distinction between attributes, relations, and events, where entities participate in different roles in a qualified manner. The namespaces used in the model are described in the figure on the left.
The key idea of the model is to represent an actor’s activities as a sequence of events (
There are almost 200 different roles in use in the PoF Ontology. The data model has been populated by the MP database and related sources as well as by using a set of external domain ontologies, such as places based on the ontology YSO Places,18 groups and organizations (harvested from the data), and vocations based on the AMMO ontology [34]. Table 1 summarizes the number of instances of the main classes of the data model of Figure 2, and Table 2 lists the number of different event types extracted.
Resources
Resources
Events
The ontology was created in a data-driven fashion. This means that if the data misses something, say the membership of an MP in a particular committee at a time, then the list of members in that committee instance is incomplete. It is known that the data is not fully complete. For example, the MP database for some old committees records only their chairs, not ordinary memberships. Checking and analysing possible missing data has not been done systematically afterwards; it is assumed that the database is complete in this sense and that the user is aware of the fact that this may not always be the case. Validation could be done based on historical sources that, e.g., provide lists of members in different committees in different times if such data can be found.
For validating the transformed data, the data model and its integrity constraints can be presented in a machine-processable format using the ShEx Shape Expressions language.19 We have made initial validation experiments with the PyShEx20 validator. Based on the experiments, we have identified some errors both in the schema and the data. We plan a full-scale ShEx validation phase integrated in the data conversion and publication process to spot and report errors in the dataset.
PoF ontology available online
The linked data is available on the LDF.fi platform as separate graphs interlinked with the S-KG in a SPARQL endpoint.21 The PoF Ontology with instance data is also available as RDF Turtle files on Zenodo.org22 using the CC BY 4.0 license. In addition, the central CSV data file
The data can be downloaded also through the
Speech data of plenary sessions
This section describes the data model of the Speech Graph S-KG and how it was transformed from the mostly textual plenary session minutes from different times.
Transformation pipeline for speech data
The plenary discussions in PoF consist of

Pipeline for transforming the minutes of plenary sessions into speech data.
Figure 3 illustrates the process used for transforming the minutes of the plenary sessions into datasets and data services on different publishing platforms. The data is first transformed into simple literal data CSV tables that are published using the national CSC Allas data store.24 The CSV format can be of use for DH researchers developing and using their own tools, and this data publication also serves as the primary source for publishing semantically richer versions of the data. The CSV data is then enriched into Parla-CLARIN XML TEI25 form that includes, e.g., identifiers for the speakers, and into ParlaMint format where additional linguistic annotations pertaining to, e.g., named entities in the texts are explicated. A ParlaMint subcorpus has also been created and published as part of the larger collection of European ParlaMint corpora provided by the ParlaMint platform26 [11]. The semantically richest publication form of the data is the RDF 1.1 Turtle27 version. This publication combines the KGs of speech data and the related P-KG of prosopographical data, based on the PoF Ontology enriched with additional data from external sources (cf. Section 3). This data has been published as data dumps on the Allas Store and Zenodo.org, and as a LOD service on the Linked Data Finland platform28 [25], including a SPARQL endpoint, content negotiation of URIs, linked data browsing, and other services. When enriching the CSV tables into XML and RDF formats, the interruption markup in the speeches is also extracted from the text and transformed into structured forms that can be used in data analyses.
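The extraction of interruption markup from speech text can be sketched in Python as follows. The sketch assumes that interruptions appear as non-nested parenthesized segments inside the speech text; the exact notation varies between periods, so the pattern, the function name, and the sample sentence are illustrative assumptions only.

```python
import re

# Assumption: an interruption is a non-nested segment in parentheses.
INTERRUPTION = re.compile(r"\(([^()]*)\)")


def split_interruptions(text):
    """Return the speech text without interruptions and the interruptions."""
    interruptions = [segment.strip() for segment in INTERRUPTION.findall(text)]
    # Remove the markup and normalize the whitespace left behind.
    clean = INTERRUPTION.sub(" ", text)
    clean = re.sub(r"\s+", " ", clean).strip()
    return clean, interruptions


clean_text, found = split_interruptions(
    "Mr. Speaker! (Shout from the left: Not true!) I will continue."
)
```

Keeping the cleaned text and the interruptions separate lets the same speech be used both for text analysis and for interruption statistics.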

Data model for speech data in the default namespace

Data model for the linguistic annotations of speech data 2015– in the default namespace
The data model of S-KG is depicted in Figure 4. The speeches of the latest and best quality dataset 2015– have been annotated with extracted named entities, keywords, and topical categories, and the data also includes lemmatized versions of the speeches. The data model for these annotations can be seen in Figure 5. More documentation about the S-KG data model can be found in [66,67], on the Linked Data Finland platform homepage of
In the transformation process the minutes are first transformed into simple textual CSV files. The rationale for producing and publishing CSV tables is that they can be used easily by spreadsheet programs for analysing the data and by using various computational methods. From a computational point of view, the CSV data can be created automatically because no advanced data processing, such as named entity linking, is included in the process. The only exceptions to this are the URI identifiers for the speakers and parties that are extracted from the Actors file
The speech data comes from three sources and formats depending on the time of the plenary session:

The percentage of recognized words with LAS tool using original PoF OCR (red dashed line) and our new OCR (blue line) results.
In order to extract their textual contents, we re-OCRed the PDF documents of the Corpus 1907–1999 using multilingual Deep Neural models, as presented in [9]. Figure 6 shows the percentage of recognized words across the whole documents with the Language Analysis Command-Line Tool (LAS) [45] using the original PoF documents and our new OCR results. The new OCR results are consistently better than the original PoF version, with the biggest improvement for the material from the 1920s, which is the most challenging period of time due to poor paper quality. The words are recognized on multilingual datasets using only Finnish morphology so they do not show the absolute word accuracy rate, which is estimated to be in the 98–99 % range for Finnish text [9]. Finally, long documents were split into 1–8 separate PDF files, each containing the minutes for several plenary sessions. The extracted texts were structured by Python scripting into the set of CSV tables.
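The idea behind the recognized-word percentage can be illustrated with a simplified Python sketch. LAS relies on morphological analysis; the plain lexicon lookup, the tokenization rule, and the sample words below are stand-in assumptions used only to show how such a ratio is computed.

```python
import re


def recognized_percentage(text, lexicon):
    """Percentage of alphabetic tokens in `text` that are found in `lexicon`.

    A plain set lookup stands in for morphological analysis here.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return 100.0 * known / len(tokens)


# Hypothetical miniature lexicon.
lexicon = {"eduskunta", "on", "koolla"}
clean_ratio = recognized_percentage("Eduskunta on koolla", lexicon)
# An OCR error ("ll" read as "11") splits the last word into unknown pieces.
noisy_ratio = recognized_percentage("Eduskunta on koo11a", lexicon)
```

As in the figure, such a ratio is a relative quality indicator and not an absolute word accuracy rate, because correct words missing from the lexicon also count as unrecognized.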

OCR example. On the left is a part of the original PDF-document; on the right is the same part with recognized text. [66].
Figure 7 shows an example of the original minutes for a plenary session on the left. In general, the minutes consist of items (or topics), marked here in bold (except the row
The format of each source corpus 1–3 differs in terms of the metadata included in the minutes. However, all formats contained the following core metadata elements about the session, speaker, and the speech: (1) Session data: session identifier, session date, and session starting and ending times; (2) Speaker data: last name and the speaker’s role/title; (3) Speech data: speech content, speech type, related documents, and debate topic. In the final speech CSV tables, each row contains an individual speech with the content and metadata elements represented in columns.
The structure of the CSV tables 1907–1999 and the CSV tables based on the HTML-formatted minutes in 2000–2014 are fairly similar with over 20 metadata fields, such as speech identifier, session, date, start and end times, name of the speaker, his/her party and so on. The CSV table format based on the XML files 2015– contains the following columns for metadata about speeches: party, topic, content, speech_type, status, version, link, lang, name_in_source, speaker_id, speech_start, speech_end, speech_status, and speech_version. More documentation about the data can be found at the Allas Store site.
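A minimal Python sketch of working with such a table is given below. It uses a subset of the column names listed above; the comma delimiter, the sample rows, and the function name are assumptions made for illustration and do not reproduce the published files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample using a subset of the documented column names.
SAMPLE = """speaker_id,party,lang,speech_type,content
p1,Party A,fi,regular,First speech
p2,Party B,sv,regular,Second speech
p1,Party A,fi,reply,Third speech
"""


def speeches_per(column, csv_text):
    """Count speeches (rows) for each distinct value of a metadata column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


by_party = speeches_per("party", SAMPLE)
by_language = speeches_per("lang", SAMPLE)
```

For a downloaded file, the `io.StringIO` object would be replaced by an opened file handle.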
In addition to metadata about a speech, the speech text itself contains mark-up metadata about possible interruptions of the speech using a special bracketed notation. In data 1907–1999 interruptions are marked with parentheses “(
The practices on how minutes of plenary sessions should be recorded are described in a lengthy 147-page document of the Minutes Office of the PoF (“pöytäkirjatoimisto” in Finnish) [32]. It is not fully known what kind of changes in practice there have been at different times. These changes may have implications on data analyses in some cases. For example, in 2021 it was decided that if the speaker only gives the floor to the next speaker without other content in his/her speech, then this is not recorded as a distinct speech of the speaker for simplicity. If the number of all kinds of speeches in different times is analyzed, a change in the recording practice may of course skew results statistically.
The CSV tables are published as files that were created on a parliamentary session basis, one file per parliamentary session (valtiopäivät) with the name
The CSV tables are available openly under the CC BY 4.0 license at the Allas data repository.33 The folder there includes (1) a zip file that contains the CSV data files of all parliamentary sessions, (2) the parliamentary session files as separate CSV files, and (3) a link to documentation. The last file of the current parliamentary session is updated daily. The CSV data of the past years is stable but can be updated on an irregular basis when, e.g., OCR errors etc. are found in the data. Information about the updates will be stored in the
The XML TEI-based Parla-CLARIN [11] schema is an attempt to define a common XML-based annotation model for parliamentary debates on an international level.34 For example, the Slovene parliamentary corpus siParl (1990–2018) has been encoded with the Parla-CLARIN schema [55]. Currently, the Parla-CLARIN schema is implemented in the ParlaMint project,35 which establishes a comparable and interoperable corpus of European parliamentary corpora for comparative research. This format is a specialization of Parla-CLARIN extending it with, for example, linguistic and named entity mention annotations.
Parla-CLARIN format includes not only speeches but also means for representing data about the context of the debates including data about the speakers, parties, related organizations, and places in a systematic way using XML identifiers for cross-reference. A benefit of using XML-based formats is the possibility of validating documents syntactically based on their schema definition.
The Parla-CLARIN version of the
Publication as Linked Open Data
The LOD version of the speech data was created from the CSV tables, too [66,67]. The latest corpus 2015– has been annotated semantically using Natural Language Processing (NLP) techniques as discussed in [69]:
In the S-KG, the speeches of the most recent parliamentary term 2019–2022 were automatically classified using the EKS categories. These categorizations are not exclusive but multi-label: a document may belong to different categories. In order to carry out the classification, the keywords were used as the basis for the internal text representation of the system, as described in more detail in [40]. The keywords were transformed into word embeddings via the corresponding pre-trained fastText model [50], which are then pooled together to create the document representation. The NLP-based annotations have been published as part of the
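The pooling of keyword embeddings into a document representation can be sketched with plain Python as follows. The two-dimensional toy vectors stand in for fastText embeddings, and the cosine-threshold assignment of categories is an assumption used only to show what a multi-label outcome looks like; the classifier actually used is described in [40].

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one document vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(keywords, embeddings, category_vectors, threshold=0.5):
    """Assign every category whose vector is similar enough (multi-label)."""
    vectors = [embeddings[k] for k in keywords if k in embeddings]
    if not vectors:
        return []
    doc = mean_pool(vectors)
    return sorted(
        name for name, vec in category_vectors.items()
        if cosine(doc, vec) >= threshold
    )


# Toy vectors; real ones would come from a pre-trained embedding model.
embeddings = {"tax": [1.0, 0.0], "budget": [1.0, 0.0], "school": [0.0, 1.0]}
categories = {"economy": [1.0, 0.0], "education": [0.0, 1.0]}
mixed = classify(["tax", "school"], embeddings, categories)
single = classify(["tax", "budget"], embeddings, categories)
```

In this toy setting a speech with keywords from two themes receives both categories, which mirrors the non-exclusive nature of the categorization.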
Using the ParliamentSampo data
This section discusses briefly different ways of using the
Exporting the data for external use
A simple way for a researcher to use
An example of using the

Number of speeches in different languages (y-axis) on the timeline (x-axis).
SPARQL is a flexible way to query RDF data. The search result is presented in a tabular format that can be examined as it is and be visualized and used for application-specific analyses. For example, Figure 8 shows a visualization of the number of Finnish (FI), Swedish (SV) and all (Kaikki) speeches (y-axis) in the S-KG graph on a timeline from 1907 to 2021 (x-axis). Before WW2, there were more speeches in Swedish than today, but the number remains very small. The graphic was created using the Yasgui editor48 [63], which can be used to edit SPARQL queries, target them to an online SPARQL endpoint, and to show the results using visualizations.
Data-analysis by scripting

Average annual lengths of speeches by all (kaikki), male (mies), and female (nainen) speakers, and the rising proportion of speeches by female speakers (naisten osuus).
The PoF data can be examined computationally using, for example, Python scripting and Jupyter notebooks in the Google Colab49 environment. Then one can use the simple HTTP protocol to perform SPARQL queries and after this analyze and visualize query results using tools provided by the programming environment, e.g., by Python libraries. An example analysis of using Google Colab is presented in Figure 9. It shows the yearly (x-axis) average lengths (y-axis) of speeches of all speakers (Kaikki), male speakers (Mies), and female speakers (Nainen), as well as the rising proportion of speeches by female speakers (Naisten osuus).
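A scripting workflow of this kind can be sketched with the Python standard library as follows. The request follows the standard SPARQL protocol and the result parsing follows the standard SPARQL JSON results format; the endpoint URL is left as a parameter, and the variable names `year` and `length` as well as the sample result are assumptions of the sketch, not the actual schema of the data.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict


def run_select(endpoint_url, query):
    """Send a SPARQL SELECT query over HTTP POST and return the JSON result."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint_url,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rows(result):
    """Flatten SPARQL JSON bindings into a list of plain dictionaries."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def average_length_by_year(table):
    """Average the (hypothetical) `length` variable for each `year`."""
    totals = defaultdict(lambda: [0, 0])
    for row in table:
        entry = totals[row["year"]]
        entry[0] += int(row["length"])
        entry[1] += 1
    return {year: total / n for year, (total, n) in sorted(totals.items())}


# Hypothetical result in the SPARQL JSON results format, for offline testing.
SAMPLE_RESULT = {
    "head": {"vars": ["year", "length"]},
    "results": {"bindings": [
        {"year": {"type": "literal", "value": "1999"},
         "length": {"type": "literal", "value": "100"}},
        {"year": {"type": "literal", "value": "1999"},
         "length": {"type": "literal", "value": "200"}},
        {"year": {"type": "literal", "value": "2000"},
         "length": {"type": "literal", "value": "50"}},
    ]},
}
averages = average_length_by_year(rows(SAMPLE_RESULT))
```

The averages could then be passed to any plotting library available in the notebook environment.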

Ten MPs with highest hub and authority values based on the HITS algorithm. The darker red, the larger authority value, and the darker blue, the larger hub value.
In [57], examples of analysing networks of MPs referencing each other in their speeches during the electoral term 2015–2019 were given using the Python package NetworkX [14]. Such a reference network has MPs as nodes and arcs point from the speaker to the mentioned person. The weight of the link corresponds to the total number of speeches with at least one mention. The network has in total 214 MPs that have been mentioned or have mentioned someone. The total number of mentions to other MPs extracted from the speeches is over 25,000. Mentions of people who were not MPs or ministers at the chosen electoral term were filtered out of the result set. Analyses of this kind of reference networks can reveal, e.g., the most active and influential MPs in parliamentary debates and help to recognize possible disputes between MPs or parties.
To study and visualize the network, hub and authority values were calculated using the HITS algorithm [33]. Ten MPs with highest authority values and ten nodes with highest hub values are shown in Figure 10. From the MPs with the highest authority values, Juha Sipilä, Timo Soini, and Petteri Orpo were ministers and leaders of their parties during the 2015–2019 term. During the same years, Jari Lindström was also a minister and Antti Rinne from the opposition served also as leader of his party. Five MPs, Eero Heinäluoma, Timo Harakka, Timo Heinonen, Pia Viitanen, and Ben Zyskowicz, are both top hubs that make references as well as top authorities often mentioned by other MPs. None of the MPs with highest hub values were ministers.
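The hub and authority computation can be illustrated with a stand-alone Python sketch of the HITS power iteration on a weighted reference network. The cited analysis used NetworkX, which provides its own `hits` function; the plain-Python version, the fixed iteration count, the sum normalization, and the sample edges below are assumptions made to keep the example self-contained.

```python
def hits(edges, iterations=50):
    """Compute hub and authority scores for weighted directed edges.

    `edges` is a list of (speaker, mentioned, weight) triples.
    Scores are normalized to sum to 1 after every iteration.
    """
    nodes = {node for source, target, _ in edges for node in (source, target)}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of those who mention you.
        auth = dict.fromkeys(nodes, 0.0)
        for source, target, weight in edges:
            auth[target] += weight * hub[source]
        # Hub: weighted sum of the authority scores of those you mention.
        new_hub = dict.fromkeys(nodes, 0.0)
        for source, target, weight in edges:
            new_hub[source] += weight * auth[target]
        hub = new_hub
        for scores in (hub, auth):
            total = sum(scores.values()) or 1.0
            for node in scores:
                scores[node] /= total
    return hub, auth


# Hypothetical mentions: A mentions M in 3 speeches, B mentions M once, etc.
sample_edges = [("A", "M", 3), ("B", "M", 1), ("A", "N", 1)]
hub_scores, auth_scores = hits(sample_edges)
```

In the sample, the frequently mentioned node gets the highest authority value and the node making most references gets the highest hub value.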
The
The Linked Data service is powered by the Linked Data Finland55 publishing platform that along with a variety of different datasets provides tools and services to facilitate publishing and re-using Linked Data. All URIs are dereferenceable and support content negotiation by using HTTP 303 redirects. The data is available as an open SPARQL endpoint.56 As the triplestore, Apache Jena Fuseki57 is used as a Docker container, which allows efficient provisioning of resources (CPU, memory), portability, and scaling. Varnish Cache web application accelerator58 is used for routing URIs, content negotiation, and caching.
The data services and the SPARQL endpoint can be used for developing applications. To investigate and test these opportunities, the semantic portal
The ParliamentSampo portal
The
Application perspectives for speeches and MPs
Based on the Sampo-UI framework, the landing page of the portal contains

Using faceted search to filter and analyze speeches about NATO.
For example, in Figure 11 the user has selected the Speeches perspective. Ten search facets, such as Content, Speaker, Party, and (Speech) Type, are available on the left. The search result, i.e., the speeches found, is shown by default in traditional tabular form on the right, but the result can also be visualized in other forms by selecting one of the five tabs on the top. Here the timeline visualization (AIKAJANA) is used. The user has written a query “NATO*” in the Content text facet, the speech type facet is set to regular speeches, and then 3622 regular speeches that mention the word “NATO” in its various inflectional forms have been filtered into the search result starting from 1959. In addition to the timeline visualization, by clicking on the pie chart visualization button on the Party facet, the distribution of NATO speeches in terms of parties is shown: the most active party with 722 speeches has been the right-wing National Coalition Party Kokoomus.
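The logic of combining a text facet with a category facet and then reading hit counts from another facet can be sketched in Python as follows. The portal itself works on a SPARQL endpoint; this in-memory version, its field names, and the sample speeches are illustrative assumptions only.

```python
import re
from collections import Counter


def facet_search(speeches, prefix=None, **selections):
    """Filter speeches by a word prefix in the content and by facet values."""
    pattern = None
    if prefix:
        # A prefix query such as "NATO*" also matches inflected word forms.
        pattern = re.compile(r"\b" + re.escape(prefix), re.IGNORECASE)
    hits = []
    for speech in speeches:
        if pattern and not pattern.search(speech["content"]):
            continue
        if any(speech.get(key) != value for key, value in selections.items()):
            continue
        hits.append(speech)
    return hits


def facet_counts(hits, facet):
    """Hit counts for each value of a facet within the current result set."""
    return Counter(speech[facet] for speech in hits)


# Hypothetical sample data.
speeches = [
    {"content": "Naton jäsenyys puhuttaa", "party": "Party A", "type": "regular"},
    {"content": "NATO and security", "party": "Party B", "type": "regular"},
    {"content": "School budget", "party": "Party A", "type": "regular"},
    {"content": "NATO question", "party": "Party A", "type": "reply"},
]
nato_regular = facet_search(speeches, prefix="NATO", type="regular")
party_counts = facet_counts(nato_regular, "party")
```

The counts computed on the filtered result set correspond to the numbers a pie chart on the Party facet would display.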

Using faceted search to study MPs, here using life charts of the members of the Centre Party from their places of birth to places of death.
A similar kind of application perspective with faceted search and tabs for visualizing the results is available for studying the MPs. Here 16 facets, such as Name, Gender, Party, and Occupation, are available for filtering a target group of MPs and other speakers that can then be visualized on tabs as a result table (TAULUKKO), using statistic pie charts and histograms (PIIRAKKA/PYLVÄSKAAVIO), using a timeline of births and deaths of the people (ELINVUODET), by life charts on maps (ELINKAARI), or by showing events related to the speakers on a map (KARTTA). In Figure 12, the facets are shown on the left and the result set on the different tabs on the right. Here the 507 members of the Centre Party were selected using the facet Party of the Speaker (Puhujan puolue) on the left and the life chart visualization tab is used. It shows arcs from the places of birth of the speakers (blue end) to places of death (red end). The MPs of this party, focusing on countryside farming matters, have clearly moved from all over Finland to Helsinki for their old age. By clicking on an arc, links to the homepages of the corresponding people can be found for close reading.
One user group of a system like
Sonia Zaki: Nearly a million speeches. The most active members of the Parliament have given amazingly many speeches. Here they are. (Melkein miljoona puhetta. Eduskunnan puheliaimmat kansanedustajat ovat pitäneet täysistunnoissa hengästyttävän määrän puheita. Tässä he ovat.) Helsingin Sanomat, June 26, 2022.60
Veera Paananen: Even though the Parliament has become more equal, men still interrupt other speakers much more often than women. (Vaikka eduskunta tasa-arvoistui miehet keskeyttävät muita puhujia yhä paljon enemmän kuin naiset.) Helsingin Sanomat, Dec. 26, 2022.61
Veera Paananen: Minister Ville Tavio has spoken about population change many times at the Parliament. (Ministeri Ville Tavio on puhunut väestönvaihdosta useita kertoja eduskunnassa.) Helsingin Sanomat, July 3, 2023.62
Alli Hallonblad: Members of the True Finn party have spoken about Islam and Africa much more often than members of the other parties. (Perussuomalaiset ovat puhuneet islamista ja Afrikasta eduskunnassa huomattavasti enemmän kuin muut puolueet.) Helsingin Sanomat, Aug. 7, 2023.63
The portal UI was implemented using a new declarative version of the Sampo-UI framework64 [61]. Here the UI with its components can be created on top of a SPARQL endpoint by using only SPARQL to access the data and with little programming by using a set of configuration files in JSON format in three main directories: (1)
Discussion and future work
This paper presented, discussed, and illustrated principles for publishing and using parliamentary textual and prosopographical data as Linked Open Data, using the PoF as a case study. The first experiments of using the data presented are promising in filtering patterns of possibly interesting phenomena in Big Data using distant reading [52]. However, traditional close reading by a human is needed as before in interpreting the results. The system presented provided several novelties in relation to the related works discussed in Section 2.
A major challenge in creating data analyses like the ones shown in this paper is related to the quality of the data produced. Historical (meta)data can be incomplete and our knowledge about it is uncertain. Also using more or less automatic means for transforming and linking the data leads to problems of incomplete, skewed, and erroneous data [46]. This as well as difficulties in modeling complex real world ontologies sometimes become embarrassingly visible when using and exposing the knowledge structures to end-users. For example, it is difficult to categorize historical occupations and historical places as they change in time, and the methods of network analysis can be very sensitive to even small errors in the data or biases in the sampling schemes. The same problems exist in traditional systems but are hidden in the non-structured presentations of the data. In general, more data literacy [35] is usually needed from the end-user when using data analytic tools.
The
In traditional close reading, the researcher is forced to delimit the data studied on, e.g., temporal or thematic grounds. Digital methods applied to big data, such as that of the
Planned future development of
The
In this paper, Finnish parliamentary data was used as a case study. However, the approach, methods, tools, and lessons learned presented are more general and can be re-used and adapted also to other parliamentary datasets in other countries in the future, and on an international level for, e.g., publishing and studying the speeches and other data of the European Parliament.
