Abstract
Why Apply Text Mining and Predictive Modeling to Quantitative Research on Digitalization in Aesthetic, Arts, and Cultural Education (QRD-ACE)?
This article focuses on harnessing big data methods for the synthesis of QRD-ACE as a pivotal field of informal education. Research syntheses involve the systematic gathering, screening, appraisal, and analysis of literature corpora. They are the method of choice for identifying hot spots and worthwhile avenues for future research. Beyond that, they enable the development of best practice examples and guidelines for informed decision-making in education and practice (Petticrew & Roberts, 2008). Research syntheses are widely accepted and conducted in a variety of research disciplines including medicine (Higgins & Green, 2011) and education (Hattie, Biggs, & Purdie, 1996; Hattie & Marsh, 1996). However, within educational research, their application has mainly focused on studies investigating issues of formal education, while studies on nonformal and informal education have seldom been the target of research syntheses (Lavranos, Kostagiolas, Korfiatis, & Papadatos, 2016; Takacs, Swart, & Bus, 2015).
Research syntheses of QRD-ACE are increasingly important, as it encompasses large parts of everyday life, especially leisure activities and media consumption. At the same time, research syntheses that transgress disciplinary boundaries are entirely lacking in this area. The field of QRD-ACE consists of various topics that are only loosely related, such as video games, music, art, or theater. Moreover, it is fragmented into various communities from different disciplines including information and communication technology, education, psychology, and sociology, which investigate similar research questions while using different terminologies and methodologies. This gives rise to transdisciplinary jingle-jangle fallacies (Kelley, 1927; Marsh, Craven, Hinkley, & Debus, 2003): There are various terms for phenomena relevant to QRD-ACE in different disciplines, and terms that are used for phenomena relevant to QRD-ACE in one field are used for irrelevant phenomena in other disciplines. This inflates the raw results of database searches to a degree that makes manual screening a very resource-intensive endeavor (Mulrow, 1994). Thus, the multiple disintegration of QRD-ACE severely hampers the realization of research syntheses (Kröner, Christ, & Penthin, 2019): It impedes the identification of hot spots of current research by obscuring them and blurs worthwhile avenues for future study (Petticrew & Roberts, 2008).
Fortunately, the study of literature corpora as a basis for quantitative research syntheses can tremendously benefit from the application of text mining (O’Mara-Eves, Thomas, McNaught, Miwa, & Ananiadou, 2015; Silge & Robinson, 2017) and predictive modeling (Gandomi & Haider, 2015), as explained in the following section. Thus, for the present article, we applied text mining and predictive modeling (Ramamohan, Vasantharao, Chakravarti, & Ratnam, 2012) to QRD-ACE as an example for identifying hot spots in broad, ill-defined research areas.
To begin with, the effects of digitalization on cultural activities will be outlined, followed by a closer look at the relevance of research syntheses and at the necessity of identifying gaps and foci of current research in the fragmented field of QRD-ACE. We will then present our iterative approach to the application of text mining and predictive modeling in literature reviews. Finally, after presenting the hot spots of QRD-ACE, we will discuss considerations for future applications of text mining and the conduct of research syntheses in fragmented research fields.
The Field to Be Mapped: Digitalization in Aesthetic, Arts, and Cultural Education (D-ACE)
While mapping existing research related to the field at the intersection of digitalization, culture, arts, and education, we started out with working definitions of these concepts.
While qualitative research on the aforementioned issues may be important for diverse purposes, we focused on existing
In conclusion, QRD-ACE covers studies on the effects of digitalization and those on digital tools related to self-regulated participation in cultural activities. This may happen both in formal and informal educational settings and may foster self-induced personal development. As QRD-ACE covers a multitude of cultural activities and various research disciplines, determining central findings and current research trends in this field is a nontrivial undertaking. Accordingly, the corpus of documents that are potentially related to the field of QRD-ACE may be considered as representing big data.
QRD-ACE as a Source of Big Data
QRD-ACE and the Three Vs of Big Data
Big data is defined as “extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions” (Big Data, 2019). It is characterized by the three Vs: volume, velocity, and variety (Laney, 2001). As we will show in further detail in the Method section, there are over 55,000 titles that are potentially relevant to QRD-ACE in the Scopus database (Elsevier, 2019), accounting for the aspect of volume. Moreover, high velocity results from the rapid growth of the body of research, triggered by the development and dissemination of aesthetically and culturally relevant entities including social media platforms (Angelovska, 2019) and virtual or augmented reality technology (Shih, 2015; Youm, Seo, & Kim, 2019). Finally, as already mentioned, the variety of QRD-ACE is considerable, due to its disciplinary fragmentation, which comes with jingle-jangle in terminology.
Text Mining and Predictive Modeling in Research Syntheses: The Promise
Among the first steps in the compilation of a synthesis is the identification of hot spots and worthwhile avenues for future research across disciplinary boundaries. Facing the task of analyzing voluminous, velocious, and variable big data from research databases, researchers may particularly benefit from applying text mining and predictive modeling (Gandomi & Haider, 2015; O’Mara-Eves et al., 2015; Silge & Robinson, 2017).
Text mining involves the quantification of natural language texts and may be utilized to prioritize noisy and unstructured text corpora. In combination with screening according to relevance (priority screening), its application to research synthesis promises to dramatically reduce the time required for manual screening (O’Mara-Eves, Thomas, McNaught, Miwa, & Ananiadou, 2015; Wu, Zhu, Wu, & Ding, 2014). It can reduce the screening workload by up to 70%, while losing less than 5% of relevant studies (O’Mara-Eves et al., 2015). While this loss may present a challenge to comprehensive research syntheses, it does not substantially affect the identification of hot spots, which merely requires finding large clusters of relevant studies, not every single publication.
Application of Text Mining and Predictive Modeling in Research Syntheses: Concepts and Procedures Explained
As mentioned above, text mining has been applied in various fields, mainly in medical research (O’Mara-Eves et al., 2015). However, we are not aware of any application within QRD-ACE, although this quite heterogeneous field should particularly benefit from text mining and predictive modeling: Their application promises to help classify and synthesize research on various phenomena across multiple fragmented disciplines and varying terminologies. The concepts and procedures relevant to text mining and predictive modeling are described in the following paragraphs.
Text Mining Statistics
The length of a document may be measured in word count (WC) and unique word count (UWC): For WC, multiple occurrences of words or word groups (e.g., of 2 or 3 words, i.e., bi- or trigrams) within a document are counted several times. For UWC, they are counted only once. To determine the relevance of each word in a document, term frequency,
The
Stop Words
Most words with high
Cleaning and Stemming
Cleaning increases the efficiency of subsequent analyses and reduces noise within a corpus. This can be achieved by deleting irrelevant strings, such as non-English titles or copyright statements at the end of abstracts, and removing all stop words within every text of a corpus. It is followed by stemming, which reduces different forms of a word to their common stem by removing common suffixes such as “-ing,” “-s,” “-er,” or “-al” (Silge & Robinson, 2017). For example, “culture,” “cultures,” and “cultural” are all reduced to “cultur*.” This facilitates subsequent quantitative analyses by massively reducing WC and UWC.
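As a minimal illustration, the cleaning and stemming steps can be sketched in Python. The stop-word list and the naive suffix-stripping stemmer below are invented stand-ins for the actual tooling (in practice, a full stemming algorithm such as Porter’s would be used):

```python
import re

# Hypothetical stop words and suffixes, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}
SUFFIXES = ("ing", "es", "s", "er", "al", "e")

def stem(word):
    """Crude suffix stripping: remove the longest matching suffix,
    keeping a stem of at least four letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def clean_and_stem(text):
    """Lowercase, keep alphabetic tokens, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

tokens = clean_and_stem("Cultures and cultural education in the digital age")
wc, uwc = len(tokens), len(set(tokens))  # word count vs. unique word count
```

As in the example above, “cultures” and “cultural” both collapse to the stem “cultur,” so UWC shrinks relative to WC.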
Significant Words
The cleaned and stemmed documents may be compared to lists of significant words that are known to be indicative of the facets of QRD-ACE. For an initial list of significant words, one may use the terms of the initial search query. Regarding QRD-ACE, there may indeed be one list for each of its four facets, that is, (a) digital; (b) aesthetic, arts, and culture; (c) education; and (d) quantitative research (cf. Online Appendix A). For the facet of “aesthetic/arts and culture” (AC), representing the letters A and C of QRD-ACE, significant words may be tagged not only to this facet but also to their respective AC subfacet, that is, to culture, visual arts, museum, music, performing arts, literature, photography, movies, or video games. For example, “games” may not only be assigned to AC but also to the AC subfacet “video games,” and “musical” may be assigned to AC as well as to two of its subfacets, that is, “performing arts” and “music.” Additionally, a second list of words indicative for the exclusion of a publication is helpful, the so-called negative significant words. Regarding QRD-ACE, this might contain terms such as “health,” “clinical,” “nurse,” or “engineering,” which may be flagged up from irrelevant papers identified during previous searches. Both the initial lists of positive and negative significant words may be further expanded throughout the analyses by adding words determined by the
Significance Scores
Based on the list of significant words for each (sub-)facet, the proportion of a document’s WC related to the (sub-)facets may be computed. This will result in multiple significance scores for each document, which may be utilized for both priority screening via predictive modeling and identification of hot spots.
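A significance score of this kind reduces to a proportion per (sub-)facet. A minimal Python sketch, with invented miniature word lists standing in for the lists in Online Appendix A:

```python
# Hypothetical stemmed significant-word lists per facet (illustrative only;
# the real lists are given in Online Appendix A).
SIGNIFICANT = {
    "digital":   {"digit", "onlin", "game"},
    "ac":        {"music", "museum", "game", "cultur"},
    "education": {"educ", "learn", "school"},
    "quant":     {"survey", "regress", "sampl"},
}

def significance_scores(tokens):
    """Proportion of a document's word count matching each facet list.
    A word may count toward several facets (e.g., 'game')."""
    wc = len(tokens)
    if wc == 0:
        return {facet: 0.0 for facet in SIGNIFICANT}
    return {facet: sum(t in words for t in tokens) / wc
            for facet, words in SIGNIFICANT.items()}

doc = ["digit", "game", "educ", "music", "outcom"]
scores = significance_scores(doc)  # one score per facet for this document
```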
Effective Screening via Predictive Modeling
The significance scores determined from text mining can be used as predictors for the binary screening decision of already screened documents (the “training set”) with logistic regressions. The regression weights gained from this process can be used to predict inclusion probabilities of so far unscreened search results (the “test set”). This process is called predictive modeling (Kwartler, 2017; Miner et al., 2012; Weiss, Indurkhya, Zhang, & Damerau, 2010). By sorting the test set by the resulting inclusion probability, manual screening can be prioritized (priority screening).
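The following Python sketch illustrates this idea on an invented toy training set, with a hand-rolled logistic regression fit by gradient descent (in practice, standard statistical software would be used for the regressions):

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression (intercept plus one weight per facet
    score) by batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted - observed
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def inclusion_probability(w, x):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy training set: two facet scores per screened document, plus the
# binary screening decision (1 = included).
X_train = [[0.4, 0.3], [0.1, 0.0], [0.5, 0.2], [0.0, 0.1]]
y_train = [1, 0, 1, 0]
w = fit_logistic(X_train, y_train)

# Rank unscreened documents by predicted inclusion probability,
# yielding the priority screening order.
X_test = {"doc_a": [0.45, 0.25], "doc_b": [0.05, 0.05]}
ranked = sorted(X_test, key=lambda d: inclusion_probability(w, X_test[d]),
                reverse=True)
```

Documents whose facet-score profile resembles that of already included documents float to the top of the screening queue.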
Identification of Hot Spots via Topic Modeling
Based on the procedure of Steyvers and Griffiths (2005), Griffiths and Steyvers (2004) showed that topic modeling is an efficient and appropriate method to identify hot spots within large literature corpora. The authors used abstracts from the
Research Goals
With this article, we set out to provide the basis for research syntheses on QRD-ACE, using text mining and predictive modeling to identify hot spots and possible avenues for further research. This will enable and inform future original studies as well as research syntheses from the perspective of QRD-ACE as an overarching concept, connecting various disciplines. Consequently, we want to answer the following research questions:
To answer these questions, we focused on publications from 2013 onward included in Scopus, which provides a sufficiently large source of bibliographic data for this purpose. Simultaneously, we adapted methods of text mining and predictive modeling to be utilized with bibliographic databases characterized by the three Vs of big data.
Method
In this section, we describe the sample of publications resulting from our literature search and the variables computed to characterize it, followed by outlining how we applied text mining and predictive modeling to devise a research synthesis on QRD-ACE, building on the concepts and procedures outlined in the QRD-ACE as a Source of Big Data section. This procedure involved the following four steps: (1) cleaning and stemming the literature corpus, (2) scoring all contained publications, (3) iteratively applying priority screening and predictive modeling, and (4) identifying hot spots via topic modeling (cf. Figure 1).

Figure 1. Flowchart of all methods applied, from Step 1 (Cleaning and Stemming) to Step 4 (Identification of Hot Spots).

Fit metrics of Gibbs sampling (
Corpus
We used
To conduct an appropriate search, we collected relevant concepts to be used as search terms for each of the four QRD-ACE facets based on the classification of aesthetics, culture, and education by Liebau et al. (2013), previous own research (Penthin, Christ, & Kröner, 2018), and screening of known prototypical examples from the literature. Resulting lists of terms for all four facets were expanded by adding synonyms from thesaurus.com, to take into account heterogeneous terminologies of various research communities. All search terms were either shortened using wildcards, (i.e., blog*, artist*, or music*), or—if the stems were too general—added in all relevant forms (such as “app” and “apps,” but not “application”; cf. Online Appendix A for the search query).
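For illustration, wildcard-shortened search terms of this kind behave like prefix patterns; a minimal Python sketch (the three terms are taken from the examples above, the title string is invented):

```python
import re

# Wildcard terms such as blog*, artist*, and music* act as prefix matches.
terms = ["blog", "artist", "music"]
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\w*",
                     re.IGNORECASE)

title = "Musical blogs by digital artists"
hits = pattern.findall(title)  # matches any word starting with a term
```

Note how “Musical” matches the stem music*, illustrating why overly general stems (such as “app*” matching “application”) had to be avoided.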
A Scopus search was applied to all publications from 2013 to 2017 from the subject areas “psychology,” “arts and humanities,” and “social sciences.” All resulting publications contained at least one search term related to each of the four facets of QRD-ACE (cf. Online Appendix A) in title, abstract, or keywords. This resulted in a literature
Variables
As described in detail in the Procedure section, the following three variables were computed for every publication as a whole as well as for every separate text object (i.e., title, abstract, source title, or keywords).
WC
WC was operationalized as the total count of all words in a text object, with each of multiple occurrences of a word counted separately.
UWC
UWC was operationalized as the number of different words in a text object, counting multiple occurrences only once.
Significance scores
Significance scores were operationalized as the proportion of significant words within a text object’s WC. We computed a score for each of the four facets of QRD-ACE: digitalization, AC, education, and quantitative research. These scores were summed up to an overarching
Procedure
Our aims differed from those in previous applications of big data methods for literature synthesis (Cohen, 2008; Cohen, Ambert, & McDonagh, 2009, 2012; Griffiths & Steyvers, 2004; Shemilt et al., 2014; Zhu, Liang, Li, Yu, & Liu, 2019). As opposed to previous studies, it was not our aim to identify all documents related to a specific research question in a narrow research area (Cohen, 2008; Cohen et al., 2009, 2012; Shemilt et al., 2014). Neither did we aim at assessing or increasing the efficiency of applying a document classification approach to an already screened and categorized literature corpus (Cohen, 2008; Cohen et al., 2009, 2012) or at identifying hot spots across the entirety of a literature corpus (Griffiths & Steyvers, 2004; Zhu et al., 2019). Rather, our goal was to identify hot spots of current research within a large, ill-defined, and fragmented multidisciplinary research area. Therefore, we aimed at first identifying documents relevant to our research question and then moved forward to identify the aforementioned hot spots. If at all, the literature corpus can be considered homogeneous only regarding the relationship of its constituents to digitalization, education, and quantitative research. However, it was highly heterogeneous regarding AC. Thus, while the approach utilized in our study is similar to that applied to document classification by Shemilt et al. (2014), it was necessary to develop an explorative procedure to cope with our heterogeneous corpus using building blocks from the related literature. We present this approach in the subsequent paragraphs.
Step 1: Cleaning, Stemming, and Determining Initial Significant Words and Stop Words
Prior to further analyses, we
To facilitate subsequent analyses, we drastically reduced the number of unique words via cleaning, stemming, and removing all words that were not part of our list of significant words and also occurred less than 5 times within all text objects (for results of cleaning, cf. Online Appendix B).
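This pruning rule (keep a word if it is on the significant-word list or occurs at least 5 times overall) can be sketched as follows; the word lists and documents are invented for illustration:

```python
from collections import Counter

MIN_FREQ = 5                               # threshold used in the text
SIGNIFICANT = {"cultur", "digit", "educ"}  # hypothetical significant words

def prune(docs):
    """Drop tokens that are neither on the significant-word list nor
    occur at least MIN_FREQ times across all text objects."""
    freq = Counter(token for doc in docs for token in doc)
    keep = lambda t: t in SIGNIFICANT or freq[t] >= MIN_FREQ
    return [[t for t in doc if keep(t)] for doc in docs]

docs = [["cultur", "rare"],
        ["digit", "common"] * 3,
        ["common", "common"]]
pruned = prune(docs)  # "rare" is dropped; "common" (5 occurrences) survives
```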
Initial Stop Words and Significant Words From Prior Research
As an initial list of
Refinement of Stop Words and Significant Words via Text Mining Statistics
To refine both the initial list of stop words and of significant words, we temporarily excluded all already identified words from both lists from all text objects. Then, we computed WC,
Step 2: Significance Scoring
In this step, all text objects were scored (a) according to their relevance to the four facets of QRD-ACE and (b) according to the AC subfacets (i.e., museum, music, etc.). This
Step 3: Iterative Procedure of Priority Screening and Predictive Modeling
Previous research, as well as preliminary literature searches and precursory analyses of the literature corpus, suggested that the largest fraction of potentially relevant documents was related to the AC subfacets of social media and video games. Therefore, random sampling of documents for screening would have led to almost exclusively screening papers related to those large subfacets. This would have biased the predictive modeling approach toward identifying documents about video games and social media, as the training set would have almost exclusively contained documents from those subfacets. Thus, we decided not only to screen documents with the highest positive overall scores but also a substantial number of documents scoring high on each AC subfacet. This resulted in a training set involving the
Prioritizing Publications for Title Screening
Predictive modeling
We used the already
Those regression weights were then used for all documents in the test set to predict the inclusion probability of each document not yet subjected to priority screening. Utilizing the predictors on the (sub-)facet level enabled us to identify documents for priority screening based on the documents within the test set that feature similar facet weights.
A cutoff in inclusion probability was used to select eligible publications for subsequent priority screening processes, while limiting their number to manageable amounts. This resulted in the
Priority screening
During
Inclusion and Exclusion Criteria for the Priority Screening Processes.
In ambiguous cases, a coarse abstract screening was done as well. The main facets of QRD-ACE (digitalization, aesthetic/arts and culture, education, and quantitative methods) had to be stated explicitly for a publication to be
Expansion of Training Sample and Refinement of Significant Words
Based on the priority screening process, we (a) expanded the training sample by all documents screened during the priority screening process and (b) identified additional significant words for all facets and for the subfacets of AC. To start with, we split the manually screened publications of this iteration into two groups according to the screening decision. We computed the log-ratio of
Step 4: Identification of Hot Spots via Descriptive Analyses and Topic Modeling
To determine hot spots of current QRD-ACE, we first conducted descriptive analyses of the frequencies of included publications for each facet of QRD-ACE and for each subfacet of AC, to assess whether their distribution already allowed the identification of research hot spots. Second, we analyzed the most common words, bigrams, and trigrams within our corpus.
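Counting the most common bigrams and trigrams reduces to sliding a window over each token list; a minimal sketch with invented toy documents:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(docs, n, k=10):
    """The k most common n-grams across a corpus of tokenized documents."""
    counts = Counter(gram for doc in docs for gram in ngrams(doc, n))
    return counts.most_common(k)

docs = [["video", "game", "play"],
        ["video", "game", "learn"],
        ["social", "media", "use"]]
top_bigrams = top_ngrams(docs, 2, k=3)  # "video game" leads with count 2
```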
Subsequently, we applied
Results
Descriptives
WC and UWC before and after cleaning and stemming
The cleaning processes provided drastic reductions in document WC and UWC (WC: 51.37%, UWC: 90.89%), which were reflected in similar changes in abstracts (WC: 53.43%, UWC: 90.38%). This resulted in
Stop words and significant words and their refinement
We started with
Papers Included Through the Iterative Process
Table 2 shows how the number of documents selected for priority screening, the number of documents included, and the inclusion ratio developed over the nine iterations. From the
Together with the initially selected “top,” “flop,” and random publications, a total of
Numbers of Publications Identified, Screened, and Included in the Priority Screening Processes, With Inclusion Ratio, McFadden’s
a Less than
To ensure we did not miss a substantial amount of publications in our corpus by terminating the iterative procedure, we screened another
In total, we screened approximately 15% (
Identification of Hot Spots (Step 4)
To determine hot spots of current research on D-ACE, we inspected the proportion of documents that featured significance scores larger than zero for all AC subfacets. We found that video games were a striking hot spot of current quantitative research on D-ACE, as 50.1% of all included publications contained words significant to this AC subfacet, followed by the substantially smaller AC subfacet culture, with only 19.9% of all publications containing words significant to it (cf. Online Appendix E).
The identification of video games as a hot spot of current research was supported by the analyses of bi- and trigrams. Nine of the top 10 bigrams and 8 of the top 10 trigrams consisted of words indicating the subfacet of video games (cf. Table 3).
Top 10 Most Common Bigrams and Trigrams in
Hot spots of QRD-ACE beyond video games
To determine additional hot spots of current quantitative research on D-ACE, we excluded all publications containing significant words for the subfacet of video games from the included documents, which resulted in
Top 10 Most Common Bigrams and Trigrams in
To empirically relate the
To determine the best suited number of topics within our corpus, we inspected all four potentially fitting topic model solutions by analyzing the top
As we aimed at modeling topics at the interface of digital and AC activities, we settled for
Top 10 Ranked Words by Word–Topic Probability β of Each of
a Social media and social network are fixed terms that have been grouped together via parsing.
Discussion
Conclusion
RQ1: Hot spots
We were able to identify two major hot spots of QRD-ACE within the
RQ2: Significance of QRD-ACE facets beyond hot spots
We identified a substantial amount of research on digitalization in the AC activity subfacets related to museums, literature, music, art, and film. However, these subfacets were represented by far fewer studies than the aforementioned hot spots. This difference in cluster size comes as little surprise, as research on video games and social media will hardly be unrelated to digitalization, in contrast to other AC activities that can, and frequently do, occur in nondigital settings. While it may be safely assumed that there are sufficient original articles for research syntheses on specific questions related to the identified hot spots, this may be different for the other topics: The smaller topics may also be assumed to consist of studies investigating various research questions. Thus, in spite of being related to a substantial number of identified studies, topic-specific research syntheses may still not be feasible.
Applying the big data methods of text mining and predictive modeling to cope with literature on QRD-ACE
In the present article, text mining and predictive modeling turned out to be efficient tools for screening documents and for detecting underlying structures of the included publications, even in ill-defined, heterogeneous, and fragmented research fields such as QRD-ACE. We can safely assume screening just the
Limitations
This article reached its twofold aim of identifying hot spots of QRD-ACE and providing further insights into the application of text mining algorithms for research synthesis. Nevertheless, several limitations have to be considered. While all research syntheses have to cope with noisy corpora, this was a particular issue in our approach of identifying hot spots in an ill-defined and fragmented research area: Due to our broad search query with numerous search terms, the extracted corpus may contain more noise than other data sets used for text mining. However, we addressed this by thoroughly inspecting potentially significant words and excluded several frequent
For the predictive modeling approach, we had initially assumed that the mean document significance scores would be sufficient to identify enough documents for priority screening throughout the whole procedure. Only after switching to text object-specific significance scores in the sixth iteration did it become apparent that further relevant documents could be identified that would otherwise have been overlooked. Regardless, documents relevant to QRD-ACE according to the mean document significance scores would have been identified via the text object-specific significance scores anyway. Thus, while applying text object-specific significance scores halfway through the iterative procedure might have affected the order in which documents were screened, it can be assumed to be neutral with regard to the results. Nevertheless, for the sake of clarity, we suggest that future studies utilizing predictive modeling for study selection apply text object-specific significance scores right from the start.
Concerning the resulting selection of documents, it has to be assumed that there are more than
Implications for Future Research on D-ACE
Given the skewed distribution of existing studies, with many publications on video games and AC-related social media activities, future QRD studies regarding education in classical facets of AC activities, such as performing and visual arts, museums, or music, are warranted. These studies should go beyond the topic of social media activities, for example, by including issues like post-Internet arts or virtual museums (Jörissen et al., 2019).
By applying big data tools, we were able to identify many studies from disciplines that are peripheral to the field of QRD-ACE and its pivotal theoretical approaches. This may enrich the discussion on both available evidence and research gaps in the field, and it provides the groundwork for quantitative ACE-related systematic reviews on specific questions on the identified hot spots. Building on the present study, such syntheses may consider the phenomena investigated in the identified studies from a solid, theoretical, ACE-focused basis. This may help to avoid the pitfalls of being distracted by superficial technological aspects, a shortfall of many studies researching digital tools like learning/educational games, music software, or digital museums. This way, future quantitative syntheses on hot spots of QRD-ACE may prevent people in education, practice, and policy from sticking to surface features and having to reinvent the wheel with every small technological change.
Supplemental Material
Supplemental material for Big Data and Digital Aesthetic, Arts, and Cultural Education: Hot Spots of Current Quantitative Research, by Alexander Christ, Marcus Penthin, and Stephan Kröner, in Social Science Computer Review.