Abstract
Introduction
Colorectal cancer (CRC) is a world-wide healthcare problem with high morbidity and mortality. It is the second leading cause of cancer-related death throughout the world. 1 The annual expenditures associated with CRC are substantial. They are estimated at 14 billion US dollars per year in the US alone. 2
Several factors make CRC suitable for population screening. These include high incidence, a prolonged disease development course, and effective endoscopic treatment options in premalignant stages. 3 CRC screening has therefore been adopted on a wide scale and has been incorporated into international guidelines.4,5 CRC screening has significantly reduced the incidence and mortality rates of the disease in the last two decades.6,7
Extensive research in the field of CRC screening has been published. These studies assessed various aspects, including available screening tools (endoscopic, fecal, radiologic, and blood tests), screening programs, high-risk populations, and cost-effectiveness. Manual summarization of this large amount of research data is both infeasible and impractical. When examining the literature in other medical fields, we note that endeavors have been made to identify research trends. These studies have been narrowed down to a defined number of cited articles, 8 to specific journals, 9 or to a limited number of years. 10
Current computational power and machine learning advancements have prompted a technique termed “text-mining.” This technique extracts information from texts using computational statistical methods. 11 Text-mining can be applied to identify trends and to investigate the dynamics in a research field.12–15
The aim of our study was to apply a text-mining technique to evaluate published literature for CRC screening in the last 25 years. We performed trend analysis to discover patterns in CRC screening publications.
Methods
Institutional review board (IRB) approval was granted for this study. The requirement for informed consent was waived by the IRB committee.
Search strategy
The US National Library of Medicine (NLM) produces an annual version of MEDLINE/PubMed data which is freely available for download.16–18 We used the 2018 MEDLINE/PubMed baseline dataset in this study. We retrieved all available MEDLINE/PubMed annual datasets from 1992 to the end of 2017 (25 years). Data lock and citation retrieval were performed on 1 August 2019.
Data processing
Data processing and results visualization were implemented in Python (version 3.6.5, 64-bit). We used the following open-source libraries: Pandas (version 0.24.2) for data handling, GeoPandas (version 0.4.1) for geographical visualization, SciPy (version 1.3.0), NLTK (version 3.4.4) for text handling, and Matplotlib (version 3.1.0) for results visualization.
For text-mining, each title, abstract, and the first author’s affiliation were tokenized. All punctuation and double spaces were removed, and each word became a single entry in a list.
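The tokenization step described above can be sketched as follows. This is a minimal illustration rather than the study's own code: the function name `tokenize` is hypothetical, and lowercasing and replacing punctuation with a space (so that hyphenated words split in two) are assumptions not stated in the text.

```python
import string

def tokenize(text):
    """Split text into a list of words with punctuation removed."""
    # Assumption: punctuation is replaced by a space, so "population-based"
    # becomes two tokens. Lowercasing is also an assumption.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = text.lower().translate(table)
    # str.split() with no argument collapses runs of whitespace,
    # which takes care of double spaces.
    return cleaned.split()

tokens = tokenize("Colorectal cancer screening:  a population-based study.")
print(tokens)
# ['colorectal', 'cancer', 'screening', 'a', 'population', 'based', 'study']
```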
Inclusion and exclusion criteria
To create a subset of papers relevant to our topic, we used Medical Subject Headings (MeSH). MeSH is a controlled vocabulary used for searching the PubMed database. 13 The following keywords were selected from MeSH to create the subset of articles relating to the colon and rectum: “colorectal,” “colon,” “rectal,” “rectum,” “colonic,” and “CRC.” These terms were matched to the tokenized title list and a subset of records was retrieved.
We then included papers which had one of the terms “cancer,” “carcinoma,” “adenocarcinoma,” “adenoma,” “polyp,” or “mass” in the abstract, and also one of the following terms: “screening,” “surveillance,” or “screen.” Abstracts shorter than 50 words were excluded.
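The inclusion and exclusion rules above amount to three keyword checks plus a length threshold. The sketch below shows one way to express them, assuming exact, case-insensitive token matching (so "polyps" would not match "polyp"); the text does not say whether stemming or substring matching was used, and the names `is_included`, `SITE_TERMS`, `LESION_TERMS`, and `SCREEN_TERMS` are hypothetical.

```python
import string

# Keyword lists taken from the text; the constant names are illustrative.
SITE_TERMS = {"colorectal", "colon", "rectal", "rectum", "colonic", "crc"}
LESION_TERMS = {"cancer", "carcinoma", "adenocarcinoma", "adenoma", "polyp", "mass"}
SCREEN_TERMS = {"screening", "surveillance", "screen"}
MIN_ABSTRACT_WORDS = 50

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

def is_included(title, abstract):
    """Return True if a record passes all inclusion and exclusion rules."""
    abstract_tokens = tokenize(abstract)
    # Exclusion: abstracts shorter than 50 words.
    if len(abstract_tokens) < MIN_ABSTRACT_WORDS:
        return False
    title_set = set(tokenize(title))
    abstract_set = set(abstract_tokens)
    # Inclusion: site term in the title, plus a lesion term and a
    # screening term in the abstract.
    return (
        bool(title_set & SITE_TERMS)
        and bool(abstract_set & LESION_TERMS)
        and bool(abstract_set & SCREEN_TERMS)
    )

# Illustrative 60-word abstract built by repetition.
long_abstract = "screening for colorectal cancer " * 15
print(is_included("Colon cancer outcomes", long_abstract))
print(is_included("Colon cancer outcomes", "Screening for cancer."))
```

Using sets makes each check a simple intersection, at the cost of ignoring how often a term appears.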
Data extraction
The following data were extracted from each of the included articles: PubMed unique article ID (PMID), title, journal, publication date (year and month), abstract text, article type (e.g. review, randomized controlled trial), article language, and authors (including the first author’s affiliation, if available). We then used a free-for-use application provided by the National Center for Biotechnology Information (NCBI) to retrieve the number of times each article was cited, based on its PMID. 19
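As an illustration of the field extraction listed above, the sketch below reads one record from an inline XML string shaped like the MEDLINE/PubMed citation format. The sample record is a placeholder, not real data, and the function name `parse_records` and the dictionary keys are hypothetical. Retrieval of citation counts from NCBI is not shown.

```python
import xml.etree.ElementTree as ET

# Placeholder record; values are invented for illustration only.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal>
          <JournalIssue>
            <PubDate><Year>2010</Year><Month>Mar</Month></PubDate>
          </JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>Example title on colorectal screening</ArticleTitle>
        <Abstract><AbstractText>Example abstract text.</AbstractText></Abstract>
        <AuthorList>
          <Author>
            <LastName>Doe</LastName>
            <AffiliationInfo>
              <Affiliation>Example University, Netherlands</Affiliation>
            </AffiliationInfo>
          </Author>
        </AuthorList>
        <Language>eng</Language>
        <PublicationTypeList>
          <PublicationType>Review</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def parse_records(xml_text):
    """Yield one dict of extracted fields per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("PubmedArticle"):
        citation = entry.find("MedlineCitation")
        article = citation.find("Article")
        # Only the first listed author is inspected for an affiliation.
        first_author = article.find("AuthorList/Author")
        affiliation = None
        if first_author is not None:
            affiliation = first_author.findtext("AffiliationInfo/Affiliation")
        yield {
            "pmid": citation.findtext("PMID"),
            "title": article.findtext("ArticleTitle"),
            "journal": article.findtext("Journal/Title"),
            "year": article.findtext("Journal/JournalIssue/PubDate/Year"),
            "month": article.findtext("Journal/JournalIssue/PubDate/Month"),
            # Structured abstracts have several AbstractText parts.
            "abstract": " ".join(
                part.text or "" for part in article.findall("Abstract/AbstractText")
            ),
            "types": [t.text for t in article.findall("PublicationTypeList/PublicationType")],
            "language": article.findtext("Language"),
            "first_affiliation": affiliation,
        }

records = list(parse_records(SAMPLE))
print(records[0]["pmid"], records[0]["types"])
```

Missing elements come back as `None` (or an empty string or list), which matches the "if available" wording for affiliations.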
The first author’s country was retrieved from the affiliation data, if available. The first author’s affiliation was compared with a country list extracted from the Geopandas library. We normalized the number of publications and the number of citations for each country according to its population by extracting the yearly population size of each country from the World Bank Catalog. 20
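The country assignment and population normalization described above can be sketched as follows. Case-insensitive substring matching against a list of country names is an assumption, as is expressing the normalized value per million inhabitants. The country list is a small stand-in for the GeoPandas list, and the counts and population sizes are illustrative placeholders, not World Bank figures. The function names are hypothetical.

```python
# Stand-in for the country list taken from GeoPandas in the study.
COUNTRIES = ["United States of America", "China", "Netherlands", "Israel"]

def find_country(affiliation, countries=COUNTRIES):
    """Return the first country name found in the affiliation, else None."""
    if not affiliation:
        return None
    text = affiliation.lower()
    for country in countries:
        if country.lower() in text:
            return country
    return None

def per_million(count, population):
    """Normalize a count by population size (per million inhabitants)."""
    return count * 1_000_000 / population

affiliation = "Department of Gastroenterology, Example University, Amsterdam, Netherlands"
country = find_country(affiliation)
print(country)

# Illustrative numbers only: 50 publications, population of 10 million.
print(per_million(50, 10_000_000))
```

Plain substring matching misses abbreviations and variant spellings such as "USA", so a real pipeline would likely need an alias table. The text does not say how such cases were handled.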
Topic modeling
All included studies were divided into topics using the following methodology. Each study’s title was analyzed after omitting stop words such as “the,” “a,” “an,” and “in,” which are listed in the NLTK (version 3.4.4) stopwords corpus. The 1000 most frequent two-word combinations in all titles were listed in descending order of frequency. A specialist gastroenterologist (KU) defined 10 topics in the field of CRC screening: screening and surveillance programs, risk stratification, non-invasive screening, epidemiology, inflammatory bowel disease (IBD) screening, quality assurance, racial disparities, treatment, quality of life, and cost-effectiveness.
Each word combination in the list was manually labeled as either non-specific or related to 1 of the 10 topics. Each study record was then matched to one of the 10 topics by comparing the words in the title with the topic list.
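The topic-modeling steps above (stop-word removal, two-word frequency ranking, manual labeling, and title matching) can be sketched as follows. The stop-word set is a small stand-in for the NLTK corpus, the example titles and the label table `BIGRAM_TOPIC` are hypothetical, and returning the first matching topic is an assumption, since the text does not say how titles matching several topics were resolved.

```python
from collections import Counter

# Small stand-in for the NLTK stop-word list.
STOP_WORDS = {"the", "a", "an", "in", "of", "for", "and", "on", "with", "to"}

def title_bigrams(title):
    """Return adjacent word pairs from a title after removing stop words."""
    words = [w for w in title.lower().split() if w not in STOP_WORDS]
    return list(zip(words, words[1:]))

# Hypothetical titles, already stripped of punctuation.
titles = [
    "Cost effectiveness of colonoscopy in screening",
    "Quality indicators for colonoscopy",
    "Cost effectiveness of fecal testing",
]

# Rank two-word combinations by frequency; the study kept the top 1000.
counts = Counter(bg for t in titles for bg in title_bigrams(t))
top_bigrams = counts.most_common(1000)
print(top_bigrams[0])
# (('cost', 'effectiveness'), 2)

# Manual labels: pairs judged non-specific are simply left out.
BIGRAM_TOPIC = {
    ("cost", "effectiveness"): "cost-effectiveness",
    ("quality", "indicators"): "quality assurance",
}

def assign_topic(title):
    """Return the topic of the first labeled bigram in the title, else None."""
    for bg in title_bigrams(title):
        if bg in BIGRAM_TOPIC:
            return BIGRAM_TOPIC[bg]
    return None

for t in titles:
    print(t, "->", assign_topic(t))
```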
Data analysis
We used Pearson correlation to evaluate normalized trends in topics for intervals of 5 years. We used univariable linear regression to evaluate country growth rate trends. The slope statistical significance is presented through the
For article type and country analysis, the citation rate was calculated by dividing the overall number of times articles were cited by the overall number of publications.
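The three quantities used in the analysis (Pearson correlation, the slope of a univariable linear regression, and the citation rate) can be written out directly. The sketch below uses only the standard library so that it runs without dependencies; it is not the study's own code, the function names are hypothetical, and the input numbers are illustrative rather than study results. Significance testing of the slope is omitted.

```python
import math

def _centered_sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the sums of centered cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    """Pearson correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Least-squares slope of y on x: Sxy / Sxx."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx

def citation_rate(total_citations, total_publications):
    """Overall citations divided by overall publications."""
    return total_citations / total_publications

# Illustrative 5-year window of yearly publication counts for one topic.
years = [2013, 2014, 2015, 2016, 2017]
counts = [10, 12, 14, 16, 18]
print(pearson_r(years, counts))   # perfectly linear series, so r is 1.0
print(ols_slope(years, counts))   # 2 additional publications per year
print(citation_rate(274, 10))     # illustrative: 274 citations, 10 papers
```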
Results
A total of 19,657,610 records were retrieved from the NLM database between 1992 and 2017. Of these, 14,119 publications were related to CRC screening. A flow diagram of the search is provided in Figure 1. Almost all papers were in the English language (93.5%).

Figure 1. Flow diagram of included studies using MEDLINE/PubMed search.
Time trend analysis
The number of annual publications relating to CRC screening increased between 1992 and 2014 (Figure 2), with a slight decline since 2014. The overall number of annual publications increased from 124 publications in 1992 to 992 publications in 2017.

Figure 2. Trends in the number of colorectal cancer screening publications from 1992 to 2017.
Article type analysis
MEDLINE/PubMed article type was specified for 2862/14,119 (20.3%) papers. Among those, 1429/2862 (50.0%) were review articles, 519/2862 (18.1%) were randomized controlled trials, and 412/2862 (14.4%) were multi-center studies. The article type with the highest citation rate (number of citations/number of publications) was guideline papers (69.2) followed by multi-center studies (27.4) and randomized controlled trials (27.3). Figure 3 shows the distribution of article types and their corresponding citation rate.

Figure 3. Distribution of colorectal cancer screening publications by article type, indicating (a) publication volume and (b) citation rate (i.e. number of citations per number of publications).
Country analysis
Publications on CRC screening originated from 97 countries, mainly from North America and Europe. The US had the highest number of papers (

Figure 4. Distribution of colorectal cancer screening publications by country, indicating (a) publication volume, (b) citation frequency, and (c) citation rate (i.e. number of citations per number of publications). The left axis presents absolute numbers and the right axis presents values normalized by country population size.
Figure 5 shows the growth rate in number of annual publications in respect to the country of origin, China (0.14,

Figure 5. World map indicating colorectal cancer screening publication growth rate by country. The color index represents the calculated growth rate. Countries with fewer than 100 overall publications were omitted from the growth rate analysis (shown in white).
Topic analysis
The most researched topic is “screening and surveillance programs” (38%). Yet, a continuous decrease in research attention is shown for this topic over the past 25 years (

Figure 6. Published article topic popularity and trends over time.
Most frequently cited articles
Table 1 lists the top 20 most cited articles published on CRC screening in the past 25 years. The mean number of citations per article is 98. The top 20 most cited articles were published in five journals, with the greatest number in
Table 1. The top 20 most frequently cited articles published on colorectal cancer screening in the past 25 years.
Discussion
In our study, we applied a text-mining approach to present an overview of 14,119 CRC screening publications over the past 25 years.
The number of CRC screening publications has increased over the years. In 2017, the number of published papers in CRC screening was eight times greater than in 1992. This increase in the number of published articles coincides with the general trend of increased global publications in the medical field. 41
Several factors may have contributed to this particular trend in CRC screening publications. This growth can be a result of the expansion of CRC screening programs and the implementation of population-based programs.42,43 The awareness of CRC screening is consistent with the worldwide endeavors that have focused on cancer prevention.44–46 Furthermore, the increase in CRC screening publications could be linked to the rise in CRC incidence, particularly in countries in Eastern Europe, Asia, and South America. 47
Another possible factor is the emergence of new technologies. For example, in 1994 computed tomographic colonography was introduced by Vining
Research in the field of CRC screening started with several seminal publications 25 years ago.3,22,25 These papers established the understanding that CRC screening can effectively reduce CRC mortality rate. They showed that colonoscopic polypectomy resulted in a lower-than-expected incidence of CRC and that annual fecal occult-blood test decreased mortality from CRC. These papers have likely promoted interest in CRC screening research and added momentum to the production of publications.
When analyzing article types, we found that guideline articles were the most frequently cited. Guidelines usually synthesize a large body of research that can influence the clinical setting. 49 Over the past 25 years, guidelines for CRC screening have been composed by professional groups and by panels of expert gastroenterologists. They offer recommendations to assist practitioners and patients in decisions regarding screening variables such as average-risk persons, high-risk family history, screening tools, and quality indicators.23,28,50,51 The beneficial effects of guidelines depend on their successful adoption in clinical settings. The high citation rate of CRC screening guidelines reflects their contribution to the field.
Most of the CRC screening research studies have been performed in North America and Europe. In these regions, greater resources are available and screening is more frequently implemented. 4 CRC incidence and the implementation of CRC screening differ among continents and countries. 4 The US leads in the number of publications and citations in CRC screening, which reflects the prominent role of the US gastroenterology community and its dominant position in international CRC screening research. The advancements of screening programs in the US can also be attributed to the endeavors of various national societies including the U.S. Preventive Services Task Force, American Cancer Society, American Gastroenterological Association, American Society for Gastrointestinal Endoscopy, and National Colorectal Cancer Roundtable. The extensive research in this field as well as the progression of screening programs in the US have resulted in a decrease in CRC incidence and mortality over the past two decades, as reported by the American Cancer Society. 52
Over the last few decades, there has been an increasing trend in CRC incidence and mortality in Asia. 53 We have demonstrated a high growth rate of CRC screening publications in China. The screening programs in this country are still relatively lacking.54,55 Hopefully, the rising trend in CRC screening publications can promote the understanding of the significance of screening, which will ultimately influence screening behavior in the general population.
In our study, we performed a text-mining analysis of two-word combinations. This allowed us to study “hot topics” in the field of CRC screening. Naturally, the most researched topic in the field of CRC screening was “screening and surveillance programs.” This topic has remained relatively stable over 25 years. We found that “quality assurance” was the most strongly trending topic over the last 5 years. This may help predict, to a certain extent, future trends in CRC publications. “Quality assurance” refers to optimization of the benefit-to-risk ratio of colonoscopy screening.36,56 Initially, research focused on the implementation of CRC screening programs but, with time, an emphasis has also been placed on the quality of screening.
The topic of “non-invasive tests” is a prominent subject with a slight non-significant increase in the number of studies during the last decade. New laboratory tests include DNA, RNA, and protein biomarker stool and blood tests. 57 Novel imaging tests include colon capsule endoscopy 58 and magnetic resonance colonography. 59 The focus of research on this topic can be attributed to the attempts to develop and implement non-invasive tests, thereby reducing the need for colonoscopy for low-risk populations.
Although a relatively small number of studies have focused on “racial disparities,” interest in race-related research increased in 2002 and has plateaued since then. Disparities in CRC screening are experienced by minority groups. Screening rates remain low for African Americans, Hispanics, and Asians.60–62 The research accumulated on “racial disparities” can promote effective interventions designed to decrease gaps in CRC screening.
The research topics of “screening among IBD patients” and “risk stratification” have declined over the years. This may indicate that a foundation for recommendations for screening high-risk groups has already been effectively formulated.
When observing the 20 most cited articles, we can note that these studies have been published in the top-ranking world medical journals. In total, 11 have been published in
Our research has several limitations. First, this is a comprehensive study that includes 25 years of research conducted in 97 countries. As such, it can only provide a representation of CRC screening research on a global level. Second, the citation frequency was extracted from data provided by NCBI, while other sources such as Google Scholar might have produced different results. Lastly, we used two-word combinations for topic modeling. Other approaches are available, such as latent Dirichlet allocation, but were found to be less effective in our study.
In conclusion, the number of publications devoted to CRC screening is steadily rising, with high-quality research reaching top-tier journals. The number of publications on the topic has surged in countries that were previously much less involved in academic research in the field. Screening programs remain the most researched topic, and quality indicators in screening colonoscopy have been attracting attention in recent years. A text-mining analysis of CRC screening research contributes to the understanding of current publication trends and topics. This technique has predictive value in illuminating future trends in CRC publications.
