Abstract
Keywords
Introduction
These days, many of our daily activities are digitized and recorded in some form. Analysis of such data presents the opportunity to gain insights into our daily activities, including health-related ones. Digitized records pertaining to health-related information can originate from various sources. The primary source of such information has been software systems designed to collect health information, such as electronic medical records. There has been vibrant data/text mining research on such formal health information.1,2 Although not created for collecting health information, blogs and other social media have recently emerged as a unique complementary source of health-related information.3–6 In this study, we focus on health-related information in consumer product reviews, which remain an underexplored source of such information.
Many online vendors collect product ratings and reviews, which can serve as word-of-mouth advertising as well as feedback to the product manufacturer or merchant. As online consumer spending continues to grow,7 the amount of consumer-generated reviews has also increased. Given the variety of goods sold online, it is not uncommon for consumer reviews to discuss health-related issues. For example, a consumer may write a review about an adverse effect caused by the product or a justification for choosing the product to avoid or alleviate a health issue, eg, "[This product is a] Major migraine trigger!" or "It's supposed to help literally pull gingivitis out." The number of reviews can easily reach millions on prominent retail sites, such as Amazon.com. Owing to advancements in software and hardware, processing a huge dataset, which used to take weeks or months, can now be completed in hours or even in real time.
We were motivated to investigate health-related information in consumer product reviews based on two assumptions: (1) Given the vast amount of reviews available, we should be able to collect infrequent but valuable pieces of information by leveraging efficient big data techniques; (2) Even if the health issues mentioned in product reviews are not novel discoveries in themselves, it is still useful to summarize the different types and aspects of illness discussed on consumer products – ideally with discovering incidences/patterns to complement formally collected health data.
In this exploratory study, we conducted quantitative and qualitative analyses of the types of health issues found in consumer product reviews. We processed 1.3 million Amazon.com reviews on Grocery and Gourmet Food products using a scalable natural language processing (NLP) system based on Apache Unstructured Information Management Architecture (UIMA)8 Asynchronous Scaleout (UIMA-AS). A subset of the extracted concepts was manually reviewed and annotated as relevant or irrelevant to health-related issues. With this dataset, a machine learning classifier was trained using Apache Spark9 for screening the relevant reviews. Descriptive statistics and manual inspection were used to analyze the results.
The three deliverables from this study were as follows: (1) quantitative and qualitative analysis on the types of health issues found in the reviews; (2) a machine learning classifier that can screen for reviews containing health-related issues; and (3) insights about the task characteristics and challenges for text analytics that will guide future research. In terms of practical impact, the study contributes to biomedical informatics by exploring the value of consumer product reviews as a complementary information source for the purpose of public health monitoring.
Background
An increasing number of studies address mining health-related information from nontraditional data sources. For example, Corley et al analyzed over two years of blog posts to detect influenza epidemic signals.3 Ofoghi et al investigated the classification of emotions expressed in Twitter posts for disease outbreak detection and monitoring.4 Aphinyanaphongs et al applied text classification to detect e-cigarette use and smoking cessation in Twitter.5 Sarker et al6 conducted a literature survey of studies exploiting social media data for pharmacovigilance. Several publications have also focused on analyzing consumer reviews, covering automated summarization,10 opinion/sentiment analysis,11 and evaluation of the helpfulness of reviews.12
To our knowledge, however, few studies have focused on consumer product reviews as a source for mining health-related information. A recent study by Sullivan et al investigated adverse reactions to dietary supplements reported in Amazon user reviews of nutritional supplements.13 The results suggest that product reviews can be an information source for monitoring adverse reactions to dietary supplements. Extrapolating these findings, mining a large set of reviews across many products can provide information on diverse health-related issues. Until recently, mining information from a large set of product reviews was difficult because of hardware and software limitations. Significant advancements in scalable analytics offer unprecedented computing efficiency at moderate cost14–16 and enable large-scale data mining.
Material and Tools
Amazon reviews
Amazon.com is one of the major online retailers in the United States. In 2015, it had more than 304 million active customer accounts and $107 billion in net sales.17 Product reviews provided by customers on the Amazon website contain highly valuable information. A recent CNET article notes: "Customer reviews have been a crucial part of Amazon's websites for over 20 years, with the written reviews and 5-star rating system becoming an important form of accountability and sign of popularity and quality for items buyers often cannot touch or test out before purchasing." a
In this study, we used customer product reviews on the Amazon.com website that had been previously obtained by McAuley et al.18 The dataset was originally collected for data mining studies and then shared with the research community. The original collection contains 143.7 million product reviews posted between May 1996 and July 2014. The reviewed products, and hence the corresponding review text, are divided into 24 high-level categories, such as Books, Electronics, and Movies and TV, as well as Grocery and Gourmet Food. The review and product information are stored in JSON19 format, with fields containing review text, review date, product name, and product category, among other information. In our study, we used the category Grocery and Gourmet Food (or grocery for short).
nQuiry and UIMA-AS
The nQuiry system is a comprehensive NLP pipeline developed by the Medical Informatics team of the Kaiser Permanente Southern California Medical Group. The nQuiry system uses Apache UIMA8 and allows flexible decomposition of NLP tasks into modules. The core modules of the nQuiry system include tokenization, typo correction, sentence chunking, part-of-speech tagging, syntactic parsing, phrase extraction, concept candidate selection, concept searching, sense disambiguation, and negation/modality classification. The nQuiry system has been incorporated into several applications: automated diagnosis coding, evaluation and management coding, risk screening for aortic aneurysm, and cardiovascular risk factor identification.20 The nQuiry deployment leverages the UIMA-AS framework. Multiple nQuiry processes are launched as UIMA-AS service instances on server machines and can be used in parallel to handle a large collection of input text submitted by UIMA-AS clients; load balancing is handled by the asynchronous middleware using the Apache ActiveMQ21 implementation of Java Message Service. This framework achieves high throughput by scaling out the workload linearly.
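The client/broker scale-out pattern described above can be sketched in miniature with a thread pool standing in for the UIMA-AS service instances. This is a simplified illustration only: `fake_nlp_service` is a hypothetical stand-in, and a real deployment routes documents through an ActiveMQ broker to distributed service instances.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_nlp_service(doc):
    """Hypothetical stand-in for one nQuiry service instance.

    A real deployment would submit each document to a UIMA-AS service
    over the message broker; here we just lowercase and tokenize."""
    return doc.lower().split()

def process_collection(docs, num_instances=4):
    """Fan a document collection out across parallel workers, mirroring
    the UIMA-AS client/broker scale-out pattern (order is preserved)."""
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        return list(pool.map(fake_nlp_service, docs))

results = process_collection(["Major migraine trigger!", "Helps my gingivitis"])
```

Because the workers are independent, throughput grows roughly linearly with the number of service instances, which is the property the UIMA-AS deployment exploits.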
Apache Spark and MLlib
Big data technologies, such as Apache Hadoop22 and Pig,23 are powerful and convenient platforms for handling large datasets. We used Pig to build our post-NLP analytic pipeline. In Pig, data flows are described in the Pig Latin language, which is then translated into MapReduce24 jobs that exploit data parallelism. To facilitate analyzing the millions of reviews processed by the nQuiry system, we used the Hadoop SequenceFile format, a flat file consisting of binary key/value pairs. In our case, we extracted filename/entities as the key/value pairs from the large number of outputs generated by the nQuiry system and aggregated them into several SequenceFiles, which could then be handled easily by calling user-defined functions in a Pig Latin script.
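The key/value aggregation step can be illustrated with a small pure-Python sketch. The filenames and entities below are hypothetical; the actual pipeline performed the equivalent grouping at scale with Pig user-defined functions over SequenceFiles.

```python
from collections import defaultdict

# Hypothetical filename/entity pairs, analogous to the key/value records
# extracted from nQuiry outputs into Hadoop SequenceFiles.
records = [
    ("review_001.json", "migraine"),
    ("review_001.json", "gingivitis"),
    ("review_002.json", "migraine"),
]

def group_by_entity(pairs):
    """Group review files by extracted entity, mimicking a Pig GROUP BY."""
    grouped = defaultdict(set)
    for filename, entity in pairs:
        grouped[entity].add(filename)
    return grouped

def entity_counts(pairs):
    """Count how many distinct reviews mention each entity."""
    return {entity: len(files) for entity, files in group_by_entity(pairs).items()}
```

In Pig, the same logic is a GROUP BY on the entity key followed by a COUNT, which MapReduce parallelizes across the review collection.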
http://www.cnet.com/news/amazon-updates-customer-reviews-with-new-machine-learning-platform/
Apache Spark25 is an open-source cluster computing framework that employs in-memory primitives for performance. The Resilient Distributed Dataset (RDD) is the key programming abstraction in Spark: essentially a logical collection of data partitioned across machines that can be manipulated in parallel. Spark MLlib is a distributed machine learning framework on top of Spark Core. MLlib provides common learning algorithms and utilities, including the classification algorithms we used in our study.
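To convey the RDD idea without a Spark cluster, here is a toy in-memory analogue. `MiniRDD` is purely illustrative: real RDDs are distributed across machines, lazily evaluated, and fault tolerant, none of which this sketch attempts.

```python
class MiniRDD:
    """Toy illustration of the RDD abstraction: a logical collection
    split into partitions, manipulated via map/filter/collect.
    (Partition sizing here is approximate, unlike real Spark.)"""

    def __init__(self, data, num_partitions=2):
        n = max(1, len(data) // num_partitions)
        self.partitions = [data[i:i + n] for i in range(0, len(data), n)]

    def map(self, f):
        out = MiniRDD([], 1)
        out.partitions = [[f(x) for x in p] for p in self.partitions]
        return out

    def filter(self, pred):
        out = MiniRDD([], 1)
        out.partitions = [[x for x in p if pred(x)] for p in self.partitions]
        return out

    def collect(self):
        return [x for p in self.partitions for x in p]

rdd = MiniRDD(list(range(10)), num_partitions=3)
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
```

Each `map` and `filter` operates on every partition independently, which is what lets Spark execute them in parallel across a cluster.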
Methods
Data sampling and preprocessing
As an exploratory study, we chose the grocery products from the Amazon product categories available in the aforementioned dataset. This subset contains 1.3 million reviews covering diverse product types, ranging from drinks and snacks to dietary supplements. We processed the reviews with the nQuiry system, followed by additional filters to narrow down the selection of reviews containing health-related issues. Figure 1 shows an overview of the data processing workflow.

Figure 1. An overview of the data processing workflow.
The following modules of the nQuiry system were essential in the first step of the workflow: tokenization, sentence chunking, part-of-speech tagging, syntactic parsing, phrase extraction, concept candidate selection, and concept searching. Specifically, the phrases extracted by the upstream modules were screened by the concept searching module. After applying the nQuiry system, the detected concept phrases were additionally filtered by semantic type to retain those representing diseases or symptoms.
Corpus annotation
The concept phrases produced by the NLP and semantic type filtering were known to still contain many false extractions. To further weed out the false extractions, we manually annotated a subset of the concept phrases to train a machine learning classifier that could determine whether a phrase really represented a health-related issue (Fig. 1). Since each review can contain more than one possible health issue, the annotation was performed on phrases rather than on sentences or reviews. Classification at the phrase level is more granular and informative, and it can easily be interpreted at the review level if needed; that is, any review text that contains at least one relevant phrase can be regarded as health related.
Three of the authors (MT, SD, and JF) each read through ~1,700 phrases (5,077 phrases in total) with the given surrounding contexts and classified each phrase into one of two classes: relevant or irrelevant to health-related issues.
After the entire dataset was annotated, 100 phrases annotated by each of the three annotators were re-annotated by the other two annotators (50 phrases each), and a total of 150 phrases were thus doubly annotated so that the agreement of the annotator pairs could be estimated. The agreement rate calculated for pairs of annotators was a mean Cohen's kappa of 0.751, which might be considered substantial agreement.
As a byproduct of the annotation, we identified phrases that were not of interest but were frequently detected by the nQuiry system, and we incorporated ad hoc filters for them into the nQuiry postprocessor, for example, filtering out terms whose occurrences were predominantly irrelevant to health issues.
Classifier training and testing
A machine learning classifier was developed to determine the health relevance of candidate concept phrases identified by the nQuiry system. A Pig-based pipeline was created to process each phrase along with its context, generate the feature vector, and train a classifier. Specifically, we used MLlib to train a logistic regression model with standard feature scaling and L2 regularization. We initially tested classification algorithms other than logistic regression, such as support vector machines and multinomial and Bernoulli naïve Bayes, which are commonly used for text classification. Observing that our preliminary tests conformed to previously reported results on these methods (eg, logistic regression and support vector machines yielded comparable performance, and both outperformed naïve Bayes29), we decided to use logistic regression, which can provide well-calibrated probability scores for the predicted classes. When training the logistic regression model, we used the default parameters in MLlib and employed basic classification features, as discussed next, so that we could first establish a general workflow for analyzing diverse reviews.
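As a rough illustration of the model family used, the following pure-Python sketch trains an L2-regularized logistic regression by batch gradient descent on toy one-dimensional data. The study itself trained the model with Spark MLlib; the data and hyperparameters here are arbitrary choices for the example.

```python
import math

def train_logreg(X, y, lr=0.5, reg=0.01, epochs=500):
    """L2-regularized logistic regression via batch gradient descent.

    Toy hyperparameters chosen for the tiny example below; the actual
    study used Spark MLlib's default training parameters."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        gw = [reg * wj for wj in w]  # gradient of the L2 penalty term
        gb = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict_proba(w, b, x):
    """Predicted probability of the relevant class."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: instances with feature value >= 2 are "relevant".
w, b = train_logreg([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

The probability output is what makes logistic regression attractive here: thresholding it at 0.5 gives a class decision, while the raw score supports the confidence stratification used later.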
As classification features, bags of words were used. This general approach was selected based on our experience in training similar classification models, specifically the classifiers used in the nQuiry system for negation/modality detection.30 Three bags of words were created from the within-sentence context: (1) words to the left of the target phrase, (2) words to the right of the phrase, and (3) words in the phrase itself. A trained classifier predicts the relevance class with a probability score, and a threshold of 0.5 was typically used to separate the relevant class from the irrelevant class.
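The three bags of words can be constructed as follows. This is a minimal sketch: the `L:`/`P:`/`R:` prefixes are our illustrative convention for keeping the bags distinct, not necessarily the study's exact feature encoding.

```python
def context_bags(tokens, phrase_start, phrase_end):
    """Build the three bags of words for a target phrase in a sentence:
    left context, the phrase itself, and right context. Tokens are
    prefixed with their bag so the same word in different positions
    maps to distinct features."""
    left = tokens[:phrase_start]
    phrase = tokens[phrase_start:phrase_end]
    right = tokens[phrase_end:]
    return (["L:" + t for t in left]
            + ["P:" + t for t in phrase]
            + ["R:" + t for t in right])

tokens = "this tea is a major migraine trigger for me".split()
features = context_bags(tokens, 5, 6)  # target phrase: "migraine"
```

The resulting feature strings would then be hashed or indexed into the sparse vector fed to the classifier.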
To study the sufficiency of training data size and also the appropriate ratio of relevant to irrelevant instances in the training dataset, we conducted two experiments with the classifier:
We varied the number of annotated instances in the training set to examine the impact of training data size, ie, how the accuracy would improve as the training data size increased. In particular, we were interested in whether the improvement would hit a plateau at a certain point, which would indicate that adding more annotations would not significantly improve accuracy.
We varied (down-sampled) the number of irrelevant instances in the training data and examined how the ratio of the two classes would affect classification accuracy. Note that in this experiment each classifier might have a different optimal probability threshold, given that the classifiers were trained with differently manipulated class distributions. Therefore, we computed precision-recall curves as a more objective evaluation of the tradeoffs.
To obtain robust accuracy estimates in the experiments, we performed resampling evaluation tests, in which we randomly split the annotated training instances into training (80%) and testing (20%) subsets. The process was repeated 50 times to compute averaged precision and recall:
Precision = True positive/(True positive + False positive)
Recall = True positive/(True positive + False negative)
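The metrics above and one 80/20 resampling split can be expressed directly. This is a minimal sketch; the study repeated the split 50 times and averaged the resulting metrics.

```python
import random

def precision_recall(y_true, y_pred):
    """Compute precision and recall from binary labels, matching the
    definitions above (positive class = relevant = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def random_split(instances, train_frac=0.8, seed=None):
    """One resampling split; repeating this (50 times in the study)
    and averaging the metrics yields the robust estimates."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Averaging over many random splits reduces the variance that a single train/test partition would introduce on a dataset of ~5,000 instances.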
After the above two experiments, we trained a final classifier using all the annotated data with the best ratio of relevant to irrelevant instances, as reported in the "Results" section. The final classifier was applied to screen for relevant concept phrases in a large unseen dataset of reviews processed by the nQuiry system. As in evaluating information retrieval systems, it was impractical to calculate the recall of the developed workflow on this large unseen dataset; moreover, our interest in this study was to discover and examine health-related information buried in customer review data. Therefore, on this large unseen dataset, we focused on the precision of extracting health-related information among the cases judged as relevant according to classifier confidence. Instances assigned a prediction probability of 1.0 were collected, and 100 samples among them were manually reviewed. To further examine the classification results, the predictions were divided into six probability bins: [0.0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), [0.8,1.0), and [1.0,1.0]. The precision of each probability stratum was estimated by manually reviewing 200 samples and annotating each as either a true positive or a false positive.
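One plausible implementation of this probability stratification is sketched below, assuming six strata with the singleton bin [1.0,1.0] at the top; the manual sampling of reviewed instances per stratum is omitted.

```python
def bin_index(p, edges=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Map a prediction probability to one of six strata:
    [0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), [0.8,1.0), and [1.0,1.0]."""
    if p >= 1.0:
        return 5
    for i, edge in enumerate(edges):
        if p < edge:
            return i
    return 4  # defensive fallback; unreachable for p in [0, 1]

def precision_per_bin(probs, labels):
    """Fraction of confirmed true positives per stratum (the study
    estimated this from manually reviewed samples in each bin)."""
    counts = [[0, 0] for _ in range(6)]  # [true positives, total] per bin
    for p, y in zip(probs, labels):
        b = bin_index(p)
        counts[b][1] += 1
        counts[b][0] += y
    return [tp / total if total else None for tp, total in counts]
```

Comparing the per-bin precision against the bin's probability range shows how well the classifier's confidence scores are calibrated.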
Content analysis of the reviews and predictions
To quantitatively summarize the health-related issues, we computed the most frequently mentioned diseases/symptoms in the reviews. For each major disease/symptom, a couple of the top associated product types were also listed to demonstrate the potential value of identifying such relations. To qualitatively analyze the review contents, we examined 100 randomly sampled phrases manually annotated as relevant and categorized them based on the nature of the health issues. We also reviewed false positive disease/symptom extractions by the NLP system to identify challenges and opportunities for future improvement.
Results
Classifier tuning and prediction accuracy
The experiments were conducted using the manually annotated dataset of 5,077 phrases/instances, of which 35% were annotated as relevant. The results of the three major evaluations are summarized as follows:
Effect of training data size
The first experiment (Fig. 2) showed that the

Figure 2. Classifier performance for different training data sizes.
Effect of class ratio
The second experiment (Fig. 3) showed that slightly better performance could be achieved when using training data with an irrelevant-to-relevant ratio of 2:1.

Figure 3. Classifier performance for different ratios of irrelevant to relevant instances.
Precision of the fully trained classifier
The precision of the final classifier per probability stratum is shown in Figure 4. Overall, the classifier learned reasonably well to generate prediction probabilities that correlated with the true precision: the highest probability bin had a precision of ~82%, while the lowest probability bin had only ~37%.

Figure 4. Fraction of relevant instances per bin (score range) on an unseen dataset.
Health-related issues found in the reviews
The most frequent health issues observed in the reviews are summarized in Table 1. While these health issues were frequent both in the fully annotated dataset and in the large dataset used for the final experiment, the frequency counts shown in the table were obtained from the former, where the health issues were manually confirmed. The top issue
Table 1. Frequent health issues in the customer reviews.
Table 2. Examples of product types per health issue.
Table 3. Categories of health issues in the grocery reviews.

Figure 5. Proportions of the health issue categories.
Discussion
Health issues in the reviews
The most frequent health problems found in the grocery reviews (Table 1) are commonly encountered in daily life. Given that many of the products are foods and drinks, a majority of the problems are symptoms related to the gastrointestinal system. As the two explicit diseases in the list, diabetes and allergy both represent leading chronic conditions that influence people's diet decisions. As an example of their significance, it was estimated that diabetes and prediabetes cost the United States $322 billion every year,31 including direct medical expenditure and indirect loss of productivity. Based on the reviews, a good sign is that consumers do pay attention to the prevention/management of diabetes in their shopping choices. On the other hand, there are still abundant opportunities to better integrate patient education into shopping applications, especially for disease-specific populations.
To give readers a feel for the products frequently associated with health problems, Table 2 provides a couple of product types (instead of original brand/product names) for the top five health conditions in Table 1. Since we considered both positive and negative reviews relevant, the products could be either beneficial or detrimental. A potential application along this line is to systematically collect and organize peer-recommended products per health condition and share them with the concerned communities. Note that this study focuses on grocery goods; we expect that reviews in other product categories, such as sports, would reveal an even wider variety of health issues involving injuries or ergonomics.
The categories summarized in Table 3 conform to general intuition about the issues consumers usually write about. An adverse effect is probably the most serious issue that can be reported about a product, and it constitutes a substantial portion (~20%) of the health-related reviews. Although an adverse reaction can be confounded by inappropriate use or an existing health problem, the reviews may serve as a valuable surveillance source and complement formally collected information. For example, the U.S. Department of Health and Human Services published a report32 that pointed out limitations of the Food and Drug Administration (FDA) adverse event reporting system for non-prescription dietary supplements. Since there is no premarket approval regulation, the FDA mostly relies on consumer-initiated reporting. However, the mechanism is passive and not effective in tracking product information, prevalence, and trends. In contrast, reviews from major e-commerce sites have two advantages: (1) they are created actively by a large consumer population as part of the regular business process, and (2) they cover diverse products and are usually already integrated with useful information about the products and consumers. Therefore, we believe automated syndromic surveillance over massive consumer reviews would improve sensitivity, amass stronger signals, and provide richer attributes for analytic inquiry.
It is not surprising that a majority (~40%) of the health-related reviews comment on the beneficial effects of using the product. One caveat with these positive reviews, however, is the difference between personally verified benefit and benefit merely based on belief or layman knowledge passed among consumers. A closely related category is reviews that warn of certain health risks, which in many cases may be based not on personal experience but on second-hand medical knowledge of unknown sources. The credibility of such health information and its influence over consumer decisions are interesting research topics to investigate.33 The other two categories relate to existing health issues of the consumers themselves or of a close person the purchase was for. It is reasonable that such issues together constitute a substantial 25% of health-related reviews; they justify purchase decisions made to resolve symptoms or avoid adverse effects given an underlying health condition. As discussed above, such problem-specific warnings/recommendations could be systematically collected and shared among populations with pertinent needs.
Challenges in mining health-related information from reviews
Data size can be tamed with advancing technology, but the intricate characteristics of the contents remain the most challenging aspect. During annotation of our training data, we debated what should be defined as health related (relevant). For example, when a review mentioned the absence of side effects from a product, it could actually suggest the relative benefit of avoiding a health problem for a consumer who had a negative experience with another, symptom-triggering product. It can also be difficult to differentiate between normal and abnormal findings.
We observed a considerable number of unexpected errors by the NLP engine, which had been tuned toward processing clinical documents. The general English of consumer reviews appears more diverse than the clinical sublanguage and can easily cause the engine to make mistakes. We summarize the common types of false extractions in Table 4, including many ambiguous terms overlapping with medical usage, such as idioms.
Table 4. False positives (FPs) of NLP extraction.
Future Work
One of the major challenges noticed for the NLP engine was sense ambiguity, as shown in Table 4. In the current study, we used bag-of-words features in the machine learning classifiers, but it would be of great interest to explore additional features, such as those based on word embeddings. Apache Spark MLlib provides an implementation of word2vec, a word embedding technique that computes distributed vector representations of words. Word embeddings have been successfully employed in different NLP applications.34–36 In terms of the classification categories, we did not differentiate customer reviews reporting positive effects of a product from those reporting negative ones. We will seek to develop a finer classifier that can predict the sentiment polarity. To detect weak signals of health conditions that are hypothesized to have certain associations or conflicting management concerns, we plan to mine co-occurring health issues mentioned in the reviews and inspect them for possible valid manifestations. Additionally, we are interested in applying our methods to other product categories, such as sports/outdoors and beauty. Leveraging such knowledge harvested from massive and diverse customer reviews, a website summarizing the reported health issues could be created to guide consumers in their decision making, analogous to ConsumerLab.com,37 which summarizes test results of products related to health and nutrition.
Limitations
Although we set up and employed a scalable text analysis framework for further exploration, we have not taken full advantage of its scalability in the reported work. In terms of the granularity of the target information, our study design of treating both positive and negative reviews as a single relevant class may not align with general interest, and our definition of relevant cases was not free from subjectivity. Additionally, we did not perform systematic cleaning or reconciliation of the manual annotations. Due to limited resources, we did not evaluate individual NLP modules, such as the phrase extractor and the concept searcher. Because our aim was to establish a general framework, we did not explore or customize features specific to the current data or thoroughly tune the parameters of the machine learning classifiers.
Conclusions
The existence of data recording many diverse aspects of our daily activities has opened new opportunities for informatics research on public health monitoring/promotion. In this study, we extracted health-related information from Amazon grocery product reviews by leveraging scalable analytic technologies. Benefiting from a big data scale-out framework, the NLP system completed processing 1.3 million reviews in 6 hours, despite several computationally expensive steps such as sentence parsing and concept searching. A random subset of about 5,000 disease/symptom phrases was manually annotated to train a logistic regression classifier based on Apache Spark MLlib. The high-confidence predictions of the classifier achieved a precision of 0.82. Such a classifier could be used to screen for health-related data in new consumer reviews. The health issues we found in the reviews are useful in terms of (1) complementing existing public health data sources, (2) empowering consumers for better-informed decisions, and (3) providing feedback to product manufacturers for improvement. On the technical side, our content and error analyses pointed out challenges in using NLP to process massive product reviews and in extracting health-related information from unconventional information sources. The study delivered useful insights for future research.
