Abstract
Keywords
Introduction
The advent of Web 2.0 technology has revolutionized traditional industries by providing new platforms and channels for promotion, including the healthcare industry. Over the past decade, there has been a growing demand for Internetization in healthcare, as evidenced by the increasing number of adults in the United States using the Internet to access health information.1,2 This demand has spurred the creation and development of online healthcare platforms, which combine Internet technologies with the traditional medical industry to facilitate doctor–patient communication and provide medical services. Well-known examples of such platforms include
Similar issues are common in other types of online platforms, such as online hotel and restaurant platforms, which have been extensively researched in recent years.4,5 In particular, researchers have focused on identifying fake reviews, which mislead readers by offering unobjective and unjust evaluations of target objects. 6 While numerous studies have identified fake reviews on platforms such as Yelp and TripAdvisor, relatively few have focused on identifying fake reviews on online medical platforms. Early research in this area used a dataset of physician reviews constructed by Li et al. 7 but only a few studies have explored the performance of machine learning and deep learning models in detecting fake physician reviews. Moreover, the existing research is limited by the small size of the dataset and the reliance on classical machine learning models such as Support Vector Machine (SVM) for the detection task. 8
In the wake of the COVID-19 pandemic, the role of online medical services has become even more crucial in addressing the uneven distribution of medical resources. To address the gap in the literature on fake physician review detection, this study proposes to construct a new dataset of fake physician reviews using a crowdsourcing approach and real user review data from a well-known online medical platform. The study will then develop a fake physician review detection model using both classical machine learning methods and deep learning methods.
Literature review
Online health community
The use of the Internet to access health care information has become increasingly popular among patients with diseases. 9 The emergence of Medicine 2.0 or Health 2.0 applications, which use Web 2.0 technologies, has enabled Online Health Communities (OHCs) to provide informational and social support for patients.10–13 In addition to patients, physicians also participate in OHCs and provide counseling services to patients. 14
Physicians who participate in OHCs may receive social and financial rewards.15,16 Studies have been conducted on the factors that influence physicians’ rewards from OHCs, 17 as well as on the profitability model of physicians in OHCs from a professional capital perspective. 15 Patients have also been the focus of studies, exploring the positive effects of OHCs on patients, the ways in which patients obtain information they need from OHCs, and how reviews on physicians influence patients’ decisions when choosing a physician to consult within the doctor–patient community.18,19
Physician rating websites
In addition to providing social support, OHCs can also provide medical information and consultation services, including online consultation services, creating an online patient–physician relationship. 20 Patients are consumers in this relationship and evaluate physicians’ services in the same way they evaluate products on e-commerce platforms. 21 Physician Rating Websites (PRWs) have become increasingly popular, with over 40 websites such as Yelp and Angie's List offering patients reviews of healthcare providers. 21 Many patients consult PRWs before choosing a doctor. 22
Initially, physicians were concerned that PRWs would contain inappropriate and untrue negative reviews that would damage their reputation and work. 21 However, research has shown that online physician ratings are generally around 90/100. 23 Studies on PRWs have focused on the impact of online reviews on patients’ choice of physicians, 19 how physicians can increase patient inquiries and profit possibilities through reputation management services or by better building their homepage, and how physicians’ reviews differ between regions and departments. 24
Research has shown that the use of PRWs has increased over the last decade. 25 For PRW users, online reviews strongly influence their decisions, 26 with 65% of German PRW users consulting a doctor based on the ratings provided by these sites. 27 The younger generation is relying more on the Internet when choosing a doctor, with more than a quarter of young parents in the United States reporting that they had selected a pediatrician for their child on the Internet.28,29 Physicians are using reputation management services to construct and defend their online reputation, 30 with some spending more money to achieve higher ratings 31 or encouraging satisfied patients to write positive reviews. 32
While reviews on PRWs have a significant impact on patients’ choice of care, the authenticity and professional validity of reviews on PRWs need to be verified. 33 Identifying suspicious online physician reviews is meaningful for helping patients make medical choice decisions and for the long-term development of online physician review websites. 34
Fake review detection
The phenomenon of fake reviews has become increasingly prevalent in online shopping and review websites, leading to significant concerns about the reliability and authenticity of user reviews. Fake reviews can be categorized into three types: untruthful opinions, reviews on brands only, and nonreviews, as proposed by Jindal and Liu. 6 Detecting fake reviews is a challenging task, mainly due to the difficulty in distinguishing between genuine and fake reviews. Therefore, two research focuses have emerged: the construction of the dataset and the development of fake review detection methods.
In terms of dataset construction, Li et al. 7 proposed two primary methods for constructing fake review datasets, namely, manual annotation and crowdsourcing platforms. Research by Ott et al. 35 demonstrated that it is challenging to identify human-written fake reviews manually, leading to lower accuracy in labeling. To address this issue, Ott et al. 35 created a gold-standard dataset containing 800 reviews, half of which were real and the other half were fake. Subsequently, Li et al. 7 expanded the dataset to include fake reviews from three areas: hotels, restaurants, and doctors, totaling 3032 reviews. Additionally, some studies have used review datasets filtered by review websites.36,37 Table 1 shows the datasets used in previous studies on fake review detection.
Table 1. Datasets for fake review detection.
In terms of fake review detection techniques, previous research has focused on the analysis of fake review texts and the identification of fake reviewers’ behavior. Li et al. 7 found that true reviews contain more nouns, adjectives, and prepositions, while fake reviews constructed through crowdsourcing contain more verbs, adverbs, and pronouns. Based on text features obtained through syntactic analysis, many researchers have combined SVM models or neural network models for fake review detection tasks, achieving better detection accuracy,5,7,35,38,44 among which SVMs perform better than other models in detection tasks with small sample sizes.
Semantic similarity computation is a common approach used in the study of fake review text detection based on semantic analysis. Lau et al. 45 concluded that fake reviews have a tendency to copy each other, and fake review detection can be performed by identifying semantic duplicate reviews. Linguistic Inquiry and Word Count (LIWC), a commonly used tool in research on fake review detection based on stylistic features, is capable of extracting multiple text features, including stylistic features, and mapping 4500 keywords into an 80-dimensional vector. This tool was used by Ott et al. and Li et al. in their studies7,35 and was combined with bag-of-words features to improve the detection effect. The metadata of reviews, including features of attributes other than textual content such as publication time and the number of likes or comments, has been shown to effectively improve the accuracy of fake review detection when combined with textual features.39,41
In conclusion, while existing research has made significant contributions to fake review detection in fields such as hotels and restaurants, research targeting the detection of fake reviews on online healthcare platforms is still underdeveloped. In addition, models used to identify fake reviews in domains such as hotels and restaurants often struggle to achieve better performance when identifying fabricated reviews in medical domains.8,46,47
This study aims to address the research gap by building a specialized dataset and integrating deep learning models to achieve better results in the detection of fake physician reviews. Most of the existing studies on the detection of fake physician reviews are based on the gold-standard dataset constructed by Li et al., 7 which is of high quality but contains only 432 physician reviews. Therefore, this study employs the dataset construction method of Li et al. 7 to build a more sizable and more specialized dataset for the online medical context. By doing so, we can mine more representative features of fake physician reviews and achieve better detection results. Second, in terms of detection methods, previous research has focused on the features of the review text and classical machine learning methods such as SVM and k-nearest neighbor.8,46,47 However, these methods often struggle to achieve better performance in detecting misinformation compared with deep learning methods.48–50 Therefore, this study employs both machine learning models such as Logistic Regression (LR), SVM, Random Forest (RF) and Ridge Regression (Ri), and deep learning models such as Bidirectional Encoder Representations from Transformers (BERT), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN) to achieve better results in the detection of fake physician reviews.
Methods
To construct a dataset of fake online physician reviews, we employed a crowdsourcing approach in which writers were hired to produce fake reviews. In addition, we obtained true online physician reviews by crawling data from a well-known online medical platform (haodf.com) in China using a web crawler tool. We integrated the fake and true reviews to create the experimental dataset for this study. We then employed both classical machine learning algorithms (LR, SVM, RF, and Ri) and deep learning algorithms (BERT, CNN, and RNN) to develop the fake online physician review detection model. We evaluated each model's accuracy, recall, loss value, and other relevant metrics to compare the performance of the different algorithms. Figure 1 shows the process of dataset construction and model development.

Figure 1. Technical route of this study.
Dataset construction
The experimental dataset comprised two parts: the fake review dataset and the true review dataset. To construct the true review dataset, we used a web crawler to collect reviews of physicians in the four departments of internal medicine, surgery, dentistry, and oncology from the homepage of an online medical platform. We randomly selected 4000 records from the crawled physicians’ reviews to construct the true review dataset, using a stratified sampling method to ensure balance between positive and negative cases. The number of reviews for each department in the sample set (465 reviews of internal medicine, 1092 reviews of surgery, 1144 reviews of dentistry, and 1299 reviews of oncology) was assigned based on the ratio of the number of physicians in each department provided by the platform (internal medicine : surgery : dentistry : oncology = 9700 : 22764 : 23840 : 27079).
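The proportional assignment of the 4000 sampled reviews to departments can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `allocate` and the largest-remainder rounding rule are assumptions, since the text does not say how fractional shares were rounded.

```python
def allocate(total, weights):
    """Split `total` across groups in proportion to `weights`.

    Rounding uses the largest-remainder rule (an assumption): every group
    gets the floor of its exact share, and leftover units go to the groups
    with the largest fractional parts.
    """
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


# Physician counts per department, as reported by the platform.
physicians = {
    "internal medicine": 9700,
    "surgery": 22764,
    "dentistry": 23840,
    "oncology": 27079,
}
sample_sizes = allocate(4000, physicians)
print(sample_sizes)
```

Largest-remainder rounding guarantees that the per-department counts add up exactly to the requested sample size, which plain rounding of each share does not.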
To construct the dataset of fake online physician reviews, we drew inspiration from Li et al.'s 7 study. We established several rules to recruit qualified fake review writers. Only experienced users of online medical platforms were invited to write fake reviews. We asked them to create several convincing reviews for online physicians, similar to the approach taken by writers hired to produce fake reviews. This enabled them to simulate fake medical reviews on the platforms as closely as possible.
We solicited both ordinary and expert users to write fake reviews for physicians in four common departments: internal medicine, surgery, dentistry, and oncology. Expert users, who were practicing physicians, produced 100 of the 400 fake reviews, with the remaining 300 written by general users. We screened the fake reviews for review length, detail, and relevance. The threshold for the review length was set at 30 words, based on the average review length of 29.325 words calculated from the reviews of physicians in the four departments on the platform. We screened out fake reviews that were shorter than the threshold. Table 2 presents examples of genuine reviews collected from the platform alongside fake reviews created by writers.
Table 2. Examples of true and fake reviews.
We screened the fake reviews based on their sentiment tendency (positive or negative) as well. The ratio of positive reviews of each physician counted by the platform was used to determine the number of positive and negative reviews in the fake review dataset. The weighted average of positive reviews was calculated to be 99.28%.
After constructing the fake and true review datasets separately, we assigned labels to create a dataset for subsequent classification learning. The composition of experiment datasets is presented in Table 3.
Table 3. Experiment datasets composition.
Feature extraction
The feature extraction method employed in this study is Term Frequency-Inverse Document Frequency (TF-IDF), a weighting technique commonly used in information retrieval and text mining. 51 TF-IDF assigns importance to a term in proportion to its frequency within a specific document and in inverse proportion to its frequency across the entire corpus. It is derived from the TF and IDF values, where TF represents the frequency of a term's occurrence within a given document, as shown in equation (1):
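For orientation, one common formulation takes TF as the count of a term in a document divided by the document's length, IDF as the logarithm of the number of documents divided by the number of documents containing the term, and TF-IDF as their product. The sketch below implements that formulation in plain Python. It is an assumption that this matches the variant used in the study (libraries such as scikit-learn apply additional smoothing), and the token lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / number of documents containing t)
    weight   = tf(t, d) * idf(t)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy token lists (hypothetical, not taken from the study's data).
docs = [
    ["doctor", "ordered", "blood", "test"],
    ["doctor", "very", "kind"],
    ["chemotherapy", "plan", "explained", "doctor"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In this toy corpus "doctor" occurs in every document, so its IDF and therefore its weight is zero, while a term such as "test" that occurs in a single document receives a positive weight.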
Based on the feature importance scores provided by RF, 10 TF-IDF features were selected. This process preserves the TF-IDF features most useful for model recognition while reducing dimensionality. The selected terms are test (e.g., ordinary blood test), electrocardiogram, expectation, abundant, lumpy, chemotherapy, age, will, improve, and medical insurance; Figure 2 shows the importance of each. These terms typically reflect more detailed evaluations of physicians and thus appear more frequently in comments written by actual patients. Additionally, the length of each review was calculated, normalized, and combined with the TF-IDF features to create the final set of text features.

Figure 2. Filtered TF-IDF features.
The patient ratings obtained from the platform include evaluations of treatment effectiveness and the physician's attitude during the consultation. To facilitate model input requirements, we assigned noncontinuous values ranging from 0 to 1 to represent the rating levels. Specifically, “very satisfied” corresponds to 0.9, “satisfied” corresponds to 0.7, “average” corresponds to 0.5, “unsatisfactory” corresponds to 0.3, and “very unsatisfactory” corresponds to 0.1.
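The rating encoding described above amounts to a lookup table. The sketch below shows it together with a length normalization step; the helper names `encode_ratings` and `min_max_normalize` are hypothetical, and min-max scaling is an assumption, since the text does not state which normalization was applied to review length.

```python
# Mapping from rating level to numeric value, as given in the text.
RATING_VALUES = {
    "very satisfied": 0.9,
    "satisfied": 0.7,
    "average": 0.5,
    "unsatisfactory": 0.3,
    "very unsatisfactory": 0.1,
}


def encode_ratings(efficacy, attitude):
    """Encode the two rating dimensions (treatment effectiveness, attitude)."""
    return [RATING_VALUES[efficacy], RATING_VALUES[attitude]]


def min_max_normalize(lengths):
    """Scale review lengths to [0, 1] (assumed normalization method)."""
    low, high = min(lengths), max(lengths)
    if high == low:
        return [0.0 for _ in lengths]
    return [(length - low) / (high - low) for length in lengths]


rating_features = encode_ratings("very satisfied", "average")
length_features = min_max_normalize([30, 165, 300])
print(rating_features, length_features)
```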
In this study, we utilized two sets of features: the filtered TF-IDF features obtained through RF, and the comprehensive features, which combine the filtered TF-IDF features with review length, the patient rating features, and the features extracted by the LIWC package. Table 4 describes the features in detail; the set comprising all the features listed in Table 4 is defined as the comprehensive features.
Table 4. Comprehensive feature set.
Deep learning methods
The application of CNN to text classification by Kim 55 was a major breakthrough in the use of deep learning techniques for text classification tasks. His proposed CNN model consists of four parts: an input layer, a convolutional layer, a pooling layer, and a fully connected layer. CNNs have since been widely used in text classification, and their efficacy and maturity make them a suitable choice as a detection model. Recurrent Neural Networks are more versatile than ordinary neural networks when the input is a sequence. Their basic structure includes an input layer, a hidden layer, and an output layer, and the value of the hidden layer is determined by the input at the current time step and the hidden layer at the previous time step, which takes context into account. Consequently, RNN is chosen as one of the comparative models in this paper.
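The recurrence described above, in which each hidden state depends on the current input and the previous hidden state, can be illustrated with a deliberately tiny scalar example. The function name, the tanh activation, and the weight values are assumptions chosen for illustration; a real RNN uses weight matrices and vector-valued states.

```python
import math


def rnn_hidden_states(inputs, w_in, w_rec, bias=0.0):
    """Run a scalar recurrent cell: h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


# The second input is zero, yet the second state is non-zero because it
# carries information forward from the first time step.
states = rnn_hidden_states([1.0, 0.0], w_in=0.5, w_rec=0.8)
print(states)
```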
Bidirectional Encoder Representations from Transformers (BERT) is a pretrained language representation model that achieved state-of-the-art performance in 11 distinct natural language processing tasks. 56 The BERT model's input layer consists of three types of embeddings: Token Embeddings, Segment Embeddings, and Position Embeddings. These embeddings serve the purpose of converting words into fixed-dimensional vectors, distinguishing sentence pairs, and encoding the sequential nature of input sequences, respectively (Figure 3).
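In BERT the three embeddings are added element-wise to form the vector for each input position. The sketch below shows only that summation with small random lookup tables; the table sizes, the token ids, and the helper names are hypothetical, and real BERT embeddings are learned and much larger.

```python
import random


def make_table(n_rows, dim, rng):
    """Build a toy embedding table of random values (stand-in for learned weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]


def bert_input_vectors(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, segment, and position embeddings for every input position."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[position])
        ])
    return vectors


rng = random.Random(0)
dim = 4
tok_table = make_table(10, dim, rng)  # toy vocabulary of 10 tokens
seg_table = make_table(2, dim, rng)   # segment A / segment B
pos_table = make_table(8, dim, rng)   # up to 8 positions

token_ids = [1, 5, 7, 2]
segment_ids = [0, 0, 0, 0]  # a single review is a single segment
vectors = bert_input_vectors(token_ids, segment_ids, tok_table, seg_table, pos_table)
print(len(vectors), len(vectors[0]))
```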

Figure 3. Structure of BERT-based text classification model.
Experimental design and analysis
Experiment preparation
The experiments were implemented in Python 3.8, using Jupyter Notebook as the development environment. The models were built with the scikit-learn and PyTorch frameworks. The hardware environment featured an Intel Core i5-7200U 2.50 GHz CPU and 8 GB of memory.
In evaluating binary classification models, Accuracy, Precision, Recall, and F-Score are commonly used. These metrics can be calculated using a Confusion Matrix. 57
Given the objective of identifying fake medical reviews in this study, the fake reviews were assigned as positive cases, while the true reviews were designated as negative cases for constructing the confusion matrix. In this study, the detection of fake reviews focused on identifying as many fake reviews as possible to minimize the impact of fake reviews on patients’ choice of physicians. As a result, the F2-Score evaluation index was constructed with a higher weight value of recall R to better suit the actual task. The weighted F2-Score is calculated accordingly. The original F-Score is shown in equation (5):
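The general F-beta score is (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall; β = 1 gives the ordinary F-Score and β = 2 gives the F2-Score, which weights recall more heavily. The sketch below computes both from confusion-matrix counts, with fake reviews as the positive class. The counts are invented for illustration and are not results from this study.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts (fake review = positive)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 90 fake reviews caught, 30 true reviews flagged, 10 fakes missed.
precision, recall = precision_recall(tp=90, fp=30, fn=10)
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(precision, recall, f1, f2)
```

Because recall exceeds precision in this example, the F2-Score comes out higher than the F-Score, which reflects the intent of rewarding models that miss fewer fake reviews.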
To evaluate the models, a 10-fold cross-validation approach was employed. In each validation iteration, 90% of the data served as the training set, while the remaining 10% constituted the test set. This method ensures that each review is used for both training and testing, providing a robust evaluation of the model's performance.
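The fold construction behind 10-fold cross-validation can be sketched in plain Python as follows. The function name `k_fold_indices`, the shuffling seed, and the sample count of 100 are assumptions for illustration; in practice a library routine such as scikit-learn's `KFold` would typically be used.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(100, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each index appears in exactly one test fold and in the training set of the other nine iterations, which is the property the paragraph above relies on.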
Experimental construction
Both machine learning and deep learning models were used in this study to compare and evaluate their performance. Four machine learning methods, namely, LR, SVM, RF, and Ri, were applied to the fake physician review detection task. Each machine learning model was tested with multiple sets of feature set inputs.
Convolutional Neural Network models were constructed with an input layer, embedding layer, 1D convolutional layer, 1D pooling layer, and fully connected layer using Python. The training epochs were set to 6, and 30% (1920/6400) of the data from the training set were discarded to prevent overfitting. The CNN model was fine-tuned in the pooling layer using two methods, MaxPool1D and AveragePool1D, for pooling. The pooling window size was adjusted to the number of rows of the word vector matrix minus 3, 4, and 5, respectively. The experimental settings are as follows: workers (the number of CPU compute cores used for parallelized training) = 4, vector_size(word vector dimension) = 100, min_count(minimum frequency of the considered words) = 3, and window(word context window size) = 4.
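To make the convolution and pooling steps concrete, the sketch below applies a 1D convolution and then max or average pooling to a short sequence. It is a strong simplification: each token is represented by a single number rather than a 100-dimensional word vector, the kernel values are arbitrary, and the helper names `conv1d` and `pool1d` are hypothetical rather than the layers used in the study.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation): output length is len(sequence) - len(kernel) + 1."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def pool1d(values, window, mode="max"):
    """Non-overlapping 1D pooling with the given window size."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda chunk: sum(chunk) / len(chunk)
    return [
        reduce_window(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


sequence = [1, 3, 2, 5, 4, 0]          # one toy value per token
feature_map = conv1d(sequence, [1, -1])
max_pooled = pool1d(feature_map, window=2, mode="max")
avg_pooled = pool1d(feature_map, window=2, mode="average")
print(feature_map, max_pooled, avg_pooled)
```

Max pooling keeps the strongest response in each window, while average pooling smooths over it, which is the trade-off being compared when both pooling methods are tried.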
Additionally, we constructed standard bidirectional RNN and CNN models. The CNN model consisted of an input layer, an embedding layer, a 1D convolutional layer, a 1D pooling layer, and a fully connected layer. The RNN model comprised an input layer, an embedding layer, a bidirectional GRU layer, and a fully connected layer. These models were combined using the concatenate method.
For the BERT model, only certain parameters were fine-tuned, and the adjustable parameters based on the Chinese pretrained BERT model are listed as follows. Considering the mean value and distribution of review text length in the dataset, the max_len parameter was set to 256, and the fill_paddings method was used to complete the sentence length. The batch_size parameter was set to 16, the learning_rate parameter was set to 2e-5, and the epochs parameter was set to 3.
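Fixing the input length at `max_len` means that longer reviews are truncated and shorter ones are padded. The sketch below shows one way to do this together with the attention mask that tells the model which positions are padding. The function name `pad_or_truncate` and the padding id of 0 are assumptions; this is not the `fill_paddings` routine itself.

```python
def pad_or_truncate(token_ids, max_len=256, pad_id=0):
    """Return (ids, attention_mask), both of length max_len.

    pad_id=0 is an assumed padding token id; the mask is 1 for real
    tokens and 0 for padding positions.
    """
    clipped = token_ids[:max_len]
    n_pad = max_len - len(clipped)
    ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return ids, attention_mask


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_len=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 301)), max_len=256)
print(short_ids, short_mask, len(long_ids))
```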
Results
Fake review detection models
The detection results of the machine learning models using filtered TF-IDF features (FF) and comprehensive features (CF), as well as the performance of the deep learning models, are summarized in Table 5. Models with better performance are highlighted in italics.
Table 5. Model performance.
Table 5 demonstrates that the BERT model achieved impressive results, with significantly higher precision and accuracy than the other methods. This highlights the feasibility and effectiveness of the BERT model in detecting fake physician reviews. Among the machine learning models, the RF model exhibited the best F2-Score (90.27%), while the remaining models achieved moderate F2-Scores (>70%). However, most models demonstrated poor performance (<65%) in terms of precision and accuracy when using only the filtered TF-IDF features. The incorporation of constructed features significantly enhanced the performance of the machine learning models across all metrics, albeit with a slight decrease in precision for the RF model.
Compared to the machine learning models, the deep learning models exhibited a more balanced performance across all metrics, without any excessively low scores. Among the deep learning models, BERT and CNN achieved the highest performance (F2-Score > 90%), while the remaining deep learning models showed more moderate performance. Although machine learning models required less training time and yielded satisfactory results, there is still considerable room for improvement in their detection performance (F2-Score ranging from 79.82% to 90.27%). In contrast, BERT showed excellent performance across all metrics and maintained a clear lead of approximately 7.7–18.15 percentage points over the machine learning algorithms in the overall evaluation metric, F2-Score. BERT also outperformed the other deep learning models across all metrics.
Feature importance
Fake review detection studies commonly employ two main types of features: linguistic features of reviews and behavioral features of reviewers. However, since the platform hides users’ personal information, we were unable to obtain the behavioral features of reviewers. Consequently, in this study, we constructed two feature sets primarily based on the linguistic features of reviews as inputs for the machine learning models. These feature sets are the TF-IDF feature set filtered by RF (filtered TF-IDF features) and the comprehensive feature set, which adds the patient rating features, the review length feature, and the linguistic features extracted by LIWC to the filtered TF-IDF features.
For each machine learning model, we inputted these two feature sets and adjusted the model parameters to obtain the best-performing versions. Our experiments revealed that when using the comprehensive feature set as input, each machine learning model exhibited better performance. To further examine the impact of each feature on the detection performance of machine learning models, we employed the SHAP plot 58 for the RF model on the comprehensive features. We employed the Python package

Figure 4. Feature importance by SHAP.
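SHAP attributions are based on Shapley values, which average a feature's marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values by brute force for a toy three-feature model. It is not the SHAP package and not the RF model from this study: the model, the baseline, and the function names are hypothetical, and brute-force enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance`, relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(subset):
        x = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical score with one interaction term between features 0 and 2."""
    return 0.6 * x[0] + 0.3 * x[1] + 0.1 * x[0] * x[2]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model output at the instance and at the baseline, and the interaction term is shared equally between the two features involved.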
Distribution disparities between true and fake reviews
To provide further insights into the differences between true and fake reviews, we visually depicted the distribution of all constructed features, as shown in Figure 5. Notably, the distribution of text lengths exhibited a distinct pattern, with the peak of fake reviews skewed toward longer text lengths compared to true reviews (Figure 5(d) and (e)). This finding corroborates previous research conducted by Jindal and Liu, 6 who discovered that spam tends to have significantly longer lengths than normal emails. Similarly, Rout et al. 59 found that deceptive reviews tend to be lengthier than truthful reviews. In our study, fake reviews exhibited text lengths ranging between 30 and 300, while true reviews had a broader range spanning from 1 to 4500 characters. Text length is a widely employed feature in the detection of fake reviews 42 and has demonstrated its significance in the task of distinguishing between genuine and deceptive reviews. 60 Moreover, text length assists consumers in making more informed judgments regarding the authenticity of reviews.

Figure 5. Distribution of each feature.
Furthermore, we examined the distribution disparities in other characteristics. Specifically, in terms of Efficacy Satisfaction and Attitude Satisfaction indicators, true reviews displayed a higher concentration of the highest ratings, while fake reviews exhibited a more even distribution (Figure 5(a) and (b)). This observation can be attributed to the tendency of patients to hold doctors in high regard and to post their reviews on the platform only when they are highly satisfied with their experience. 61 Additionally, the distribution of other features revealed notable variations. For instance, we observed a wider distribution interval for third-person pronouns in true reviews, indicating a higher frequency of their usage. This usage pattern may be positively correlated with the authenticity of the comments. 62 It is reasonable to infer that an increased utilization of third-person pronouns signifies a closer proximity to an objective statement and a higher degree of truthfulness.
Discussion
Dataset construction
Table 1 demonstrates that previous studies on fake review detection have primarily been centered around fake product reviews or fake hotel reviews, with limited research conducted on fake physician reviews due to the scarcity of established datasets in this domain. This issue was addressed in 2014 when Li et al. 7 constructed a dataset specific to fake physician reviews, a resource that has since been utilized in numerous studies. However, a study by Hao et al. 24 highlighted the differences in review characteristics between Chinese and American PRWs, implying that the findings from existing studies may not be directly applicable to the Chinese context. Consequently, the fake online physician review dataset we have constructed offers a valuable complement to research in this area.
With respect to dataset construction methods, many studies have employed manual annotation to create fake review datasets. However, identifying fake physician reviews through manual annotation could be challenging. To address this concern, we opted for the use of the crowdsourcing method. Furthermore, in terms of dataset composition, we invited both platform users and physicians to participate during dataset construction, a practice that sets our approach apart from some related studies that did not specifically require participant expertise. 36 This strategy serves to enrich the data sources and enhances the dataset's resemblance to the actual scenario of reviews on PRWs.
Classification methods for online fake physician review detection
Previous studies have extensively utilized both machine learning methods and deep learning methods for fake review detection tasks.63,64 Notably, RF, SVM, and CNN have demonstrated commendable performance in fake review detection tasks.65,66 RF is advantageous due to its robustness and ability to handle large datasets with higher dimensionality. 67 It is less likely to overfit compared to individual decision trees, and it provides insights into feature importance. However, RF can be computationally intensive and may not perform well with very high-dimensional sparse data, typical in text analysis. SVM is known for its effectiveness in high-dimensional spaces and its ability to find the optimal separating hyperplane between classes. SVM is particularly effective when the number of dimensions exceeds the number of samples. 68 Nonetheless, it can be less effective with large datasets and may require significant tuning of hyperparameters.
In this study, we selected four machine learning methods (RF, LR, Ri, and SVM) and three deep learning methods (BERT, CNN, and RNN). The experimental results revealed that BERT exhibited the best classification performance in this task, surpassing the other methods employed in this experiment. This result aligns with previous studies on fake review and fake news detection, which have also demonstrated the excellent performance of BERT,69,70 underscoring its ability to effectively extract relevant features from text.
Conclusion
The rapid advancement of online healthcare platforms has revolutionized the process of seeking and receiving medical advice for patients. However, with the increasing number of online reviews for physicians, the prevalence of fake reviews has also risen, posing a threat to the perception of physicians’ quality and undermining trust in online healthcare platforms. Therefore, the detection of fake reviews is of utmost importance in maintaining the credibility of these platforms.
To address the need for effective fake review detection, we employed a crowdsourcing approach to construct a dataset specifically designed for online medical platforms. The dataset created through crowdsourcing exhibited higher accuracy compared to datasets manually labeled or identified through pattern detection methods. Subsequently, we developed a detection model utilizing both classical machine learning and deep learning models. For evaluating the performance of the models in the context of online medical platforms, we utilized the F2-Score evaluation index, which is particularly suitable for fake review detection. Our experimental results revealed that the deep learning model outperformed traditional machine learning models in identifying fake physician reviews on online medical platforms. The deep learning model achieved a precision of 98.36% and an F2-Score of 97.97%. This represents an improvement of 8.16 percentage points in precision and 7.7 percentage points in F2-Score compared to the traditional machine learning model. Therefore, our study makes a valuable contribution by providing an effective fake physician review detection model tailored for online medical platforms.
While this study advances the development of an effective fake physician review detection model, some limitations remain. First, the fake review dataset is relatively small. Although oversampling techniques were employed to mitigate this issue, the synthetic data may not fully capture the complexity of real-world fake reviews. Expanding the scale of the dataset would enhance the generalizability of the model and its ability to detect fake reviews across a wider range of scenarios. Second, the study's dataset is sourced from a single online medical platform, which may limit the model's applicability to other platforms with different user behaviors and review characteristics. Future research should consider collecting and integrating datasets from multiple platforms to ensure the model's robustness and generalizability across diverse contexts. Investigating the differences in user behavior and review characteristics across platforms can also provide valuable insights into how these factors influence the detection of fake reviews. These future endeavors hold the potential to further refine and enhance the capabilities of the fake physician review detection model on online medical platforms.
