Abstract
Introduction
Diquat (DQ) (6,7-dihydrodipyrido[1,2-a:2′,1′-c]pyrazine-5,8-diium dibromide) is a non-selective bipyridine herbicide widely used around the world. Following the ban on paraquat, DQ has increasingly been adopted as an alternative in the market, leading to a surge in poisoning cases. 1 Similar to paraquat, DQ is highly toxic to humans. 2 Once poisoned, patients require early interventions such as hemoperfusion to prevent rapid progression to liver and kidney damage 3 and subsequent multiple organ failure. 4
In the face of increasing DQ poisoning cases, clinicians often confront dual challenges: identifying the specific type of herbicide poisoning when patients cannot provide accurate information about the poisoning incident, and assessing the severity of the patient’s poisoning. Although there have been reports on detecting DQ levels,5,6 it remains unclear whether plasma concentrations can predict the severity of poisoning early on.
Currently, diagnosing DQ poisoning primarily relies on measuring blood concentrations using HPLC-DAD, 7 gas chromatography-mass spectrometry (GC-MS), 8 and LC-MS. 9 However, these methods depend on expensive and complex equipment. Consequently, measuring DQ concentrations is not routine practice in the majority of hospitals. This highlights the urgent need for diagnostic and assessment methods that do not depend on such equipment.
The complete blood count (CBC) test is one of the most fundamental clinical examinations performed in hospitals, providing a comprehensive overview of an individual’s hematological health. It plays a critical role in diagnosing and monitoring numerous medical conditions. For example, it has been used to support the triage of patients with primary versus secondary headache, 10 bacteremia detection, 11 and the prediction of positive blood culture results. 12 Machine learning algorithms, particularly those designed for binary classification tasks, are powerful tools that have been widely applied across various domains.13–15 In healthcare, machine learning has been used effectively to predict patient outcomes, such as prognosis evaluation in patients suffering from paraquat poisoning using clinical data. 16
To address the diagnostic gap in DQ poisoning, our study integrates clinical laboratory practices with advanced analytics. We employed an established LC-MS method to monitor plasma concentrations of DQ in patients suffering from DQ poisoning. By analyzing these concentrations alongside changes in CBC parameters, we used the random forest algorithm, a robust ensemble learning method for binary classification, to establish a method for early diagnosis and prognosis of DQ poisoning.
Subjects and methods
Ethics statement and subjects
This study was approved by the Medical Ethics Committee of The First Affiliated Hospital of Wenzhou Medical University (approval number: KY2021-R121). Patients with DQ poisoning who had a history of contact with DQ and were hospitalized in our emergency intensive care unit from January 1, 2019, to December 31, 2023, were included. A similar number of healthy subjects were included as the control group. All data were recorded and standardized in a Microsoft Excel spreadsheet by two medical students who did not know the purpose of this investigation.
Determination and analysis of DQ and CBC
The DQ plasma concentration was determined as follows: 0.1 mL of plasma was precipitated with 0.3 mL of acetonitrile. After centrifugation, DQ was separated on a BEH C18 column (2.1 mm × 50 mm, 1.7 μm) using a mobile phase consisting of 0.1% formic acid in water and acetonitrile. Detection was performed in multiple reaction monitoring (MRM) mode at m/z 183.1/157.1 with an ESI ion source in negative-positive mode. For comprehensive details, please refer to the relevant published articles.17,18 The CBC indices were measured using a BC-5500 Automatic Blood Cell Analyzer.
The time since DQ poisoning and the initial plasma DQ concentrations, which were measured prior to any hemodialysis or perfusion treatments, were included in this study. Both the initial CBC indices and those measured on the second day after treatment were included.
The study’s inclusion criteria comprised individuals with a confirmed contact history of DQ poisoning, who arrived at the emergency room within 24 hours post-exposure, and had not undergone any previous invasive treatments, including hemoperfusion or intravenous drug administration. Conversely, exclusion criteria consisted of cases with an ambiguous history of DQ exposure, those involving mixed pesticide poisoning or poisoning through skin contact, as well as patients with a history of hematological disorders or presenting symptoms of bacterial or viral infections. 19
Based on treatment outcomes assessed via vital signs (temperature, respirations, heart rate, blood pressure) and blood indices (arterial blood gas analysis, liver and kidney function), DQ poisoned patients were categorized into either a survival or a deceased group. The correlation between DQ concentration and CBC parameters was investigated using Spearman’s correlation analysis.
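As a minimal sketch of what Spearman's correlation computes, the coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The helper names `average_ranks` and `spearman_rho` and the example values are hypothetical illustrations, not study data or the authors' code; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only (not study data)
dq_conc = [0.2, 1.5, 3.0, 6.4, 12.1]   # plasma DQ concentration
wbc = [6.1, 9.8, 8.9, 15.2, 21.7]      # WBC count
rho = spearman_rho(dq_conc, wbc)
```

Because only ranks enter the calculation, the coefficient captures monotonic association and is insensitive to the skewed distribution that plasma concentrations typically show.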
Dataset construction and processing
The overall dataset includes the DQ concentration, the time of DQ poisoning, age, and CBC indices. The CBC consists of three series, white blood cell (WBC), red blood cell (RBC), and platelet (PLT), encompassing 24 indices. Because these indices may be interrelated, and to improve model prediction and the accuracy of the analysis, we added the following ratios to the dataset: the ratio of WBC count to RBC count (WBC/RBC), the ratio of WBC count to platelet count (WBC/PLT), the ratio of platelet count to RBC count (PLT/RBC), the ratio of neutrophil count to lymphocyte count (AVNG/AVLC), the ratio of red cell distribution width to hematocrit (RDW/HCT), and the ratio of granulocyte percentage to lymphocyte percentage (PNG/PLC). The continuous variables in the dataset were normalized, and missing values were replaced by the mean. The full names and abbreviations of the CBC indices are listed in Supplementary Table 1.
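The preprocessing steps described above (derived ratios, mean imputation, normalization) could look roughly like the following sketch. The values are hypothetical, and two details are assumptions rather than facts from the study: ratios are computed before imputation, and min-max scaling stands in for the unspecified normalization.

```python
import numpy as np

# Hypothetical CBC values for four patients (not study data); np.nan = missing
cbc = {
    "WBC": np.array([6.0, 12.0, np.nan, 18.0]),
    "RBC": np.array([4.0, 5.0, 4.5, 4.0]),
    "PLT": np.array([200.0, 250.0, 180.0, 300.0]),
}


def add_ratio(table, numerator, denominator):
    """Add a derived 'A/B' column computed element-wise."""
    table[f"{numerator}/{denominator}"] = table[numerator] / table[denominator]


def impute_mean(column):
    """Replace missing entries with the mean of the observed entries."""
    return np.where(np.isnan(column), np.nanmean(column), column)


def min_max(column):
    """Scale a column to [0, 1]; min-max scaling is an assumed choice."""
    span = column.max() - column.min()
    return (column - column.min()) / span if span > 0 else np.zeros_like(column)


# Three of the six ratios named in the text, as examples
for num, den in [("WBC", "RBC"), ("WBC", "PLT"), ("PLT", "RBC")]:
    add_ratio(cbc, num, den)

processed = {name: min_max(impute_mean(col)) for name, col in cbc.items()}
```

In a train/test setting, the imputation means and scaling ranges would normally be estimated on the training portion only and then applied to the test portion, to avoid leaking information from the test set.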
Prognostic discriminant model of random forest
In terms of dataset composition, three types of datasets were used separately: DQ concentration, CBC, and a combination of DQ + CBC. All three datasets utilized the patient’s prognosis status as the target variable. In model selection, the random forest algorithm was used, which is an ensemble learning method that makes the final decision by constructing and combining the predictions of multiple decision trees. It can effectively reduce the risk of overfitting in individual models, has strong generalization ability, and is robust.
During the modeling process, the entire dataset was divided into a 70% training set and a 30% testing set, with the random seed set to 42 to ensure reproducibility of the experimental results. To determine the optimal parameter configuration for the random forest model, we employed grid search, examining parameters such as n_estimators (the number of decision trees), max_depth (the maximum depth of a single tree), min_samples_split (the minimum number of samples required to split an internal node), and min_samples_leaf (the minimum number of samples required to be at a leaf node).
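The split and grid search could be sketched as follows, assuming scikit-learn (the parameter names in the text match its `RandomForestClassifier`). The synthetic data, the candidate values in `param_grid`, the stratified split, and the 5-fold inner cross-validation are all assumptions for illustration; the text does not state the grid that was actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and outcome labels (not study data)
X, y = make_classification(
    n_samples=84, n_features=10, n_informative=4, random_state=42
)

# 70% training / 30% testing with a fixed seed; stratification is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Hypothetical candidate values for the four parameters named in the text
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # inner cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # accuracy on the held-out 30%
```

Tuning on the training portion alone keeps the 30% test set untouched until the final evaluation.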
Feature selection and model evaluation
The evaluation indicators of the model included accuracy, recall, precision, F1 score, and the area under the receiver operating characteristic curve (AUC-ROC). Because the CBC comprises many indices, some of which are redundant, we applied feature selection to the original data to identify the key CBC indices for modeling. The previously developed random forest model was then applied to evaluate the predictive power of the selected feature set.
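For reference, the evaluation indicators listed above can be computed from a set of labels and predicted scores as in the sketch below. The function names and example values are hypothetical, and the 0.5 decision threshold is an assumption; the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_roc(y_true, scores):
    """P(random positive scores above random negative); ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities (not study data)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

Unlike the other four indicators, the AUC does not depend on the decision threshold, which makes it useful when the two outcome groups differ in size.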
To obtain a reliable evaluation, k-fold cross-validation (CV) 20 was used to assess the performance of the algorithms. The initial sample is divided into K subsamples; in each round, a single subsample is reserved as validation data and the remaining K-1 subsamples are used for training, so that every subsample serves as validation data exactly once.
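The partitioning behind k-fold cross-validation can be illustrated with a short sketch. The function name `k_fold_indices` is hypothetical, and the example splits the samples in their original order; in practice the samples are usually shuffled (and often stratified by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx); each sample is validated exactly once."""
    # Distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
```

Averaging the metric over the K validation folds gives a performance estimate that depends less on any single split than one fixed train/test partition does.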
Model comparison
A comparative analysis was conducted with alternative algorithms, including k-Nearest Neighbors (k-NN), 21 support vector machines (SVM), 22 and decision trees (DT), 23 to assess the relative performance of the random forest model in the context of the prognostic discriminant task. These algorithms were trained and tested on the same dataset partitions and evaluated using the same metrics as the random forest model.
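A comparison of this kind might be set up as in the following sketch, assuming scikit-learn. The synthetic data, the default hyperparameters, and the feature scaling applied before k-NN and SVM are assumptions for illustration, not the configuration used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the study data
X, y = make_classification(
    n_samples=120, n_features=10, n_informative=4, random_state=42
)

models = {
    # Distance- and margin-based models are scaled first (assumed choice)
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
}

# cv=5 uses the same deterministic stratified folds for every model,
# so all four classifiers are scored on identical partitions
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```

Evaluating every model on identical folds means that differences in the scores reflect the algorithms rather than the luck of a particular split.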
The experiments were conducted on an Intel(R) Core(TM) i7-1260P CPU @ 2.10 GHz with 16 GB of RAM running the Windows 11 operating system. All algorithms were coded and run in Jupyter Notebook.
Results
General information and statistical analysis
Blood biochemical indexes of deceased and survival DQ-poisoned patients.

Typical mass spectrometric chromatogram of DQ (A) and plasma DQ concentrations of 84 DQ poisoning patients (B).
Spearman’s correlation analysis revealed strong correlations between CBC indices and DQ concentrations, with the WBC and neutrophil granulocyte (AVNG) counts being the two most strongly correlated indices (Figure 2). This indicates that CBC indices have diagnostic value in DQ poisoning.
Spearman’s correlation of WBC and AVNG with DQ plasma concentrations.
Random forest prediction model
According to the CBC indices, a random forest model was developed. A repeated cross-validation strategy was adopted: to keep the proportions of the sample categories consistent between the training and validation sets, all experiments were repeated 5 times with 5-fold stratified sampling. The results showed that prediction accuracy did not improve further once the number of decision trees exceeded 150 or the maximum depth of a single tree (max_depth) exceeded 7 (Figure 3(a), (b)). Parameter optimization yielded the final optimal combination: 'n_estimators': 200, 'max_depth': 4, 'min_samples_leaf': 4, and 'min_samples_split': 2.
Correlation of model accuracy with the number of decision trees (A) and the depth of a single tree (B).
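The 5 × 5-fold stratified procedure with the reported parameter combination could be reproduced along the following lines, assuming scikit-learn. The data are a synthetic stand-in and the seed used for the fold assignment is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the CBC dataset (not study data)
X, y = make_classification(
    n_samples=84, n_features=10, n_informative=4, random_state=42
)

# Parameter combination reported in the text
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=4,
    min_samples_leaf=4,
    min_samples_split=2,
    random_state=42,
)

# 5-fold stratified CV repeated 5 times gives 25 validation scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

mean_accuracy = scores.mean()
std_accuracy = scores.std()
```

Reporting the mean together with the spread over the 25 scores shows how stable the model is across different partitions of a small dataset.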
Average classification performance of the three original datasets.
Average classification performance of the original CBC dataset on different days.
Feature selection for CBC
Considering the potential presence of redundant information in the original CBC dataset, we adopted the information gain method for feature selection. This involved calculating the mutual information between each feature and the target variable (prognostic status) to assess the importance of each feature for the classification task. Additionally, we used the random forest algorithm, leveraging its built-in feature importance scores from training to further determine feature weights. The important features identified by the two methods were consistent (Figure 4(a)), with WBC, AVNG, and WBC/RBC emerging as the most critical features for the predictive model.
Feature selection (A) and model prediction performance for different feature sets (B).
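The two feature-ranking approaches described here, mutual information with the target and the random forest's built-in importance scores, could be computed as in the sketch below, assuming scikit-learn. The data and feature names are synthetic placeholders, so the resulting rankings say nothing about the CBC indices themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: with shuffle=False the first two columns are informative
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholders

# Method 1: mutual information between each feature and the class label
mi = mutual_info_classif(X, y, random_state=42)

# Method 2: impurity-based importances from a fitted random forest
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
rf_importance = rf.feature_importances_

# Rank features from most to least important under each method
mi_ranking = [feature_names[i] for i in np.argsort(mi)[::-1]]
rf_ranking = [feature_names[i] for i in np.argsort(rf_importance)[::-1]]
```

Agreement between a model-free score (mutual information) and a model-based score (forest importance) is a useful consistency check, since each has different biases.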
Regarding the impact of the number of features, prediction accuracy did not continue to improve once the feature set had been expanded to 12 features (Figure 4(b)). The feature importance ranking and the performance of models with different numbers of features are presented graphically for easier interpretation.
Average classification performance of CBC dataset after feature selection.
Diagnostic model for CBC
To further evaluate the diagnostic capability of the established random forest prognostic model, 84 healthy individuals (31 males and 53 females) were enrolled, with an average age of 40.96 ± 13.30 years. The developed random forest model was also suitable for the diagnosis of DQ poisoning; the learning curve and the classification boundary for the training set are shown in Figure 5. The model accurately distinguished between patients with DQ poisoning and healthy individuals based on the original CBC dataset. Feature selection analysis on this dataset revealed that the best-performing features were AVNG, WBC, and the ratio of WBC to RBC, which aligned with the feature selection outcomes from the prognostic model.
Learning curve of the random forest classifier (A) and classification boundary (B) for healthy subjects and DQ poisoning patients.
Furthermore, the random forest model consistently demonstrated superior classification performance, outperforming the k-NN, SVM, and DT models on both the CBC dataset and the combined CBC + DQ dataset. Its stability and accuracy, the highest among the models compared, attest to its effectiveness in the prognostic discriminant task.
Figure 6. Average classification performance of k-Nearest Neighbors (k-NN), support vector machines (SVM), decision trees (DT), and random forest (RF) based on the CBC and CBC + DQ datasets.
Discussions
The clinical significance of plasma concentration in DQ poisoning has not been fully elucidated. In this paper, a novel UPLC-MS/MS method for detecting DQ concentration in blood was developed, and the correlation between DQ concentration and patient prognosis was investigated for the first time. Our findings indicate a strong correlation between DQ concentration and patient outcomes. Specifically, when the DQ concentration exceeds 5 μg/mL, there is a significant decrease in patient survival rates, with 40/46 individuals in the deceased group having concentrations above 5 μg/mL. Notably, 3/38 individuals in the survival group had concentrations exceeding 5 μg/mL but were successfully treated. These three individuals had a short duration of poisoning (<4 hours) and received immediate hemoperfusion and other treatments upon hospital admission. They were also young, aged 17, 27, and 29 years, respectively. This suggests that early and rapid removal of DQ is critical to the successful treatment of DQ poisoning, with younger patients having a better chance of recovery.
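To make the counts above concrete, the following sketch treats "concentration above 5 μg/mL" as a simple rule predicting death and derives the usual summary figures from the reported numbers (40 of 46 deceased and 3 of 38 survivors above the threshold). These derived values are illustrative arithmetic on the counts in the text, not results reported by the study.

```python
# Counts taken from the text
deceased_total, deceased_above = 46, 40
survived_total, survived_above = 38, 3

tp = deceased_above                      # deceased, above threshold
fn = deceased_total - deceased_above     # deceased, at or below threshold
fp = survived_above                      # survived, above threshold
tn = survived_total - survived_above     # survived, at or below threshold

sensitivity = tp / (tp + fn)             # 40 / 46
specificity = tn / (tn + fp)             # 35 / 38
ppv = tp / (tp + fp)                     # 40 / 43
accuracy = (tp + tn) / (deceased_total + survived_total)  # 75 / 84

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, "
      f"PPV={ppv:.3f}, accuracy={accuracy:.3f}")
```

On these counts, the single threshold classifies 75 of the 84 patients in line with their observed outcome, which gives a baseline against which multi-feature models can be compared.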
When the body is exposed to toxins or harmful substances, such as paraquat, an acute inflammatory response occurs. WBC, especially neutrophils, which play a crucial role in the immune system, rapidly increase in number to combat infection or toxic stimuli. 24 Correlation analysis indicates that the levels of WBCs and neutrophils are highly correlated with DQ plasma concentration, providing a statistically significant basis for using CBC data to evaluate the prognosis of DQ poisoning.
During the modeling phase, given the complexity of binary classification problems and the outstanding performance and robustness of the random forest algorithm in handling such tasks, a comparative analysis was conducted against algorithms like k-NN, SVM, and DT. The k-NN algorithm assigns a new instance to the class most commonly represented among its k nearest neighbors in the feature space. SVM is a powerful supervised learning algorithm known for constructing hyperplanes that maximize the margin between classes.
Decision trees are a popular, interpretable machine learning method that recursively partitions the feature space into subsets based on the most informative features and associated thresholds. The results demonstrated that the random forest maintained the highest level of prediction precision and accuracy, regardless of whether the CBC + DQ concentration dataset or the CBC dataset was used. Consequently, random forest was selected as our primary modeling tool.
Through careful optimization and adjustment of model parameters, we built an efficient predictive model suitable for the raw CBC dataset. This model can make full use of CBC indicators (such as WBC and AVNG) to predict the prognosis of patients poisoned by DQ. After feature selection eliminated redundant data from the original dataset, the predictive performance of the model was significantly enhanced.
Additionally, we found that although CBC data contain redundant information, the predictive performance of the dataset improves with continuous measurements. Clinically, multiple measurements of CBC can enhance their predictive accuracy. The combination of CBC with DQ plasma concentration can predict patient outcomes with an accuracy rate exceeding 90%. However, in scenarios where some hospitals are unable to conduct DQ concentration monitoring, the predictive accuracy can be increased by repeatedly measuring the patient’s CBC.
Based on this model, we further evaluated the diagnostic value of CBC. The results show that the model’s diagnostic accuracy significantly surpassed its prognostic accuracy. This indicates that CBC tests are suitable for clinical auxiliary diagnosis. The model has the potential to predict the severity of poisoning and likely outcomes more effectively, thereby facilitating prompt intervention and improving management strategies for DQ poisoning patients in clinical practice.
Conclusions
The prognosis of patients poisoned by DQ is closely related to the plasma concentration, with a poor prognosis when the concentration exceeds 5 μg/mL. In CBC tests, WBC and neutrophil counts are highly correlated with the plasma concentration of DQ. In the random forest model, the CBC dataset can be used to accurately determine the presence of DQ poisoning and predict the patient’s prognosis. Repeated CBC testing or inclusion of the DQ concentration can enhance the prediction accuracy.
Supplemental Material
Supplemental Material - Diagnostic and prognostic value of diquat plasma concentration and complete blood count in patients with acute diquat poisoning based on random forest algorithms
Supplemental Material for Diagnostic and prognostic value of diquat plasma concentration and complete blood count in patients with acute diquat poisoning based on random forest algorithms by Hui Hu, Xiaofang Ke, Fangfang Zheng, Minjie You, Tao Zhou, Yanwen Xu, Jiaiying Wu, Shuhua Tong, and Lufeng Hu in Human & Experimental Toxicology