Abstract
Introduction
Gonorrhea is a sexually transmitted infection (STI) caused by the bacterium
Annual screening is currently recommended for sexually active women under age 25 and for those over age 25 who are at increased risk of infection to identify STIs such as
Risk prediction models for STIs have been developed to predict chlamydia and/or gonorrhea infection among asymptomatic women screened in the sexual health clinic setting, and factors such as younger age, nonwhite ethnicity, multiple sexual partners, and previous infection show reasonable predictive performance during screening.9,10 In the United States, however, only 4% of gonorrhea cases among women were reported from STI clinics in 2022. 11 The remaining 96% were diagnosed by private physicians and in a range of diverse clinical settings, where information on sexual behavior may not be routinely recorded. Other prediction models have been developed utilizing data from a more general population, but have focused specifically on patients who already experienced an STI event.12,13 Improved methods for the identification of women at high risk of contracting gonorrhea (both initial and repeat infections) in the general population could facilitate preventive interventions such as a targeted vaccine roll-out program. A preventive vaccine is currently undergoing clinical trial (ClinicalTrials.gov Identifier: NCT04350138), though none is currently licensed.
Our aim was to use routinely collected administrative claims data to develop and validate a model to predict the subpopulations of young women in the general population who are likely to contract
Methods
Data
Data source
The data were extracted from two sources: the Merative™ MarketScan® Commercial Database (Commercial Claims And Encounters; CCAE) and the Merative™ MarketScan® Medicaid Database (Medicaid). CCAE data include health insurance claims across the continuum of care (e.g., inpatient, outpatient, prescription drug, carve-out healthcare) as well as enrollment data from large employers and health plans across the United States who provide private healthcare coverage for more than 20 million employees and their spouses and dependents yearly and are representative of the employed population (and their dependents) in the United States.14,15 Medicaid is a federal and state program that provides health coverage to people of lower socioeconomic means. 16 The Medicaid database includes over 14 million people and captures the healthcare use of individuals enrolled in Medicaid in 8–12 states in the United States (depending on the year). 15 Both datasets comprise integrated, individual-level healthcare claims from all care providers, including those seen in inpatient, outpatient (including emergency room), and pharmacy settings. The study adheres to the REcording of studies Conducted using Observational Routinely collected health Data statement for PharmacoEpidemiology checklist 17 and the Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis statement checklist. 18 The databases are Health Insurance Portability and Accountability Act compliant and deidentified; as such, this study is exempt from Institutional Review Board approval.
Study population
A single, retrospective cohort was defined. All women in the databases that were aged 16 to 35 years (inclusive) on 1 January 2018, and who had at least two years of continuous enrollment between 1 January 2017 and 31 December 2018 were eligible for inclusion. The 12-month period prior to 1 January 2018, during which predictor variables were defined, was considered the observation period. The 12-month period after and including this date was considered the risk period, during which the outcome was evaluated.
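The eligibility rule above can be sketched in a few lines. This is only an illustration: the field names (`birth_date`, `enrolled_from`, `enrolled_to`) are hypothetical and do not reflect the actual MarketScan layout, and enrollment is simplified to a single uninterrupted span.

```python
from datetime import date

INDEX_DATE = date(2018, 1, 1)      # first day of the risk period
WINDOW_START = date(2017, 1, 1)    # first day of the observation period
WINDOW_END = date(2018, 12, 31)    # last day of the risk period


def age_on(birth_date: date, on: date) -> int:
    """Age in completed years on a given date."""
    had_birthday = (on.month, on.day) >= (birth_date.month, birth_date.day)
    return on.year - birth_date.year - (0 if had_birthday else 1)


def is_eligible(birth_date: date, enrolled_from: date, enrolled_to: date) -> bool:
    """Aged 16 to 35 (inclusive) on the index date and enrolled without
    interruption over the whole 2017-2018 window (assumed single span)."""
    age_ok = 16 <= age_on(birth_date, INDEX_DATE) <= 35
    enrolled_ok = enrolled_from <= WINDOW_START and enrolled_to >= WINDOW_END
    return age_ok and enrolled_ok
```

Predictors would then be derived from claims dated before `INDEX_DATE`, and the outcome from claims dated on or after it.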
Outcome
The outcome was defined according to diagnostic
Predictors
Candidate predictors were identified from the literature. 20 In brief, within the claims database, codes directly reflecting sexual behaviors associated with STIs such as number of sex partners, use of condoms, sexual orientation, and specific sexual behaviors are not available. However, other individual-level attributes, behaviors, and social determinants that contribute to the risk of STI can be identified. Available predictors were characterized as proximal or distal based on the gonorrhea risk framework originally proposed by Gomez et al. (2014) for sex workers, 21 and adapted here in the absence, to our knowledge, of a more general framework (Figure 1). To maximize reproducibility and generalizability, predictors were explicitly defined according to

Figure 1. Conceptualizing candidate predictors as proximal or distal.
Furthermore, we hypothesized that degrees of social deprivation at the community level could also be predictive of gonorrhea risk at the individual level. For patients in the CCAE dataset, an exploratory social deprivation variable was derivable at the level of the core-based statistical area (CBSA). 25 The CBSA code was used as a geolocator, and the CCAE database was supplemented with the Social Deprivation Index (SDI) linked at the CBSA level. The SDI is a composite measure of area-level deprivation based on seven demographic characteristics collected in the American Community Survey and used to quantify socioeconomic variation in health outcomes. It is a percentile, ranging from 1 to 100.26,27 It was possible to derive the SDI from Zone Improvement Plan (ZIP) codes for 78% of the patients in the CCAE database. For the Medicaid database, ZIP code information is not available, so linkage to the SDI at the CBSA level could not be performed. Therefore, when geolocation data were missing (for the remaining 22% of subjects in the CCAE database and for all subjects in the Medicaid database), we set the SDI to its median of 50, so that algorithms that do not accept missing predictor values could still be used. Details of this exploratory objective, the SDI, and the strategy for data linkage to the geolocator are in Supplemental Appendix S2.
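The median imputation described above amounts to a lookup with a default. The sketch below assumes a hypothetical dictionary `sdi_by_cbsa` mapping CBSA codes to SDI percentiles; the codes and values in the usage lines are invented for illustration only.

```python
from typing import Dict, Optional

SDI_MEDIAN = 50  # value assigned in the text when the SDI cannot be linked


def sdi_for_patient(cbsa_code: Optional[str], sdi_by_cbsa: Dict[str, int]) -> int:
    """Return the linked SDI percentile, or the median when the patient has
    no geolocator or the geolocator is absent from the lookup table."""
    if cbsa_code is None:
        return SDI_MEDIAN
    return sdi_by_cbsa.get(cbsa_code, SDI_MEDIAN)


# Hypothetical lookup table and patients (illustrative values only).
lookup = {"A0001": 72, "A0002": 18}
linked = sdi_for_patient("A0001", lookup)      # SDI found in the lookup
unlinked = sdi_for_patient(None, lookup)       # no geolocator: median used
```

One consequence of this choice is that the imputed value carries no information for the Medicaid cohort, where every patient receives the same SDI.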
Finally, 29 predictors were identified: 14 proximal predictors and 15 distal predictors. All of the predictors, as well as their categorization as proximal or distal, are summarized in Supplemental Appendix S1 and Table 1.
Table 1. Study population characteristics, for each dataset.
Abbreviations: CCAE: Commercial Claims And Encounters; HIV: human immunodeficiency virus; IQR: interquartile range; PID: pelvic inflammatory disease; PrEP: preexposure prophylaxis; SDI: social deprivation index; STI: sexually transmitted infection.
Statistical analysis methods
Model development
Predictors were first described in terms of their prevalence in both CCAE and Medicaid datasets (Table 1). We developed machine learning (ML) classification models constructed with increasing levels of complexity by training and optimizing algorithms based on logistic regressions and classification trees.
We used the following logistic regression-based algorithms: simple (a standard logistic regression without regularization), least absolute shrinkage and selection operator (LASSO), and ridge logistic regression. 28 Compared to the simple logistic regression, the LASSO and ridge algorithms combat overfitting by using L1 and L2 regularization, respectively, to shrink the parameters toward zero. The L1 penalty can additionally set some parameters exactly to zero, giving a sparser model (i.e., a form of feature selection). Logistic regression models were implemented with and without the transformation of numerical variables (age and SDI) using natural cubic splines at the preprocessing stage. 29
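For reference, the three logistic regression variants can be written as one penalized objective. This is the standard textbook formulation, not notation taken from the study: $y_i$ is the outcome, $x_i$ the predictor vector, and $\lambda \ge 0$ the regularization strength tuned during hyperparameter search.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big]
  + \lambda\, P(\beta),
\qquad
p_i = \frac{1}{1 + \exp\!\big(-(\beta_0 + x_i^{\top}\beta)\big)}

P(\beta) = \sum_{j} |\beta_j| \quad \text{(LASSO, L1)}
\qquad
P(\beta) = \sum_{j} \beta_j^{2} \quad \text{(ridge, L2)}
```

Setting $\lambda = 0$ recovers the simple, unregularized logistic regression.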
We used the following tree-based algorithms: a simple decision tree, random forest (an ensemble of decision trees), and extreme gradient boosting (XGBoost), a regularized implementation of gradient tree boosting. 30
The data were unbalanced. The outcome, gonorrhea infection, was present in only 0.15% of the CCAE population. To optimally train the different models to distinguish between cases (the minority class) and noncases (the majority class), we introduced at the preprocessing stage two additional steps to balance the CCAE development set at each iteration of model optimization: first, oversampling of the minority class to a proportion
All the models were trained and tested with a small number of proximal predictors as well as with the addition of distal predictors (Figure 1 and Table 1).
Model evaluation and validation
CCAE data were first stratified and randomly split into a development set (80%) and a hold-out test dataset (20%) with the same proportion of the outcome in each set. Then, for each algorithm, hyperparameters were tuned on the development set and the performance of the resulting model evaluated using stratified 10-fold cross-validation (CV) with the receiver operating characteristic (ROC) area under the curve (AUC) score. For each algorithm, the best model was the one achieving the highest 10-fold CV ROC AUC. The hyperparameter search was done using the Tree-structured Parzen Estimator algorithm included in the Hyperopt package.32,33
The best models for each algorithm were compared using 10-fold CV ROC AUC scores, first using only proximal predictors, and then adding distal predictors as well. Models were then evaluated on CCAE data using ROC AUC scores for the full development and hold-out CCAE datasets. Finally, models were externally validated on the Medicaid dataset.
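The two building blocks of this evaluation, stratified fold assignment and the ROC AUC score, can be sketched without any ML library. The functions below are illustrative stand-ins written for this example, not the code used in the study; the ROC AUC is computed with the usual rank-based (Mann-Whitney) formulation.

```python
import random
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random case is scored above a random noncase
    (ties count one half), computed from average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[int]:
    """Assign each record to one of k folds, spreading cases and noncases
    as evenly as possible across folds."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of
```

In a cross-validation loop, each fold in turn would serve as the validation set, and the `roc_auc` values would be averaged over the 10 folds. Stratification matters here because, with an outcome this rare, an unstratified split could leave some folds with very few cases.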
Contributions of each single predictor to predictions by best models were evaluated using Shapley values. 34 Evaluation was done on a random sample of 100,000 patients in the CCAE development dataset.
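To make the idea of a Shapley value concrete, the sketch below computes exact Shapley values for a single prediction by enumerating all coalitions of predictors. It is a toy illustration under stated assumptions: the text does not say which implementation was used, exact enumeration is only feasible for a handful of predictors, and `toy_risk` is a hypothetical model with invented coefficients, not the fitted model.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, List


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   x: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values for one prediction. Predictors outside a
    coalition are set to their baseline value."""
    names: List[str] = list(x)
    n = len(names)

    def value(coalition: set) -> float:
        row = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(row)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(row: Dict[str, float]) -> float:
    """Hypothetical risk score with an interaction term (illustration only)."""
    return (0.02 * row["young"] + 0.01 * row["screened"]
            + 0.03 * row["young"] * row["screened"])


phi = shapley_values(toy_risk,
                     {"young": 1, "screened": 1},
                     {"young": 0, "screened": 0})
```

The contributions in `phi` sum to the difference between the prediction for the patient and the prediction at the baseline, which is the property that makes Shapley values usable for ranking predictors by impact.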
Results
Approximately 1.95 million women in the CCAE database and 0.81 million in the Medicaid database met inclusion criteria (Table 1). The predictive outcome, diagnosis of gonococcal disease in 2018, was present in 0.15% (
The average age of patients was lower in the Medicaid cohort than in the CCAE cohort: 22.4 years (interquartile range [IQR]: 16–28) for Medicaid and 23.5 years (IQR: 18–29) for CCAE (ages in 2017). Furthermore, average age was lower among patients with diagnosed gonococcal disease versus the overall cohorts: 22.0 years (IQR: 17–26) versus 22.4 years (IQR: 16–28) for Medicaid and 22.2 years (IQR: 19–24) versus 23.5 years (IQR: 18–29) for CCAE. Social deprivation measured through the SDI was higher in patients with diagnosed gonococcal disease versus overall for CCAE; SDI data were not available for Medicaid. Compared to the overall cohorts, gonococcal cases also had a higher frequency of previously diagnosed STIs, acute PID, and high-risk sexual behavior (Table 1). While screening and antibiotic treatments for chlamydia, gonorrhea, and other STIs were more common in patients with diagnosed gonococcal disease versus overall, vaccination against STIs (e.g., hepatitis A and B) was slightly less frequent in these patients. Additional predictors are summarized in Table 1.
Table 2 reports 10-fold CV ROC AUC scores obtained with the optimal set of hyperparameter values for each model tested. As regression models without splines and the decision tree model produced markedly smaller ROC AUC values when using proximal predictors, they were not tested among the models with both proximal and distal predictors.
Table 2. Best performing model by algorithm.
Abbreviations: AUC: area under the curve; CCAE: Commercial Claims And Encounters; CV: cross-validation; LASSO: least absolute shrinkage and selection operator; ROC: receiver operating characteristic; XGBoost: extreme gradient boosting.
Using both proximal and distal predictors improved CV scores by 2–3 percentage points. The best performance was achieved using an XGBoost model (10-fold CV ROC AUC: 79.47%). The second-best performing model was the LASSO logistic regression (10-fold CV ROC AUC: 79.11%). Table 2 shows that ROC AUC scores recalculated on the full development set confirm the models’ performance hierarchy. When prediction models were applied to the hold-out test set, ROC AUC scores decreased slightly (by less than 1 percentage point) and preserved a similar hierarchy between the models, with XGBoost performing better than other models (ROC AUC: 78.63%). When applied to the Medicaid dataset, which lacked the SDI variable, performance dropped by nearly 4 percentage points (ROC AUC for XGBoost: 74.94%). ROC curves for the XGBoost method and LASSO logistic regression with splines are plotted in Figure 2 for the development CCAE set, the hold-out CCAE set, and the Medicaid dataset. Discrimination on the Medicaid dataset was similar to, though somewhat lower than, that on the hold-out CCAE dataset.

Figure 2. ROC curves for the best models based on XGBoost and LASSO logistic regression with splines. (a) XGBoost, (b) LASSO logistic regression with splines.
Table 3 shows how the discriminatory power of the method varies when the discrimination threshold is tuned to make the model more selective. For example, when the threshold for identifying predicted gonococcal cases in the CCAE development set was raised to select only 0.1% of the population, corresponding to 1562 patients, 161 of these patients were actual gonorrhea cases in 2018 (risk of gonorrhea = 10,307 per 100,000 person-years, nearly 70 times higher than in the original population, where the risk was 148 per 100,000 person-years). With this threshold, gonorrhea is detected with a sensitivity of 7.0% and a specificity of 99.9%. In other words, this method would select the 0.1% of the population that the model predicts to be at highest risk of infection, raising the positive predictive value (the percentage of patients predicted to have gonorrhea in 2018 who actually had it) from 0.15% in the general population to over 10%. Negative predictive values are uniformly high here due to the low rate of the disease in the considered population.
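The threshold analysis above can be reproduced in outline as follows. The function is an illustrative sketch written for this example (names and return format are assumptions); it selects the top fraction of patients by predicted risk and derives the usual confusion-matrix metrics. The last lines recompute the reported risk figures from the counts given in the text.

```python
from typing import Dict, Sequence


def top_fraction_metrics(labels: Sequence[int], scores: Sequence[float],
                         fraction: float) -> Dict[str, float]:
    """Flag the given fraction of patients with the highest predicted risk
    and report discrimination metrics. Assumes both classes are present and
    that at least one patient is left unflagged."""
    n = len(labels)
    n_selected = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    selected = order[:n_selected]
    tp = sum(labels[i] for i in selected)   # flagged patients who are cases
    fp = n_selected - tp                    # flagged patients who are noncases
    fn = sum(labels) - tp                   # cases that were not flagged
    tn = n - n_selected - fn                # noncases that were not flagged
    ppv = tp / n_selected
    prevalence = sum(labels) / n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "risk_ratio": ppv / prevalence,     # risk in flagged group vs overall
    }


# Figures reported in the text for the top 0.1% of the CCAE development set.
risk_per_100k = 161 / 1562 * 100_000        # about 10,307 per 100,000
fold_increase = risk_per_100k / 148         # a little under 70
```

Lowering `fraction` trades sensitivity for positive predictive value, which is the pattern the text describes for Table 3.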
Table 3. Performance of best model for detecting actual cases at different percentiles of population at increasing predicted probability of gonorrhea (CCAE development set).
Abbreviations: CCAE: Commercial Claims And Encounters; NPV: negative predictive value; perc.: percentile; discrim. threshold: discrimination threshold, predictions over this value are identified as cases; PPV: positive predictive value; PY: person-years; sens.: sensitivity; spec.: specificity; XGBoost: extreme gradient boosting.
Some of the predictors contributed more than others to predict gonorrhea cases. The most important predictors are displayed in Figure 3 for the XGBoost model (results were nearly identical for the LASSO logistic regression with splines), ranked by their absolute impact in determining the model's outcome (gonorrhea cases). Age was the most important predictor, followed by having a screening for chlamydia/gonorrhea and the SDI.

Figure 3. Impact of variables on predicted risk of gonorrhea.
Discussion
The models presented here were developed to identify women in the general population who are at risk of contracting gonorrhea and to understand which characteristics are key for their detection. Indeed, clinical and behavioral research on STIs is often conducted in sexual health clinics, where incidence and prevalence of disease is high and predictive factors for gonorrhea are well characterized during screening.35–37 Outside of this setting, however, established risk behaviors may not be queried or recorded (e.g., the family physician may not ask a woman about number of sex partners, or if a sex partner has a history of STI). Limited research has been done to identify behaviors or conditions that may not be directly on the risk pathway but that could reasonably be expected to be associated with an augmented or reduced risk of gonorrhea. To capture the potential of such behaviors and conditions to predict a gonorrhea infection among young women in the general population, ML models were developed, cross-validated, and tested on a sample of nearly 2 million women aged 16–35 years from a database that is representative of the employed population in the United States. Results of the models obtained were further tested in a distinct population of low-income women with a dataset (Medicaid) that was not used in any part of model development. As such, the study population captured the wide range of socioeconomic and demographic groups of women accessing healthcare in the United States. The potential to use large, representative databases of routinely collected data that reflect real-world utilization patterns is appealing, particularly as it allows for identification of an adequate sample of gonorrhea cases diagnosed outside the setting of the STI clinic.
In this study, we confirmed the feasibility of predictive modeling applied to administrative claims data to identify young women at risk of gonorrhea in the United States within a 12-month timeframe. The ROC AUC values obtained on the hold-out test set from the different models, given the predictors available in these databases, were, as could be expected, lower than those obtained on the development dataset. However, the decrease was small (less than 1 percentage point for the models with distal and proximal predictors) and the ROC AUC values remained above 75% for the best models, indicating a strong effect of the combination of the variables considered on the subsequent risk of gonorrhea. The decrease was slightly more substantial when the models were applied to the Medicaid population. Beyond the fact that the model was built from the CCAE population, the contribution of social deprivation, which was the subject of our exploratory objective and, as shown earlier, an important predictor, could not be assessed for the Medicaid population, as the SDI information was not available in the Medicaid dataset. However, the ROC AUCs obtained were still around 75%, indicating that the predictive factors used in the models may also be applicable for this population.
The best predictive results were obtained using XGBoost, a complex “black box” ML model, but it is notable that models based on simple logistic regressions performed very close to XGBoost (10-fold CV score: 76.81% vs 76.93%), with the great advantage that logistic regressions are much more explainable and transparent. The transformation of continuous variables (age and SDI) through splines was decisive for the good performance of logistic regression, increasing the 10-fold CV scores from 73.06% to 76.81% without affecting model explainability.
Use of the best model allowed for easy identification of populations of women at increased risk of gonorrhea. For example, the 0.1% of women where the model-predicted risk of gonorrhea is highest had a nearly 70-fold higher incidence of gonorrhea (10.31%) during the follow-up period than in the general population (0.15%). Pending further external validation with other data sources, findings from these models may be leveraged to identify opportunities for targeted public health interventions that may be most effective in protecting the general United States population from gonorrhea. By identifying older adolescent and young adult women at an increased risk for contracting gonorrhea, measures can be implemented to proactively reach these women, such as at routine healthcare visits, either before infection occurs (e.g., through vaccination) or progresses to symptomatic stages (e.g., through screening).
The prediction models implemented here were based on an a-priori selection of 29 predictors, divided between proximal and distal determinants. The inclusion of the distal predictors improved model performance, and indeed among the top 10 most influential predictors, four were distal: SDI, vaccination for non-STIs, use of other contraceptives, and tobacco use (Figure 3). Even if distal predictors are not directly related to the outcome, they may be correlated with other actual determinants that are not observable. For example, vaccination for non-STIs (ranked as the fourth most influential predictor) increases the chances of being a noncase and could therefore be associated with factors that reduce the risk of contracting gonorrhea. Tobacco use (ninth most influential predictor) was strongly predictive of cases, possibly because of its association with actual determinants that were not included in the data or not observed for some subjects. Interestingly, and even if it was collected from an external data source and assigned based on geolocation, SDI was the third most important predictor, possibly indicating that living in socially deprived areas decisively increases the risk of gonorrhea. Age was clearly the most impactful factor, with women between 19 and 24 years of age being associated with being a case, while younger or older age was more predictive of noncases. Other important predictors were all related to sexual activity, like screenings and medications for gonorrhea and chlamydia, use of contraceptives, and being diagnosed with bacterial vaginosis. A strength of this study is the large size of the databases, and their representativeness of diverse target populations (insured and Medicaid populations) which allowed for an adequate sample of gonorrhea cases diagnosed outside the setting of the STI clinic and reflected real-world utilization patterns among patients with access to healthcare.
This study was conducted using retrospective, routinely collected, administrative claims data and as such is subject to several limitations. Populations most at risk of gonorrhea may be more likely to attend a sexual health clinic and as such are not within the scope of this analysis. Similarly, considering the issue of fairness, asymptomatic, undiagnosed, or unreported gonorrhea, including among patients without healthcare coverage or who do not seek professional care, is also not captured. As the objective of this study was to identify potential intervention points during routine contact with the healthcare system, however, these populations, though likely at elevated risk of gonorrhea, are beyond the scope of the study and would require targeted study and intervention. Misclassification of both predictors and the outcome due to misdiagnosis, inaccuracy of a submitted claims code, or failure to submit a code is a potential source of bias. Furthermore, there may be instances in which providers use vague billing codes to provide testing and treatment services due to confidentiality concerns. While the sensitivity of claims diagnoses for gonorrhea has been shown to be low (9.7%), the presence of a diagnostic code likely reflects a real diagnosis. Additionally, in a low prevalence population as in this study, the specificity for the outcome is likely very high (99.9%), 19 and so the relative risk estimates are likely to be unbiased. 38 To maximize the specificity, we used validated outcome codes, as demonstrated by Ho et al. (2021). 19
Claims relate to specific services, procedures, and prescriptions rendered, and unlike an electronic health record, detailed clinical or behavioral information is not available and predictive factor information may be inadequately recorded. Compared to patients without gonorrhea, patients with gonorrhea could be more or less likely to have a claim submitted related to a predictor or predictive factor (e.g., record of tobacco use or a smoking cessation program). Similarly, frequent testing may indicate higher sexual risk behaviors and correlate with an increased probability of gonorrhea infection, or, conversely, may be representative of those who are most health aware and at lower risk of infection who opt to test frequently due to screening recommendations. The resulting bias, and the direction of this bias, is thus unclear, and we attempted to control for measured confounding through inclusion of a wide range of proximal and distal predictors in the multivariable models. We limited the study to patients who had a full two years of continuous registration to ensure that the outcome was adequately captured. It is possible that patients lost to follow-up were more likely to be infected with gonorrhea; however, the average continuous enrollment period in the CCAE database is over three years, indicating that exclusion rates should be relatively low.
Conclusions
The models tested in this study have the potential to support public health policy-making and planning for gonorrhea prevention. These methods could facilitate the identification of a preventive gonorrhea vaccine target population beyond the population going to STI clinics, should a vaccine become available. Results indicate that important predictive factors are available via routine care observation which, through these models, could determine subpopulations with increased risk of gonorrhea. ML models such as XGBoost provided the best discriminatory results, but simpler models such as ridge regressions with splines also achieved reasonable discrimination, with the advantage of being more transparent and interpretable.
Supplemental Material
Supplemental material, sj-xlsx-1-dhj-10.1177_20552076251331895 for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea by Lorenzo Argante, Germain Lonnet, Emmanuel Aris and Jane Whelan in DIGITAL HEALTH
Supplemental material, sj-docx-2-dhj-10.1177_20552076251331895 for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea by Lorenzo Argante, Germain Lonnet, Emmanuel Aris and Jane Whelan in DIGITAL HEALTH