Sage Journals: Discover world-class research

Abstract

Background

Gonorrhea is a sexually transmitted infection (STI) that, untreated, can result in debilitating complications such as pelvic inflammatory disease, pain, and infertility. A minority of cases are diagnosed in STI clinics in the United States. Gonorrhea is often asymptomatic and presumed to be substantially underdiagnosed and/or undertreated.

Objectives

To generate and compare predictive machine learning (ML) models using administrative claims data to characterize young women in the general United States population who would be most likely to contract gonorrhea.

Methods

Data were extracted from the Merative™ MarketScan® Commercial and Medicaid databases containing routinely collected administrative claims data. Women aged 16–35 years with two years of continuous observation between 1 January 2017 and 31 December 2018 were included. ML classification models were constructed based on logistic regression and tree-based algorithms.

Results

Models constructed using tree-based algorithms such as XGBoost provided the best discrimination, but simpler ridge regression models with splines also achieved reasonable discrimination, allowing for the identification of population subsets at increased risk of gonorrhea infection. The 0.1% of the population at highest predicted risk according to the XGBoost model had a 70-fold higher risk of gonorrhea than the general population. In external validation, the models were applied to a Medicaid dataset that was not used in model development, and their performance was deemed acceptable.

Conclusions

The models and methods presented here could facilitate the identification of women at high risk of contracting gonorrhea for whom targeted preventive measures may be most beneficial.

Keywords

Machine learning, public health, sexual health, prevention, risk factors

Introduction

Gonorrhea is a sexually transmitted infection (STI) caused by the bacterium Neisseria gonorrhoeae (Ng). Left untreated, Ng can ascend the urogenital tract and result in debilitating complications, particularly among women. These include pelvic inflammatory disease (PID) and chronic pelvic pain; sustained inflammation over time causes scarring of the fallopian tube, which can result in ectopic pregnancy and tubal factor infertility.¹ In women, the primary infection is frequently asymptomatic or minimally symptomatic and may go undiagnosed until symptoms or complications worsen. Onward transmission and complications can occur irrespective of the presence or absence of symptoms.¹ In 2022 in the United States, 255,566 cases of gonorrhea among women were reported to the Centers for Disease Control and Prevention.² However, there is likely substantial underdiagnosis and/or underreporting; for instance, the true incidence of gonorrhea in women was estimated to be approximately 853,000 cases in 2018, compared with the reported number of 241,074.^2,3

Annual screening is currently recommended for sexually active women under age 25 and for those over age 25 who are at increased risk of infection to identify STIs such as Chlamydia trachomatis (Ct) and Ng infection.⁴ Screening for chlamydia in women has been shown to potentially reduce the subsequent rate of PID, though evidence on the effect of screening for gonorrhea has not been described.^5–7 Despite the recommendation, screening rates in the United States are suboptimal, with up to 85% of eligible patients not being screened.⁷ Once the infection is diagnosed, complex, resource-intensive interventions such as partner services are frequently recommended despite variable evidence of their effective implementation.⁸

Risk prediction models for STIs have been developed to predict chlamydia and/or gonorrhea infection among asymptomatic women screened in the sexual health clinic setting, and factors such as younger age, nonwhite ethnicity, multiple sexual partners, and previous infection show reasonable predictive performance during screening.^9,10 In the United States, however, only 4% of gonorrhea cases among women were reported from STI clinics in 2022.¹¹ The remaining 96% were diagnosed by private physicians and in a range of diverse clinical settings, where information on sexual behavior may not be routinely recorded. Other prediction models have been developed utilizing data from a more general population, but have focused specifically on patients who already experienced an STI event.^12,13 Improved methods for the identification of women at high risk of contracting gonorrhea (both initial and repeat infections) in the general population could facilitate preventive interventions such as a targeted vaccine roll-out program. A preventive vaccine is currently undergoing clinical trial (ClinicalTrials.gov Identifier: NCT04350138), though none is currently licensed.

Our aim was to use routinely collected administrative claims data to develop and validate a model to predict the subpopulations of young women in the general population who are likely to contract Ng in the United States. If successful, this method could be used to identify potential opportunities for intervention during routine interactions with the healthcare system and to effectively target high-risk women with preventive measures, potentially including vaccination.

Methods

Data

Data source

The data were extracted from two sources: the Merative™ MarketScan® Commercial Database (Commercial Claims And Encounters; CCAE) and the Merative™ MarketScan® Medicaid Database (Medicaid). CCAE data include health insurance claims across the continuum of care (e.g., inpatient, outpatient, prescription drug, carve-out healthcare) as well as enrollment data from large employers and health plans across the United States that provide private healthcare coverage for more than 20 million employees and their spouses and dependents yearly; these data are representative of the employed population (and their dependents) in the United States.^14,15 Medicaid is a federal and state program that provides health coverage to people of lower socioeconomic means.¹⁶ The Medicaid database includes over 14 million people and captures the healthcare use of individuals enrolled in Medicaid in 8–12 states in the United States (depending on the year).¹⁵ Both datasets comprise integrated, individual-level healthcare claims from all care providers, including those seen in inpatient, outpatient (including emergency room), and pharmacy settings. The study adheres to the REcording of studies Conducted using Observational Routinely collected health Data statement for PharmacoEpidemiology checklist¹⁷ and the Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis statement checklist.¹⁸ The databases are Health Insurance Portability and Accountability Act compliant and deidentified; as such, this study is exempt from Institutional Review Board approval.

Study population

A single, retrospective cohort was defined. All women in the databases who were aged 16 to 35 years (inclusive) on 1 January 2018 and who had two years of continuous enrollment between 1 January 2017 and 31 December 2018 were eligible for inclusion. The 12-month period prior to 1 January 2018, during which predictor variables were defined, was considered the observation period. The 12-month period after and including this date was considered the risk period, during which the outcome was evaluated.
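The eligibility rule above can be sketched as follows. This is a minimal illustration, not the study's extraction code: the function names are hypothetical, and it assumes each patient has an exact birth date and a single enrollment span, which may not hold in the claims databases.

```python
from datetime import date

INDEX_DATE = date(2018, 1, 1)     # first day of the risk period
WINDOW_START = date(2017, 1, 1)   # first day of the observation period
WINDOW_END = date(2018, 12, 31)   # last day of the risk period

def age_on(birth: date, on: date) -> int:
    """Completed years of age on a given date."""
    had_birthday = (on.month, on.day) >= (birth.month, birth.day)
    return on.year - birth.year - (0 if had_birthday else 1)

def is_eligible(birth: date, enrolled_from: date, enrolled_to: date) -> bool:
    """Aged 16-35 (inclusive) on the index date and enrolled without a gap
    over the whole two-year window (assumes one enrollment span per patient)."""
    covers_window = enrolled_from <= WINDOW_START and enrolled_to >= WINDOW_END
    return 16 <= age_on(birth, INDEX_DATE) <= 35 and covers_window
```

Under this sketch, a woman whose enrollment began after 1 January 2017 or ended before 31 December 2018 is excluded regardless of age.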

Outcome

The outcome was defined according to diagnostic International Classification of Diseases, Tenth Revision (ICD-10) codes that have been previously validated for gonorrhea diagnoses in administrative claims data.¹⁹ For each woman, the outcome was a diagnosis of at least one episode of gonorrhea during the risk period (1 January 2018 to 31 December 2018). It was coded 1 if present, or 0 if absent.

Predictors

Candidate predictors were identified from the literature.²⁰ In brief, within the claims database, codes directly reflecting sexual behaviors associated with STIs such as number of sex partners, use of condoms, sexual orientation, and specific sexual behaviors are not available. However, other individual-level attributes, behaviors, and social determinants that contribute to the risk of STI can be identified. Available predictors were characterized as proximal or distal based on the gonorrhea risk framework originally proposed by Gomez et al. (2014) for sex workers,²¹ and adapted here in the absence, to our knowledge, of a more general framework (Figure 1). To maximize reproducibility and generalizability, predictors were explicitly defined according to ICD-10 Clinical Modification (ICD-10-CM) codes,²² prescription and procedure codes,²³ and/or Clinical Classifications Software Refined codes,²⁴ as detailed in Supplemental Appendix S1. The presence of a claim was coded as 1 and the absence of that claim as 0. Patient age in years on 1 January 2018 was standardized to range between 0 and 1, where 0 corresponds to 16 years and 1 corresponds to 35 years. There were no missing data among the above predictors.
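The two encodings described above (a 0/1 flag per claim-based predictor and age rescaled to the unit interval) can be illustrated with a short sketch. The function names and the placeholder codes are hypothetical; the actual code lists are those in Supplemental Appendix S1.

```python
def scale_age(age_years, lo=16, hi=35):
    """Min-max scaling: 16 years maps to 0 and 35 years maps to 1."""
    return (age_years - lo) / (hi - lo)

def claim_flag(patient_codes, predictor_codes):
    """1 if the patient has at least one claim code that defines the
    predictor during the observation period, otherwise 0."""
    return int(bool(set(patient_codes) & set(predictor_codes)))

# Placeholder codes for illustration only (not the study's code lists).
example_predictor = {"CODE_A", "CODE_B"}
example_patient = {"CODE_B", "CODE_X"}
flag = claim_flag(example_patient, example_predictor)   # 1
age_scaled = scale_age(25)                              # 9/19, about 0.47
```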

Figure 1.

Conceptualizing candidate predictors as proximal or distal.

Furthermore, we hypothesized that degrees of social deprivation at the community level could also be predictive of gonorrhea risk at the individual level. For patients in the CCAE dataset, an exploratory social deprivation variable was derivable at the level of the core-based statistical area (CBSA).²⁵ The CBSA code was used as a geolocator, and the CCAE database was supplemented with the Social Deprivation Index (SDI) linked at the CBSA level. The SDI is a composite measure of area-level deprivation based on seven demographic characteristics collected in the American Community Survey and is used to quantify socioeconomic variation in health outcomes. It is a percentile, ranging from 1 to 100.^26,27 It was possible to derive the SDI from Zone Improvement Plan (ZIP) codes for 78% of the patients in the CCAE database. For the Medicaid database, ZIP code information is not available, so linkage to the SDI at the CBSA level could not be performed. Therefore, when geolocation data were missing (for the remaining 22% of subjects in the CCAE database and for all subjects in the Medicaid database), we set the SDI to its median of 50 so that models that do not accept missing predictor values could still be fitted. Details of this exploratory objective, the SDI, and the strategy for data linkage to the geolocator are in Supplemental Appendix S2.
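The fallback to the median can be sketched in a few lines. The lookup table and function name below are hypothetical stand-ins for the linkage described in Supplemental Appendix S2; only the constant 50 comes from the text.

```python
SDI_MEDIAN = 50  # value assigned when no geolocation is available

def sdi_or_median(geo_code, sdi_lookup):
    """Return the linked SDI for a patient's geolocator, or the median
    when the geolocator is missing or has no SDI entry."""
    if geo_code is None:
        return SDI_MEDIAN
    return sdi_lookup.get(geo_code, SDI_MEDIAN)

# Hypothetical lookup table for illustration.
toy_lookup = {"00001": 72, "00002": 31}
```

With this rule, every Medicaid patient receives an SDI of 50, so the variable carries no information in that dataset.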

Finally, 29 predictors were identified: 14 proximal predictors and 15 distal predictors. All of the predictors, as well as their categorization as proximal or distal, are summarized in Supplemental Appendix S1 and Table 1.

Table 1.

Study population characteristics, for each dataset.

	CCAE database		Medicaid database
	Overall	Gonococcal cases in 2018	Overall	Gonococcal cases in 2018
Population
Number of patients	1,951,454	2882 (0.15%)	813,301	5471 (0.67%)
Proximal predictors, year 2017
Age, years	23.5 (18–29)	22.2 (19–24)	22.4 (16–28)	22.0 (17–26)
Gonococcal infection	2737 (0.14%)	265 (9.20%)	5468 (0.67%)	469 (8.57%)
Chlamydia infection	11,043 (0.57%)	239 (8.29%)	12,099 (1.49%)	539 (9.85%)
Acute PID	4476 (0.23%)	40 (1.39%)	6211 (0.76%)	153 (2.80%)
HIV infection	729 (0.04%)	10 (0.35%)	1482 (0.18%)	28 (0.51%)
Hepatitis virus infection	2992 (0.15%)	13 (0.45%)	7536 (0.93%)	83 (1.52%)
Bacterial vaginosis	109,193 (5.60%)	631 (21.89%)	78,838 (9.69%)	1443 (26.38%)
Other STIs	22,181 (1.14%)	176 (6.11%)	29,635 (3.64%)	703 (12.85%)
High-risk sexual behavior	12,555 (0.64%)	144 (5.00%)	19,276 (2.37%)	530 (9.69%)
Screening for chlamydia and/or gonorrhea	593,952 (30.44%)	1769 (61.38%)	304,823 (37.48%)	3675 (67.17%)
Screening for other STIs	234,627 (12.02%)	827 (28.70%)	140,006 (17.21%)	1961 (35.84%)
Vaccinations for STIs	141,553 (7.25%)	190 (6.59%)	53,036 (6.52%)	316 (5.78%)
Antibiotic treatment for chlamydia and/or gonorrhea	251,126 (12.87%)	679 (23.56%)	113,514 (13.96%)	1475 (26.96%)
PrEP	782 (0.04%)	6 (0.21%)	1183 (0.15%)	19 (0.35%)
Distal predictors, year 2017
Social deprivation index (SDI)	49.7 (35–64)	54.5 (44–68)	Not available	Not available
Use of intrauterine contraceptives	86,335 (4.42%)	186 (6.45%)	33,262 (4.09%)	358 (6.54%)
Sterilization	6912 (0.35%)	12 (0.42%)	12,132 (1.49%)	62 (1.13%)
Use of other contraceptives	288,270 (14.77%)	884 (30.67%)	173,759 (21.36%)	2171 (39.68%)
Procreative management	36,152 (1.85%)	39 (1.35%)	6376 (0.78%)	58 (1.06%)
Counseling for sexual behavior and orientation	4079 (0.21%)	23 (0.80%)	1447 (0.18%)	61 (1.11%)
Vaccinations for non-STIs	608,882 (31.20%)	668 (23.18%)	202,569 (24.91%)	1381 (25.24%)
Treatments for opiate addiction	1315 (0.07%)	5 (0.17%)	6995 (0.86%)	48 (0.88%)
Tobacco use	43,262 (2.22%)	241 (8.36%)	133,961 (16.47%)	1434 (26.21%)
History of tobacco use	14,547 (0.75%)	36 (1.25%)	32,184 (3.96%)	376 (6.87%)
Alcohol and drugs use	50,044 (2.56%)	217 (7.53%)	81,205 (9.98%)	1002 (18.31%)
Diagnoses for mental health status	423,060 (21.68%)	747 (25.92%)	276,595 (34.01%)	2162 (39.52%)
Maltreatment/abuse	2299 (0.12%)	20 (0.69%)	6052 (0.74%)	117 (2.14%)
Antibiotic resistance	272 (0.01%)	1 (0.03%)	307 (0.04%)	1 (0.02%)
Individual level socioeconomic factors	12,857 (0.66%)	57 (1.98%)	18,438 (2.27%)	261 (4.77%)

Abbreviations: CCAE: Commercial Claims And Encounters; HIV: human immunodeficiency virus; IQR: interquartile range; PID: pelvic inflammatory disease; PrEP: preexposure prophylaxis; SDI: social deprivation index; STI: sexually transmitted infection.

Note. “Overall” columns report number of patients in the study population from Merative™ MarketScan® Commercial (CCAE) and Medicaid datasets, their age and associated SDI (mean, IQR), the number and prevalence (percentage) of patients with the indicated diagnoses, screenings or vaccinations, for year 2017. The other two columns report the same characteristics for patients who had a gonococcal infection in 2018 (the outcome). For Medicaid, SDI is not available because data did not include geographical location.

Statistical analysis methods

Model development

Predictors were first described in terms of their prevalence in both CCAE and Medicaid datasets (Table 1). We developed machine learning (ML) classification models constructed with increasing levels of complexity by training and optimizing algorithms based on logistic regressions and classification trees.

We used the following logistic regression-based algorithms: simple (a standard logistic regression without regularization), least absolute shrinkage and selection operator (LASSO), and ridge logistic regression.²⁸ Compared to the simple logistic regression, the LASSO and ridge algorithms combat overfitting by using L1 and L2 regularization, respectively, to shrink the parameters toward zero. The L1 penalty can additionally set some coefficients exactly to zero, giving a sparser model (i.e., a form of feature selection), whereas the L2 penalty shrinks coefficients without eliminating them. Logistic regression models were implemented with and without the transformation of numerical variables (age and SDI) using natural cubic splines at the preprocessing stage.²⁹
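For reference, the three logistic regression variants can be written as one penalized objective in its standard textbook form. The notation is an assumption introduced here and does not appear in the original: y_i is the outcome, x_i the predictor vector (after any spline expansion), m the number of coefficients, and λ ≥ 0 the regularization strength, which would be one of the tuned hyperparameters.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}
\Bigl\{ -\sum_{i=1}^{n} \bigl[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \bigr]
+ \lambda\, P(\beta) \Bigr\},
\qquad
\pi_i = \frac{1}{1 + \exp\{-(\beta_0 + x_i^{\top}\beta)\}},
\]
\[
P(\beta) =
\begin{cases}
0 & \text{simple logistic regression},\\[2pt]
\sum_{j=1}^{m} |\beta_j| & \text{LASSO (L1 penalty)},\\[2pt]
\sum_{j=1}^{m} \beta_j^{2} & \text{ridge (L2 penalty)}.
\end{cases}
\]
```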

We used the following tree-based algorithms: a simple decision tree, random forest (an ensemble of decision trees), and extreme gradient boosting (XGBoost), a regularized implementation of gradient tree boosting in which trees are added sequentially to correct the errors of the preceding ones.³⁰

The data were unbalanced: the outcome, gonorrhea infection, was present in only 0.15% of the CCAE population. To optimally train the different models to distinguish between cases (the minority class) and noncases (the majority class), we introduced two additional preprocessing steps to balance the CCAE development set at each iteration of model optimization: first, oversampling of the minority class to a proportion r of the majority class (r ≤ 1), and second, undersampling of the majority class to match the size of the oversampled minority class. The hyperparameter r is therefore the proportion of the majority class that is retained after the two steps. It was tuned together with the other hyperparameters to mitigate the risk that undersampling discards potentially useful data and thereby degrades model performance.³¹
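The two balancing steps can be sketched as below. The text does not state how the minority class was oversampled, so this sketch assumes simple random oversampling with replacement; the function name and the seed argument are illustrative.

```python
import random

def balance(cases, noncases, r, seed=0):
    """Step 1: oversample the minority class (cases) to r * len(noncases).
    Step 2: undersample the majority class (noncases) to the same size.
    Assumes random oversampling with replacement (not stated in the text)."""
    if not 0 < r <= 1:
        raise ValueError("r must satisfy 0 < r <= 1")
    rng = random.Random(seed)
    cases, noncases = list(cases), list(noncases)
    target = max(len(cases), round(r * len(noncases)))
    # Keep every original case, then draw the remainder with replacement.
    oversampled = cases + rng.choices(cases, k=target - len(cases))
    # Draw noncases without replacement so no record is duplicated.
    undersampled = rng.sample(noncases, k=target)
    return oversampled, undersampled
```

With r = 1 no noncase is discarded and the cases are replicated until the classes are equal in size; smaller values of r discard more of the majority class, which is the trade-off that tuning r is meant to manage.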

All the models were trained and tested with a small number of proximal predictors as well as with the addition of distal predictors (Figure 1 and Table 1).

Model evaluation and validation

CCAE data were first stratified and randomly split into a development set (80%) and a hold-out test dataset (20%) with the same proportion of the outcome in each set. Then, for each algorithm, hyperparameters were tuned on the development set and the performance of the resulting model evaluated using stratified 10-fold cross-validation (CV) with the receiver operating characteristic (ROC) area under the curve (AUC) score. For each algorithm, the best model was the one achieving the highest 10-fold CV ROC AUC. The hyperparameter search was done using the Tree-structured Parzen Estimator (TPE) algorithm included in the Hyperopt package.^32,33
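The stratified split and the ROC AUC score can be illustrated with a dependency-free sketch. It is not the study's pipeline: the function names are hypothetical, the hyperparameter search is not reproduced, and the pairwise AUC computation is quadratic in the sample size, so it is suitable only for small examples.

```python
import random

def stratified_split(y, test_fraction=0.2, seed=0):
    """Split indices into development and test sets, sampling each
    outcome class separately so both sets keep the same outcome rate."""
    rng = random.Random(seed)
    dev, test = [], []
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        test.extend(idx[:n_test])
        dev.extend(idx[n_test:])
    return sorted(dev), sorted(test)

def roc_auc(y_true, scores):
    """ROC AUC as the probability that a randomly chosen case receives a
    higher score than a randomly chosen noncase (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For instance, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, because three of the four case-noncase pairs are ranked correctly.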

The best models for each algorithm were compared using 10-fold CV ROC AUC scores, first using only proximal predictors, and then adding distal predictors as well. Models were then evaluated on CCAE data using ROC AUC scores for the full development and hold-out CCAE datasets. Finally, models were externally validated on the Medicaid dataset.

Contributions of each single predictor to predictions by best models were evaluated using Shapley values.³⁴ Evaluation was done on a random sample of 100,000 patients in the CCAE development dataset.
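As background on the attribution method, a Shapley value is a predictor's marginal contribution to a prediction, averaged over all subsets of the other predictors. The sketch below computes exact Shapley values for a toy model by enumerating subsets, with absent predictors replaced by a single baseline value. This is a simplifying assumption made here and is not necessarily how the cited method handles absent predictors; exact enumeration is also exponential in the number of predictors, so practical tools rely on faster algorithms or approximations.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation x, where predictors that
    are 'absent' from a coalition take their baseline value."""
    n = len(x)

    def value(present):
        z = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

For an additive toy model such as `lambda z: 0.5 * z[0] + 2 * z[1] - z[2]`, evaluated at `[1, 1, 1]` against a baseline of `[0, 0, 0]`, the values recover the coefficients, and in general they sum to the difference between the prediction and the baseline prediction.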

Results

Approximately 1.95 million women in the CCAE database and 0.81 million in the Medicaid database met inclusion criteria (Table 1). The predictive outcome, diagnosis of gonococcal disease in 2018, was present in 0.15% (n = 2882) of patients in the CCAE dataset and 0.67% (n = 5471) of patients in the Medicaid dataset. Similar incidences were observed in 2017: 0.14% and 0.67% for CCAE and Medicaid, respectively. For both datasets, approximately 9% of patients who had diagnoses of gonococcal infection in 2018 had a diagnosis of gonococcal infection in the prior year.

The average age of patients was lower in the Medicaid cohort than in the CCAE cohort: 22.4 years (interquartile range [IQR]: 16–28) for Medicaid and 23.5 years (IQR: 18–29) for CCAE (ages in 2017). Furthermore, average age was lower among patients with diagnosed gonococcal disease versus the overall cohorts: 22.0 years (IQR: 17–26) versus 22.4 years (IQR: 16–28) for Medicaid and 22.2 years (IQR: 19–24) versus 23.5 years (IQR: 18–29) for CCAE. Social deprivation measured through the SDI was higher in patients with diagnosed gonococcal disease versus overall for CCAE; SDI data were not available for Medicaid. Compared to the overall cohorts, gonococcal cases also had a higher frequency of previously diagnosed STIs, acute PID, and high-risk sexual behavior (Table 1). While screening and antibiotic treatments for chlamydia, gonorrhea, and other STIs were more common in patients with diagnosed gonococcal disease versus overall, vaccination against STIs (e.g., hepatitis A and B) was slightly less frequent in these patients. Additional predictors are summarized in Table 1.

Table 2 reports 10-fold CV ROC AUC scores for all models tested, each at its optimal set of hyperparameter values. As regression models without splines and the decision tree model produced markedly smaller ROC AUC values when using proximal predictors, they were not tested among the models with both proximal and distal predictors.

Table 2.

Best performing model by algorithm.

	CCAE dataset			Medicaid dataset
Algorithm	Dev. set 10-fold CV ROC AUC	Dev. set ROC AUC	Test set ROC AUC	ROC AUC
Using only proximal predictors
XGBoost	76.93%	77.27%	75.72%	73.80%
LASSO logistic regression with splines	76.81%	76.95%	75.42%	73.66%
Ridge logistic regression with splines	76.81%	76.96%	75.41%	73.60%
Simple logistic regression with splines	76.81%	76.87%	75.48%	73.60%
Random forest	76.42%	77.30%	75.57%	73.56%
Decision tree	75.61%	77.18%	75.14%	72.68%
Ridge logistic regression without splines	73.06%	73.08%	71.63%	71.23%
LASSO logistic regression without splines	73.06%	73.12%	71.50%	71.19%
Simple logistic regression without splines	73.06%	73.12%	71.63%	71.23%
Using proximal and distal predictors
XGBoost	79.47%	81.12%	78.63%	74.94%
LASSO logistic regression with splines	79.11%	79.58%	78.42%	74.75%
Random forest	79.08%	83.23%	78.38%	75.06%
Ridge logistic regression with splines	78.77%	79.07%	78.31%	74.79%
Simple logistic regression with splines	78.70%	79.07%	78.19%	74.77%

Abbreviations: AUC: area under the curve; CCAE: Commercial Claims And Encounters; CV: cross-validation; LASSO: least absolute shrinkage and selection operator; ROC: receiver operating characteristic; XGBoost: extreme gradient boosting.

Note. The tested algorithms are ranked by their best 10-fold CV ROC AUC score, obtained by optimizing their hyperparameters to maximize the 10-fold CV ROC AUC on the development set. The procedure was carried out using only proximal predictors and then repeated using the full panel of proximal and distal predictors. The last three columns report ROC AUC scores obtained when internally testing the best models on the full CCAE development set and on the CCAE hold-out test set, and when externally validating the models on the Medicaid data.

Using both proximal and distal predictors improved CV scores by 2–3 percentage points. The best performance was achieved using an XGBoost model (10-fold CV ROC AUC: 79.47%). The second-best performing model was the LASSO logistic regression with splines (10-fold CV ROC AUC: 79.11%). Table 2 shows that ROC AUC scores recalculated on the full development set confirm the models’ performance hierarchy. When prediction models were applied to the hold-out test set, ROC AUC scores decreased slightly relative to the 10-fold CV scores (by less than 1 percentage point) and conserved a similar hierarchy between the models, with XGBoost performing better than other models (ROC AUC: 78.63%). When applied to the Medicaid dataset, which lacked the SDI variable, performance dropped by nearly 4 percentage points (ROC AUC for XGBoost: 74.94%). ROC curves for the XGBoost method and LASSO logistic regression with splines are plotted in Figure 2 for the development CCAE set, the hold-out CCAE set, and the Medicaid dataset. Discrimination on the Medicaid dataset was similar to, though somewhat lower than, that on the hold-out CCAE dataset.

Figure 2.

ROC curves for the best models based on XGBoost and LASSO logistic regression with splines. (a) XGBoost, (b) LASSO logistic regression with splines.

Table 3 shows how the discriminatory power of the method varies when the discrimination threshold is tuned to make the model more selective. For example, when the threshold for identifying predicted gonococcal cases in the CCAE development set was raised to select only 0.1% of the population, corresponding to 1562 patients, 161 of these were actual gonorrhea cases in 2018 (risk of gonorrhea = 10,307 per 100,000 person-years, nearly 70 times higher than in the original population, where the risk was 148 per 100,000 person-years). With this threshold, gonorrhea is detected with a sensitivity of 7.0% and a specificity of 99.9%. In other words, selecting the 0.1% of the population with the highest model-predicted risk of infection raises the positive predictive value (the percentage of selected patients who actually had gonorrhea in 2018) from 0.15% in the general population to over 10%. Negative predictive values are uniformly high here due to the low rate of the disease in the considered population.

Table 3.

Performance of the best model in detecting observed cases among the top percentiles of the population ranked by predicted probability of gonorrhea (CCAE development set).

Perc. (%)	Discrim. threshold	Patients	Observed cases	Incidence per 100,000 PY	Sens. (%)	Spec. (%)	PPV (%)	NPV (%)
95	0.097	1,483,404	2294	154.6	99.5	5.0	0.15	99.98
75	0.192	1,170,990	2238	191.1	97.1	25.0	0.19	99.98
50	0.326	780,623	2067	264.8	89.6	50.1	0.26	99.97
25	0.488	390,401	1634	418.5	70.9	75.1	0.42	99.94
10	0.652	156,124	1133	725.7	49.1	90.1	0.73	99.92
5	0.744	78,062	850	1088.9	36.9	95.1	1.09	99.90
2	0.838	31,231	589	1885.9	25.5	98.0	1.89	99.89
1	0.889	15,612	433	2773.5	18.8	99.0	2.77	99.88
0.5	0.928	7806	329	4214.7	14.3	99.5	4.21	99.87
0.2	0.962	3123	227	7268.7	9.8	99.8	7.27	99.87
0.1	0.981	1562	161	10,307.3	7.0	99.9	10.31	99.86

Abbreviations: CCAE: Commercial Claims And Encounters; NPV: negative predictive value; perc.: percentile; discrim. threshold: discrimination threshold, predictions over this value are identified as cases; PPV: positive predictive value; PY: person-years; sens.: sensitivity; spec.: specificity; XGBoost: extreme gradient boosting.

Note. Performance statistics for the subpopulations of the CCAE development set predicted to be gonorrhea cases by the best-performing XGBoost model, obtained by raising the discrimination threshold so that only a fixed top percentile of the population, ranked by predicted probability of being a case, is selected. The full development set (100th percentile) comprised 1,561,163 patients with 2306 cases (risk = 147.7 per 100,000 person-years).
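The entries in each row of Table 3 follow from four counts: the size of the development set, its number of cases, the number of patients selected at a threshold, and the number of observed cases among them. The sketch below (the function name is illustrative) reproduces the 0.1% row from the counts reported in the table and its note, treating each patient as contributing one person-year over the 12-month risk period.

```python
def threshold_metrics(n_total, cases_total, n_selected, cases_selected):
    """Confusion-matrix summaries for one selection threshold."""
    tp = cases_selected                 # selected and a case
    fp = n_selected - cases_selected    # selected but not a case
    fn = cases_total - cases_selected   # case but not selected
    tn = n_total - n_selected - fn      # neither selected nor a case
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "incidence_per_100k": 1e5 * tp / n_selected,
        "risk_ratio": (tp / n_selected) / (cases_total / n_total),
    }

# Counts from Table 3 (0.1% row) and its note.
m = threshold_metrics(1_561_163, 2306, 1562, 161)
```

These counts give a sensitivity of 7.0%, a specificity of 99.9%, a PPV of 10.31%, an NPV of 99.86%, an incidence of 10,307.3 per 100,000 person-years, and a risk ratio of about 69.8, in agreement with the table and with the "nearly 70 times" figure in the text.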

Some predictors contributed more than others to the prediction of gonorrhea cases. The most important predictors are displayed in Figure 3 for the XGBoost model (results were nearly identical for the LASSO logistic regression with splines), ranked by their absolute impact on the model's output (the predicted risk of being a gonorrhea case). Age was the most important predictor, followed by having a screening for chlamydia/gonorrhea and the SDI.

Figure 3.

Impact of variables on predicted risk of gonorrhea.

Discussion

The models presented here were developed to identify women in the general population who are at risk of contracting gonorrhea and to understand which characteristics are key for their detection. Indeed, clinical and behavioral research on STIs is often conducted in sexual health clinics, where incidence and prevalence of disease are high and predictive factors for gonorrhea are well characterized during screening.^35–37 Outside of this setting, however, established risk behaviors may not be queried or recorded (e.g., the family physician may not ask a woman about number of sex partners, or if a sex partner has a history of STI). Limited research has been done to identify behaviors or conditions that may not be directly on the risk pathway but that could reasonably be expected to be associated with an augmented or reduced risk of gonorrhea. To capture the potential of such behaviors and conditions to predict a gonorrhea infection among young women in the general population, ML models were developed, cross-validated, and tested on a sample of nearly 2 million women aged 16–35 years from a database that is representative of the employed population in the United States. The resulting models were further tested in a distinct population of low-income women using a dataset (Medicaid) that was not used in any part of model development. As such, the study population captured the wide range of socioeconomic and demographic groups of women accessing healthcare in the United States. The potential to use large, representative databases of routinely collected data that reflect real-world utilization patterns is appealing, particularly as it allows for identification of an adequate sample of gonorrhea cases diagnosed outside the setting of the STI clinic.

In this study, we confirmed the feasibility of predictive modeling applied to administrative claims data to identify young women at risk of gonorrhea in the United States within a 12-month timeframe. The ROC AUC values obtained from the different models given the predictors available in these databases on the hold-out test set were, as could be expected, lower than those obtained on the development dataset. However, the decrease was small (less than 1 percentage point relative to the 10-fold CV scores for the models with distal and proximal predictors) and the ROC AUC values remained above 75% for the best models, indicating a strong effect of the combination of the variables considered on the subsequent risk of gonorrhea. The decrease was slightly more substantial when the models were applied to the Medicaid population. Beyond the fact that the models were built from the CCAE population, our exploratory objective of including sociodemographic diversity through the SDI, which as shown earlier was an important predictor, could not be assessed for the Medicaid population because SDI information was not available in the Medicaid dataset. However, the ROC AUCs obtained were still around 75%, indicating that the predictive factors used in the models may also be applicable for this population.

The best predictive results were obtained using XGBoost, a complex “black box” ML model, but it is notable that models based on simple logistic regressions with splines performed very close to XGBoost (10-fold CV score with proximal predictors: 76.81% vs 76.93%), with the great advantage that logistic regressions are much more explainable and transparent. The transformation of continuous variables (age and SDI) through splines was decisive for the good performance of logistic regression, increasing the 10-fold CV scores from 73.06% to 76.81% without affecting model explainability.

Use of the best model allowed for easy identification of populations of women at increased risk of gonorrhea. For example, the 0.1% of women with the highest model-predicted risk of gonorrhea had a nearly 70-fold higher incidence of gonorrhea (10.31%) during the follow-up period than the general population (0.15%). Pending further external validation with other data sources, findings from these models may be leveraged to identify opportunities for targeted public health interventions that may be most effective in protecting the general United States population from gonorrhea. By identifying older adolescent and young adult women at an increased risk for contracting gonorrhea, measures can be implemented to proactively reach these women, such as at routine healthcare visits, either before infection occurs (e.g., through vaccination) or progresses to symptomatic stages (e.g., through screening).

The prediction models implemented here were based on an a priori selection of 29 predictors, divided between proximal and distal determinants. The inclusion of the distal predictors improved model performance, and indeed among the top 10 most influential predictors, four were distal: SDI, vaccination for non-STIs, use of other contraceptives, and tobacco use (Figure 3). Even if distal predictors are not directly related to the outcome, they may be correlated with other actual determinants that are not observable. For example, vaccination for non-STIs (ranked as the fourth most influential predictor) increases the chances of being a noncase and could therefore be associated with factors that reduce the risk of contracting gonorrhea. Tobacco use (ninth most influential predictor) was strongly predictive of cases, possibly because of its association with actual determinants that were not included in the data or not observed for some subjects. Interestingly, even though it was collected from an external data source and assigned based on geolocation, SDI was the third most important predictor, possibly indicating that living in socially deprived areas decisively increases the risk of gonorrhea. Age was clearly the most impactful factor: ages between 19 and 24 years were predictive of being a case, while younger or older ages were more predictive of being a noncase. Other important predictors were all related to sexual activity, such as screenings and medications for gonorrhea and chlamydia, use of contraceptives, and being diagnosed with bacterial vaginosis. A strength of this study is the large size of the databases and their representativeness of diverse target populations (insured and Medicaid populations), which allowed for an adequate sample of gonorrhea cases diagnosed outside the setting of the STI clinic and reflected real-world utilization patterns among patients with access to healthcare.

This study was conducted using retrospective, routinely collected administrative claims data and as such is subject to several limitations. Populations most at risk of gonorrhea may be more likely to attend a sexual health clinic and as such are not within the scope of this analysis. Similarly, with regard to fairness, gonorrhea that is asymptomatic, undiagnosed, or unreported, including among patients without healthcare coverage or who do not seek professional care, is not captured. However, as the objective of this study was to identify potential intervention points during routine contact with the healthcare system, these populations, though likely at elevated risk of gonorrhea, are beyond the scope of the study and would require targeted study and intervention. Misclassification of both predictors and the outcome due to misdiagnosis, inaccuracy of a submitted claims code, or failure to submit a code is a potential source of bias. Furthermore, there may be instances in which providers use vague billing codes for testing and treatment services due to confidentiality concerns. While the sensitivity of claims diagnoses for gonorrhea has been shown to be low (9.7%), the presence of a diagnostic code likely reflects a real diagnosis. Additionally, in a low-prevalence population such as the one in this study, the specificity for the outcome is likely very high (99.9%),¹⁹ and so the relative risk estimates are likely to be unbiased.³⁸ To maximize the specificity, we used validated outcome codes, as demonstrated by Ho et al. (2021).¹⁹
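The reasoning about specificity can be illustrated with a short calculation. Under nondifferential outcome misclassification, the observed risk in a group with true risk p is Se × p + (1 − Sp) × (1 − p), so with perfect specificity the sensitivity cancels out of the risk ratio. The Python sketch below is a minimal illustration of that relationship; the true risks and the lower specificity value are hypothetical and chosen only to show the direction of the effect, and the function names are not from the study.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Expected observed risk when the outcome is misclassified.

    True cases are detected with probability `sensitivity`;
    true noncases are wrongly flagged with probability 1 - specificity.
    """
    return sensitivity * true_risk + (1.0 - specificity) * (1.0 - true_risk)


def observed_risk_ratio(risk_exposed, risk_unexposed, sensitivity, specificity):
    """Risk ratio computed from the misclassified outcome, assuming the
    same sensitivity and specificity in both groups (nondifferential)."""
    return (observed_risk(risk_exposed, sensitivity, specificity)
            / observed_risk(risk_unexposed, sensitivity, specificity))


# Hypothetical true risks in two groups (true risk ratio = 5).
RISK_HIGH, RISK_LOW = 0.10, 0.02
true_rr = RISK_HIGH / RISK_LOW

# Sensitivity of 9.7% as quoted in the text; specificity assumed perfect.
rr_perfect_spec = observed_risk_ratio(RISK_HIGH, RISK_LOW, 0.097, 1.0)

# Same sensitivity with a hypothetical, clearly imperfect specificity.
rr_imperfect_spec = observed_risk_ratio(RISK_HIGH, RISK_LOW, 0.097, 0.95)

print(f"True risk ratio:               {true_rr:.2f}")
print(f"Observed, perfect specificity: {rr_perfect_spec:.2f}")
print(f"Observed, specificity 0.95:    {rr_imperfect_spec:.2f}")
```

With perfect specificity, the low sensitivity scales both observed risks by the same factor and the risk ratio is preserved. False positives add the same amount to both groups and pull the observed ratio toward 1, which is why maximizing specificity matters more than sensitivity for relative estimates.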

Claims relate to specific services, procedures, and prescriptions rendered; unlike an electronic health record, they do not contain detailed clinical or behavioral information, and predictive factor information may be inadequately recorded. Compared to patients without gonorrhea, patients with gonorrhea could be more or less likely to have a claim submitted related to a predictive factor (e.g., a record of tobacco use or a smoking cessation program). Similarly, frequent testing may indicate higher sexual risk behaviors and correlate with an increased probability of gonorrhea infection or, conversely, may be representative of those who are most health aware and at lower risk of infection and who opt to test frequently due to screening recommendations. The magnitude and direction of the resulting bias are thus unclear; we attempted to control for measured confounding by including a wide range of proximal and distal predictors in the multivariable models. We limited the study to patients who had a full two years of continuous enrollment to ensure that the outcome was adequately captured. It is possible that patients lost to follow-up were more likely to be infected with gonorrhea; however, the average continuous enrollment period in the CCAE database is over three years, indicating that exclusion rates should be relatively low.

Conclusions

The models tested in this study have the potential to support public health policy-making and planning for gonorrhea prevention. These methods could facilitate the identification of a target population for a preventive gonorrhea vaccine beyond the population attending STI clinics, should a vaccine become available. Results indicate that important predictive factors are available via routine care observation and that, through these models, they could be used to identify subpopulations at increased risk of gonorrhea. ML models such as XGBoost provided the best discriminatory results, but simpler models such as ridge regression with splines also achieved reasonable discrimination, with the advantage of being more transparent and interpretable.

Supplemental Material

sj-xlsx-1-dhj-10.1177_20552076251331895 - Supplemental material for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea

Supplemental material, sj-xlsx-1-dhj-10.1177_20552076251331895 for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea by Lorenzo Argante, Germain Lonnet, Emmanuel Aris and Jane Whelan in DIGITAL HEALTH

Supplemental Material

sj-docx-2-dhj-10.1177_20552076251331895 - Supplemental material for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea

Supplemental material, sj-docx-2-dhj-10.1177_20552076251331895 for Beyond the STI clinic: Use of administrative claims data and machine learning to develop and validate patient-level prediction models for gonorrhea by Lorenzo Argante, Germain Lonnet, Emmanuel Aris and Jane Whelan in DIGITAL HEALTH

Footnotes

Acknowledgements

The authors thank Costello Medical for editorial assistance and publication coordination, on behalf of GSK, and acknowledge Anna Zolotor, Costello Medical, United States for medical writing and editorial assistance based on authors’ input and direction.

ORCID iD

Lorenzo Argante

Emmanuel Aris

Ethical considerations

The study adheres to the REcording of studies Conducted using Observational Routinely collected health Data statement for PharmacoEpidemiology checklist and the Transparent Reporting of multivariable prediction model for Individual Prognosis Or Diagnosis statement checklist. The databases are Health Insurance Portability and Accountability Act compliant and deidentified; as such, this study is exempt from Institutional Review Board (IRB) approval.

Author contributions/CRediT

Substantial contributions to study conception and design: LA, GL, EA, and JW; substantial contributions to analysis and interpretation of the data: LA, GL, EA, and JW; drafting the article or revising it critically for important intellectual content: LA, GL, EA, and JW; final approval of the version of the article to be published: LA, GL, EA, and JW.

Funding

This analysis was funded by GSK (Analysis identifier: 223766). Support for third-party writing assistance for this article was provided by Anna Zolotor, Costello Medical, United States in accordance with Good Publication Practice (GPP3) guidelines (

Conflicting interests

LA: employed by GSK; GL and JW: employed by GSK at the time of the study; EA: employed by GSK and holds financial equities in GSK.

Data availability

The data generated during the current analysis are available from the corresponding author on reasonable request upon prior approval from Merative™ MarketScan®, the data owner.

Supplemental material

Supplemental material for this article is available online.

References

1. Unemo, Seifert, Hook 3rd, et al. Gonorrhoea. Nat Rev Dis Primers 2019; 5: 79.

2. Centers for Disease Control and Prevention. Sexually transmitted infections (STIs): Table 15. Gonorrhea — Reported cases and rates of reported cases by age group and sex, United States, 2018–2022. 2024, https://www.cdc.gov/sti-statistics/media/pdfs/2024/11/2022-STI-Surveillance-Report-PDF.pdf (accessed January 2025).

3. Kreisel, Spicknall, Gargano, et al. Sexually transmitted infections among US women and men: prevalence and incidence estimates, 2018. Sex Transm Dis 2021; 48: 208–214.

4. Davidson, Barry, Mangione, et al. Screening for chlamydia and gonorrhea: US preventive services task force recommendation statement. JAMA 2021; 326: 949–956.

5. Scholes, Stergachis, Heidrich, et al. Prevention of pelvic inflammatory disease by screening for cervical chlamydial infection. N Engl J Med 1996; 334: 1362–1366.

6. Oakeshott, Kerry, Aghaizu, et al. Randomised controlled trial of screening for Chlamydia trachomatis to prevent pelvic inflammatory disease: the POPI (prevention of pelvic infection) trial. Br Med J 2010; 340: c1642.

7. Nash. Sexually transmitted infections: compelling case for an improved screening strategy. Popul Health Manag 2017; 20: S1–S11.

8. Hogben, Collins, Hoots, et al. Partner services in sexually transmitted disease prevention programs: a review. Sex Transm Dis 2016; 43: S53–S62.

9. Falasinnu, Gilbert, Gustafson, et al. Deriving and validating a risk estimation tool for screening asymptomatic chlamydia and gonorrhea. Sex Transm Dis 2014; 41: 706–712.

10. Latt, Soe, et al. Identifying individuals at high risk for HIV and sexually transmitted infections with an artificial intelligence-based risk assessment tool. Open Forum Infect Dis 2024; 11: ofae011.

11. Centers for Disease Control and Prevention. Sexually transmitted infections (STIs): data points. 2024, https://www.cdc.gov/sti-statistics/media/pdfs/2024/11/2022-STI-Surveillance-Report-PDF.pdf (accessed January 2025).

12. Elder, Gruber, Willis, et al. Can machine learning help identify patients at risk for recurrent sexually transmitted infections? Sex Transm Dis 2021; 48: 56–62.

13. Saldana, Burkhardt, Pennisi, et al. Development of a machine learning modeling tool for predicting HIV incidence using public health data from a county in the southern United States. Clin Infect Dis 2024; 79: 717–726.

14. SOURCE: Truven Health MarketScan® research databases. Accessible at https://theclearcenter.org/wp-content/uploads/2020/01/IBM-MarketScan-User-Guide.pdf.

15. GSK. Data on file.

16. Centers for Medicare & Medicaid Services. Medicaid.gov, https://www.medicaid.gov/ (accessed August 2024).

17. Langan, Schmidt, Wing, et al. The reporting of studies conducted using observational routinely collected health data statement for pharmacoepidemiology (RECORD-PE). Br Med J 2018; 363: k3532.

18. Collins, Moons KGM, Dhiman, et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. Br Med J 2024; 385: e078378.

19. Ho, Rahurkar, Tao, et al. Validation of international classification of diseases, tenth revision, clinical modification codes for identifying cases of chlamydia and gonorrhea. Sex Transm Dis 2021; 48: 335–340.

20. Hogben, Leichliter, Aral. An overview of social and behavioral determinants of STI. In: Cristaudo, Giuliani (eds) Sexually transmitted infections: advances in understanding and management. Cham: Springer International Publishing, 2020, pp.25–45.

21. Gomez, Ward, Garnett. Risk pathways for gonorrhea acquisition in sex workers: can we distinguish confounding from an exposure effect using a priori hypotheses? J Infect Dis 2014; 210: S579–S585.

22. World Health Organization. International statistical classification of diseases and related health problems 10th revision. Geneva, Switzerland: WHO, 2019.

23. American Medical Association. CPT® codes, https://www.ama-assn.org/topics/cpt-codes (accessed August 2024).

24. Clinical Classifications Software Refined (CCSR). Healthcare cost and utilization project (HCUP). 2021, www.hcup-us.ahrq.gov/toolssoftware/ccsr/ccs_refined.jsp (accessed April 2022).

25. National Bureau of Economic Research. SSA to federal information processing series (FIPS) core-based statistical area (CBSA) and metropolitan and micropolitan statistical area (MSA) county crosswalk. 2022, https://www.nber.org/research/data/ssa-federal-information-processing-series-fips-core-based-statistical-area-cbsa-and-metropolitan-and (accessed April 2022).

26. Robert Graham Center - Policy Studies in Family Medicine & Primary Care. Social deprivation index (SDI). 2018, https://www.graham-center.org/maps-data-tools/social-deprivation-index.html (accessed August 2024).

27. United States Census Bureau. American community survey (ACS). 2024, https://www.census.gov/programs-surveys/acs/ (accessed August 2024).

28. James, Witten, Hastie, et al. An introduction to statistical learning. 2nd ed. New York, NY: Springer, 2021.

29. Hastie, Tibshirani, Friedman. The elements of statistical learning. Data mining, inference and prediction. 2nd ed. New York, NY: Springer, 2009.

30. Kufel, Bargiel-Laczek, Kocot, et al. What is machine learning, artificial neural networks and deep learning? Examples of practical applications in medicine. Diagnostics (Basel) 2023; 13: 1–22. DOI: https://doi.org/10.3390/diagnostics13152582.

31. Megahed, Chen, Megahed, et al. The class imbalance problem. Nat Methods 2021; 18: 1269–1272.

32. Bergstra, Bardenet, Bengio, et al. Algorithms for hyper-parameter optimization. In: Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS 2011), pp.2546–2554. Red Hook, NY: Curran Associates Inc.

33. Bergstra, Yamins, Cox. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Proc 30th Int Conf Mach Learn, PMLR 2013; 28: 115–123.

34. Lundberg, Erion, Chen, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2020; 2: 56–67.

35. Ribeiro, de Sousa, Medina, et al. Prevalence of gonorrhea and chlamydia in a community clinic for men who have sex with men in Lisbon, Portugal. Int J STD AIDS 2019; 30: 951–959.

36. Gallo, Macaluso, Warner, et al. Bacterial vaginosis, gonorrhea, and chlamydial infection among women attending a sexually transmitted disease clinic: a longitudinal analysis of possible causal links. Ann Epidemiol 2012; 22: 213–220.

37. Javanbakht, Gorbach, Stirland, et al. Prevalence and correlates of rectal chlamydia and gonorrhea among female clients at sexually transmitted disease clinics. Sex Transm Dis 2012; 39: 917–922.

38. MacLehose, Lash, et al. Non-differential misclassification of outcome under (near)-perfect specificity: a simulation study. Am J Epidemiol 2024 (online ahead of print). DOI: https://doi.org/10.1093/aje/kwae328
