Proportional rates models are frequently used for the analysis of recurrent event data with multiple event categories. When some of the event categories are missing, a conventional approach is to either exclude the missing data for a complete-case analysis or employ a parametric model for the missing event type. It is well known that the complete-case analysis is inconsistent when the missingness depends on covariates, and the parametric approach may incur bias when the model is misspecified. In this paper, we aim to provide a more robust approach using a rate proportion method for the imputation of missing event types. We show that the log-odds of the event type can be written as a semiparametric generalized linear model, facilitating a theoretically justified estimation framework. Comprehensive simulation studies were conducted demonstrating the improved performance of the semiparametric method over parametric procedures. Multiple types of Pseudomonas aeruginosa infections of young cystic fibrosis patients were analyzed to demonstrate the feasibility of our proposed approach.
Recurrent event data with multiple categories frequently arise in medical science and population health studies. Different causes of hospitalizations, multiple strains of bacteria infections, and various types of treatment failures all belong to such data type. Taking cystic fibrosis (CF) as an example, recurrent Pseudomonas aeruginosa (PA) infections are commonly observed in patients with CF. PA infection includes mucoid and nonmucoid strains. Without appropriate treatment, the recurrent infections with mucoid strains often become persistent and chronic, causing increased CF mortality and morbidity.1–3 As another example, patients who received renal transplants may have different types of recurrent infections.4 End-stage renal disease patients who received continuous ambulatory peritoneal dialysis may have multiple types of treatment failures that make the patient switch to other dialysis methods.5
When modeling this kind of recurrent event data, a proportional rates model that is conditional only on the current value of covariates is commonly used.6 Let denote the number of recurrent events up to time t for subject i () and event category j (). Let denote the column vector of covariates which are possibly time-varying. A proportional rates model proposed by Cai and Schaubel6 is defined by
where β0 is the regression coefficient and is the baseline rate function for the jth type of the recurrent event. Although β0 is not indexed by event category j, the model is flexible enough to accommodate covariate effects that are specific to the event type in each individual model. For example, if there are two types of recurrent events and one uses q1- and q2-column vector of covariates, and , for the first and second type of recurrent events, respectively, one can define and to specify the individual models, where is a qj-column vector of zeros. Accordingly, one may define , where for j =1, 2.
Let indicate whether subject i with censoring time Ci is under observation at time t, where , and τ is the end of follow-up time. Let denote the observed number of events up to time t. When the event category is fully observed, Cai and Schaubel6 showed that the coefficient β0 in (1) could be consistently estimated by the estimating equations
where with
for d =0, 1, where , and for a column vector a.
However, when the event category is possibly missing, the estimating equation (2) is not feasible since the quantity is not always observable. A naive approach, which uses completely observed data that include only events with known type, can be valid if the event category is missing completely at random, but may give biased results if the missingness depends on the covariates. Schaubel and Cai5 suggested that one rewrite as
where indicates whether the event category is observed, indicates whether the event category is type j, and indicates the total number of events at time t. Note that equals 0 or 1 since they assume events with different types do not occur simultaneously. They further suggested that one replace the unknown quantity with a consistent estimator for , where
One may consider parametric, multinomial logit models for estimation of .5,7 However, the association between the covariates and δij may not be correctly specified, which may lead to inconsistent estimation. This motivates us to develop a more robust method that weakens the impact of model misspecification.
In this paper, we extend the proportional rates method previously developed for estimation of 8 using a semiparametric approach that exploits a special form of the rate proportion of event type j to the overall rate function. Interestingly, under the proportional rates model (1), the ratio of two rate proportions can be expressed as
where and . In fact, model (4) can be viewed as a generalized partially linear model. Under certain regularity conditions, one can estimate β0 and simultaneously via semiparametric regression techniques such as local polynomials,9 generalized additive models,10 and polynomial spline functions.11
When there is no missing data in the event type, one can estimate using all of the data based on model (4). With missing event types, however, one can only estimate using completely observed data, where
Letting denote the probability of non-missingness given that the event type j occurs, one can show
where is the log-ratio of non-missingness for two event types. Since model (5) is time-varying, estimation is complicated. A common assumption in the previous literature is that does not depend on j and for each j.5,7,8 This assumption corresponds to missing at random (MAR) assumption when the missingness does not depend on unobserved information.12 In this paper, we adopt the same assumption and assume .
Theoretical challenge remains when the semiparametric estimator of is substituted for the unknown in the estimating equations for β. Specifically, it is not clear if one can still obtain a convergence rate in the estimation of β since the convergence rate of estimation is generally slower than with a semiparametric approach. We will show that, under mild regularity conditions, our estimator of β converges at a rate to a normal distribution with variance that may be consistently estimated using a simple plug-in formula.
The remaining sections are organized as follows. In Section 2, we exploit a cubic B-spline function for the estimation of and propose general estimating equations for the regression coefficient β0 and baseline mean function in model (1). Consistency and large sample normality of the estimators are shown in Section 3. Finite-sample performances evaluated by comprehensive simulation experiments are studied in Section 4. A real-data analysis on multiple types of PA infections in the United States 2016 Cystic Fibrosis Foundation Patient Registry is presented in Section 5. Conclusions and discussions on future research are presented in Section 6.
2 Estimation method
Assume that can be approximated by a cubic B-spline function
where are basis functions, m is the number of interior knots, and is a vector of spline coefficients. Let , where . One can estimate θ0 by maximizing an approximate log-likelihood function
where
with , where is a column vector with in the jth block for , and for all t. The number of interior knots m can be selected by Akaike information criteria (AIC) that minimizes plus two times the number of parameters in θ. However, other approaches such as generalized cross-validation that approximates the leave-one-out cross-validation may also be considered.13
Letting denote the maximizer of (6), one can estimate by
By solving the estimating equations
where , one can obtain our proposed estimator for β0. With β replaced by in the estimating equation
one can obtain an empirical estimator for the baseline mean function , where
Note that, although β is denoted the same in the log-rate ratio model (4) and proportional rates model (1), the β in the model (4) may not be fully identifiable. The covariate with β in model (4) is , which is the difference between and . If a covariate, for example, age is included and has the common effect on the rate function for all event categories, then the corresponding Xij equals 0, and consequently, the corresponding component in β is not identifiable. Another situation is when a covariate is included in rate model for all event categories, but the effects are different for different event category. In this situation, what is estimable in β in model (4) is the difference between the effects of that covariate for different categories and not the effects themselves. Taking J =2 for example, one can write and , where is the covariate in the rates models for both event categories, for example, gender, and , where βj is the effect of on rate function for event category j for j =1, 2. The difference between and is , and one can write . From this expression, we can see that only the contrast can be estimated from model (4), not β1 and β2 individually. Including some same set of covariates in the rate models for some event categories is common in practice. Therefore, it is not feasible to use the log-likelihood function (6) to estimate β in general. However, even though β in model (4) is not identifiable, our proposed method can still work because can still be consistently estimated using function (6).
Also note that the proportional rates model (1) with a baseline rate function specific for each event type is quite general. One may restrict the model with the baseline rate function to be proportional to a reference category J, meaning , where is the baseline rate function of the reference category. The model can be written as
where is a column vector of 1 in the jth element and 0 otherwise with as the corresponding coefficient. One can show that the formula (5) becomes
and estimate β0 and using completely observed data via parametric estimating equations
where
assuming for .
With and solving equation (11), one can obtain a more efficient estimator for β0 and γ0 using all of the events by replacing the unknown quantity with . This yields the estimating equations
where with
for d =0, 1, and
By comparing models (5) and (10), one can conduct a statistical test for the proportionality of the baseline rate functions by testing if is constant for all t, i.e., testing the null hypothesis for . Using our approach, the null hypothesis is equivalent to for each j, which can be tested via a Wald-type test procedure. Under a more restricted model (9), the fully parametric model (10) for is somewhat different from the one proposed by Schaubel and Cai.5 The fully parametric model for in Schaubel and Cai5 includes more covariates than model (10), such as time of event occurrence t and number of previous events .
3 Asymptotic theory
Large sample properties of the semiparametric estimators and using model (5) will be derived in this section. The developments are more challenging than those in Schaubel and Cai,5 which covers only the more restrictive parametric model (10). We first state our notations. Let
and let
where . The score function of can be written as , where
and the negative Hessian matrix of can be written as
Regularity conditions, especially for the number of interior knots, are outlined here. These conditions are required for the proof of the large sample properties of our estimators.
Variables are independent and identically distributed.
The distribution of censoring time Ci satisfies for each i.
The sample path of the covariates satisfies for every , where is the th element of the covariate Zij.
The limiting matrices and are positive-definite.
The baseline rate functions are bounded away from zero and infinity on .
The second derivative of exists and satisfies Lipschitz condition of order ϵ on for for some .
The number of interior knots satisfies .
Note that Conditions (a)–(e) are regularity conditions for recurrent event processes, outlined in Cai and Schaubel.6 The smoothness condition in (f) is similar to the condition (C1) in Wang et al.11 and enables estimation of using spline functions, with Condition (g) describing the number of parameters used in the spline functions relative to the sample size.
According to Wang et al.,11 the convergence rate of the estimator of β0 in model (5) is under Conditions (c)–(g), while the convergence rate of the nonparametric estimator of is slower than . This makes that the convergence rate of the estimator for is slower than . However, we can show that the convergence rate of our estimator for the regression parameter β0 is . The following theorem describes the large sample theory of our estimator. The detailed proof is given in Appendix 1.
Under Conditions (a)–(g), the estimatoris a consistent estimator of β0 andconverges in distribution to a normal variable with mean 0 and variance Σ, which can be consistently estimated by, where
and
with
and
where .
The consistency of can be proved via consistency of and conventional convex theories. Large sample normality can be established via an approximation to by a summation of independent and identical random vectors, as shown in Appendix 1. Note that the variation of is larger than that for which solves equation (2) assuming that there are no missing data, since the estimation for creates additional uncertainty when the event type is missing. This additional variation can be seen in the second term of . Empirical studies show that the efficiency loss compared to the estimator with no missing data may be rather small.
Let and
The following theorem describes the limiting properties of the baseline mean function estimator for in model (1).
Under the same conditions of Theorem 1, the baseline mean function estimatoris uniformly consistent for, andconverges weakly to a Gaussian process with mean 0 and covariance function, which can be consistently estimated by
where
The proof begins by decomposing as , where and . The uniform consistency of can be proved by showing that both and converge to 0. The uniform convergence of can be proved using a law of large numbers for empirical processes and uniform convergence of to . The uniform convergence of involves some additional assumptions. The details of the proof are provided in Appendix 1. The proof of weak convergence, which establishes tightness and convergence to finite-dimensional distributions, follows the standard tools in Pollard14 and van der Vaart and Wellner;15 see Appendix 1 for details.
4 Simulation study
In this section, we demonstrate the feasibility of our proposed method via comprehensive simulations. We first examine a scenario when the proportional baseline rate model (9) holds; therefore, the general model (1) also holds. We then investigate a scenario when the baseline rate functions are nonproportional, in that model (1) holds, but not model (9). For subject i, two types of recurrent events were simulated from two intensity functions sharing the same latent variable Gi, which was sampled from Gamma with and var. In the first scenario, the intensity functions are assumed and with constants r01 and r02, while, in the second scenario, the intensity functions are and , where . We let α = 0, 0.5, or 1 for different dependencies between two types of recurrent events, where α = 0 indicates two types of recurrent events are independent. We set and in the first scenario and in the second scenario. We let or and . The covariate was randomly drawn from a Bernoulli distribution with probability 0.5. The censoring time was generated uniformly between 0 and 5 for each subject. In summary, there are 1.2–1.4 total number of events on average in these simulation scenarios, with a 1:2 or 1:3 ratio of type 1 events to type 2 events. The maximum number of events in a subject ranges from 6 to 11 events on average among the scenarios.
One can show that the rate functions can be expressed as , where and in the first scenario, and and in the second scenario. Note that the weighted estimating equations method in Schaubel and Cai5 is unbiased in the first scenario if one uses as the covariate in the logistic regression model for . However, it is quite evident that the model is misspecified if one uses the same model in the second scenario.
We assumed that the probability of having a missing category was given by
where . We let , with , and , where indicated that the missingness depends on covariates and the missingness assumption is MAR. Various values of ϵ0 were given to control the percentages of events with missing categories, denoted by . Here, we assume that the missingness does not depend on the event type, i.e., and .
Table 1 shows the simulation results for β1 under 1,000 repetitions of sample size n =200. We present the results of our proposed estimator and the weighted estimating equations method , where was used as a covariate in the logistic regression model for to derive . We also present the results of the estimator assuming that there is no missing event category. This approach is generally not feasible in practice but provides the best possible results in the ideal situation. Note that our proposed estimator is obtained under a more general model (1). Later, we will show that our estimator endures little efficiency loss even when the underlying model has proportional baseline rates. We report the average of biases in our replicated estimates, empirical standard deviation σ1, and the relative mean-squared error to our proposed method, denoted by , where . The result shows that our estimator provides comparable estimates when the weighted estimating equations method correctly specifies the model in the first scenario. The efficiency loss is minimal, as the relative mean-squared errors are all close to 1. When the model is misspecified by the weighted estimating equations method in the second scenario, the estimator has a relatively larger variation, especially when more events with missing type are present in the data. Our estimator on the other hand provides a more robust approach with significant efficiency improvements. Overall, our proposed estimator does not lose much efficiency compared to the ideal solution, while outperforming the current existing parametric estimator.
Simulation results are reported based on scenario 1, where data were generated from model (9), and scenario 2, where data were generated from model (1).
σ1
Ratio
Scenario
β1
α
ϵz
(%)
1
0.69
0.5
0
10
0.008
0.009
0.010
0.230
0.233
0.233
0.97
1.00
20
0.011
0.011
0.237
0.236
0.95
0.99
30
0.010
0.011
0.242
0.241
0.90
0.99
0.69
10
0.010
0.010
0.234
0.234
0.97
1.00
20
0.011
0.011
0.238
0.237
0.93
0.99
30
0.011
0.011
0.241
0.240
0.91
0.99
1.0
0
10
0.010
0.011
0.011
0.257
0.260
0.260
0.98
1.00
20
0.012
0.012
0.260
0.260
0.98
1.00
30
0.012
0.012
0.263
0.262
0.96
0.99
0.69
10
0.012
0.012
0.260
0.259
0.98
1.00
20
0.012
0.012
0.261
0.260
0.97
0.99
30
0.013
0.013
0.261
0.260
0.97
0.99
2
0
0
0
20
–0.001
0.001
–0.003
0.249
0.267
0.279
0.87
1.09
30
0.003
–0.007
0.280
0.300
0.80
1.15
40
0.006
–0.005
0.294
0.322
0.72
1.20
0.69
20
0.001
–0.005
0.265
0.280
0.88
1.11
30
0.005
–0.005
0.278
0.297
0.81
1.14
40
–0.006
–0.008
0.297
0.323
0.71
1.19
0.5
0
20
–0.010
–0.015
–0.018
0.287
0.307
0.314
0.88
1.05
30
–0.013
–0.020
0.321
0.339
0.80
1.12
40
–0.013
–0.027
0.325
0.346
0.78
1.14
0.69
20
–0.017
–0.021
0.307
0.316
0.87
1.06
30
–0.015
–0.024
0.315
0.330
0.83
1.10
40
–0.016
–0.022
0.333
0.357
0.75
1.15
3
0.69
0
0
20
0.002
0.008
0.007
0.222
0.238
0.245
0.87
1.06
30
0.006
0.007
0.251
0.263
0.78
1.10
40
0.005
0.006
0.266
0.281
0.70
1.12
0.69
20
0.005
0.005
0.234
0.244
0.90
1.08
30
0.007
0.008
0.247
0.258
0.81
1.09
40
0.002
0.007
0.257
0.270
0.75
1.11
0.5
0
20
–0.001
–0.003
–0.003
0.248
0.264
0.273
0.89
1.07
30
–0.004
0.000
0.270
0.287
0.85
1.13
40
–0.004
–0.001
0.283
0.302
0.77
1.14
0.69
20
–0.003
–0.004
0.259
0.269
0.92
1.07
30
–0.002
0.000
0.270
0.282
0.85
1.10
40
–0.004
–0.004
0.274
0.287
0.82
1.10
Table 2 demonstrates that the distribution of our estimator can be well approximated by a normal distribution in finite samples. We show the results based on model (1) in the second scenario when n =200 and 400. The results based on model (9) in the first scenario are similar; hence omitted here. As one can see, the average of our standard error estimates, denoted by and , is close to the empirical standard deviation, σ1 and σ2, respectively, and the coverage rate for β1 based on the 95% confidence interval is close to the nominal level. Meanwhile, the size of the Wald-type test for is close to the given significance level at 0.05.
Simulation results of parameter estimations from our proposed method for model (1).
α
ϵz
(%)
n
Bias
σ1
Bias
σ2
(0,0)
0
0
20
200
0.001
0.267
0.272
0.961
–0.006
0.149
0.151
0.035
400
–0.002
0.185
0.191
0.959
–0.005
0.105
0.106
0.041
30
200
0.003
0.280
0.287
0.961
–0.007
0.152
0.154
0.040
400
–0.002
0.190
0.199
0.960
–0.005
0.107
0.108
0.045
40
200
0.006
0.294
0.294
0.957
–0.007
0.153
0.155
0.038
400
0.003
0.198
0.207
0.961
–0.006
0.109
0.109
0.049
0.69
20
200
0.001
0.265
0.272
0.967
–0.006
0.149
0.151
0.041
400
–0.003
0.184
0.191
0.964
–0.004
0.106
0.106
0.048
30
200
0.005
0.278
0.282
0.961
–0.008
0.150
0.153
0.037
400
–0.001
0.190
0.198
0.955
–0.005
0.107
0.108
0.046
40
200
–0.006
0.297
0.295
0.950
–0.005
0.153
0.155
0.041
400
–0.006
0.200
0.207
0.958
–0.004
0.109
0.110
0.050
0.5
0
20
200
–0.015
0.307
0.297
0.948
0.000
0.195
0.188
0.067
400
–0.004
0.212
0.207
0.944
–0.004
0.134
0.133
0.060
30
200
–0.013
0.321
0.306
0.947
0.000
0.196
0.190
0.062
400
–0.004
0.222
0.215
0.941
–0.004
0.135
0.134
0.057
40
200
–0.013
0.325
0.317
0.951
–0.001
0.199
0.192
0.056
400
–0.005
0.228
0.222
0.950
–0.004
0.136
0.136
0.057
0.69
20
200
–0.017
0.307
0.297
0.946
0.000
0.194
0.188
0.062
400
–0.005
0.213
0.207
0.948
–0.004
0.133
0.133
0.056
30
200
–0.015
0.315
0.306
0.952
–0.001
0.197
0.190
0.064
400
–0.003
0.220
0.214
0.945
–0.005
0.135
0.134
0.056
40
200
–0.016
0.333
0.318
0.950
–0.001
0.198
0.192
0.064
400
–0.010
0.234
0.223
0.950
–0.003
0.138
0.136
0.059
(0.69,0)
0
0
20
200
0.008
0.238
0.234
0.958
–0.006
0.158
0.152
0.060
400
0.002
0.163
0.164
0.948
–0.001
0.110
0.107
0.063
30
200
0.006
0.251
0.246
0.957
–0.005
0.164
0.157
0.065
400
0.003
0.170
0.170
0.952
–0.001
0.111
0.109
0.053
40
200
0.005
0.266
0.251
0.947
–0.004
0.167
0.158
0.068
400
0.004
0.177
0.177
0.950
–0.002
0.113
0.111
0.059
0.69
20
200
0.005
0.234
0.232
0.955
–0.006
0.159
0.152
0.066
400
0.000
0.160
0.163
0.954
–0.001
0.110
0.107
0.061
30
200
0.007
0.247
0.239
0.948
–0.007
0.163
0.155
0.064
400
0.002
0.168
0.168
0.950
–0.002
0.112
0.109
0.056
40
200
0.002
0.257
0.248
0.954
–0.005
0.169
0.159
0.067
400
0.002
0.173
0.174
0.956
–0.001
0.115
0.112
0.060
0.5
0
20
200
–0.003
0.264
0.260
0.952
0.004
0.201
0.189
0.071
400
–0.004
0.178
0.183
0.956
0.001
0.136
0.133
0.058
30
200
–0.004
0.270
0.268
0.950
0.004
0.203
0.191
0.068
400
–0.006
0.183
0.189
0.961
0.001
0.138
0.135
0.055
40
200
–0.004
0.283
0.277
0.946
0.004
0.204
0.193
0.064
400
–0.005
0.191
0.195
0.960
0.001
0.140
0.137
0.049
0.69
20
200
–0.003
0.259
0.259
0.952
0.004
0.201
0.189
0.070
400
–0.005
0.176
0.182
0.964
0.001
0.137
0.134
0.060
30
200
–0.002
0.270
0.265
0.955
0.003
0.201
0.191
0.068
400
–0.005
0.180
0.187
0.962
0.000
0.138
0.135
0.066
40
200
–0.004
0.274
0.272
0.952
0.003
0.205
0.193
0.063
400
–0.004
0.184
0.191
0.962
–0.001
0.140
0.137
0.055
As seen in Tables 1 and 2, our estimator is robust to the MAR assumption when . Both point and variance estimation are consistent. In fact, the efficiency is slightly better compared to the estimator when . We also set different values of κ1 to examine the performance of our estimation method under the missingness not at random assumption. However, since the simulation results are similar, we do not report the results here.
5 CF registry data
CF is one of the most common life-shortening, autosomal recessive genetic disorders, affecting about 30,000 individuals in the United States.16 It is caused by mutations in the gene encoding the CF transmembrane conductance regulator.17 Chronic lung infection and associated inflammation lead to significant morbidity in CF, with respiratory failure the leading cause of mortality. PA, one of the major virulent pathogens in CF patients, is a well-known risk factor for CF lung disease progression and survival. Several baseline risk factors for PA acquisition were examined in Lai et al.18 Meconium ileus, late CF diagnosis through signs and symptoms, severe CF genotypes, and female gender are associated with a higher risk of acquiring PA. However, most of the PA cases examined in the study were initial infections, which may be transient and less predictive of negative outcomes. On the other hand, mucoid PA, which is thought to develop after recurrent infections, is likely more critical to a patient’s lung disease progression.19 Therefore, regression modeling for different PA types, i.e., mucoid and nonmucoid, is important, since the baseline risk factors may differentially impact different infection types in various manners.
In this section, we extended the analysis in Lai et al.18 to multiple event types using the United States 2016 CF Foundation Patient Registry (CFFPR), in which baseline characteristics, such as genotype, phenotype, and other prognosis factors, are recorded upon enrollment. The CFFPR documents the diagnosis and follow-up of 29,887 individuals with CF in the registry. We aim to model the nonmucoid and mucoid PA occurrence rates in association with three baseline risk factors, which include (1) gender, (2) genotype, categorized based on the most common mutation: F508del homozygous, F508del heterozygous, and neither or unknown, and (3) method of diagnosis, categorized in four groups described elsewhere:18 newborn screening, meconium ileus, family history without symptom, and symptom and sign. Medication use for chronic PA infections and study site/center is included in the model for adjustment of possible confounding effects.
In the 2016 registry, we identified 14,888 patients who were born after 1997 and had complete baseline risk data in 188 accredited CF centers. In summary, half of these patients were male, 47% were F508del homozygous, 39% were F508del heterozygous, 47% were diagnosed by newborn screening, 31% were diagnosed by emerging symptoms and signs, and 19% and 3% were diagnosed by meconium ileus and family history, respectively. In the follow-up visits, there were 27,288 nonmucoid and 6,323 mucoid PA infections, in addition to 5,445 culture positives for both nonmucoid and mucoid types at the same visit. Since our method assumes two kinds of events cannot occur simultaneously, we treated those visits with both infections as the third type of recurrent event. Meanwhile, there were 5,392 culture positives in PA but with unknown status, which is in a high frequency of missing event type (12% of the total events). Addressing the missing information is highly desirable.
Table 3 shows the estimation results by the complete-case analysis and our proposed rate proportion method, assuming that the missingness does not depend on the event category, i.e., . We used AIC to choose the number of interior knots in the B-spline function in model (5). Up to five interior knots were examined, the minimum value of AIC was achieved by using a cubic function without any interior knots. We report rate ratio, , and its 95% confidence interval (95% CI) with Wald-type test p-value with respect to a reference group. We also report the p-value, indicated with a dagger, for the overall comparison among levels in genotype and diagnostic method.
Summary table for the complete-case analysis and rate proportion method.
Refer to the overall comparison among levels in genotype and diagnostic method.
As one can see, in the nonmucoid PA acquisition, both methods agree that female gender and heterozygous F508del genotype are significantly associated with the infection rate. While female has around 8% higher infection rate than male, patients with heterozygous F508del genotype have 11% lower infection rate than those with homozygous F508del genotype. However, the two methods have different test results in neither F508del nor unknown genotype and diagnosed by meconium ileus. While both methods have similar rate ratios, our proposed method, with more data points involved, has a narrower confidence interval and significant test result, which leads one to conclude that patients with a mild genotype (heterozygous F508del, neither F508del genotype, or unknown) have 10% lower nonmucoid PA infection rate than those with a more severe genotype (homozygous F508del), and patients diagnosed by with meconium ileus have 10% higher infection rate than those diagnosed by newborn screening. The finding is consistent with Lai et al.18 and other scientific reports. The disparity between the complete-case analysis and our method can be considered as an evidence against the missingness completely at random assumption. In fact, based on a multiple logistic regression model for the likelihood of missingness, the probability of missing event type is significantly associated with age, gender, genotype, and frequency of previous events. The results demonstrate that missing event type is more likely to occur in younger patients, in female patients, in patients with mild genotypes, and in patients with more prior PA infections.
Two methods otherwise have similar results in the mucoid PA acquisition and in both the types of infection in the same culture. One difference is that patients diagnosed by family history without symptoms are statistically significant in the mucoid PA infection using our proposed method, but not significant using the complete-case analysis.
The baseline mean function estimation for the three types of infections by our proposed estimator is shown in Figure 1. The testing for the proportionality of the baseline rate functions based on the Wald-type test is significant with p-value <0.0001, suggesting that model (9) is not a good fit to our data. Hence, we only report the results for the model (1). We also assess the assumption of equal probability of missingness by assuming that the log-ratio of the missingness probability in model (5) is possibly non-zero. Here, we implemented different values of κ1 and κ2, ranging from –1.5 to 1.5, to explore the sensitivity of the parameter estimation when the missingness is not at random. The result in the supplementary material shows that our estimation is quite robust to the violation of the MAR assumption, as the changes in the point and variance estimates of the regression coefficients are minimal, even when κ1 and κ2 are large.
The baseline mean function estimation by our proposed method is shown in log-scale for nonmucoid (solid), mucoid (dash), and both (dot) PA infections.
6 Conclusion and discussion
It is worth noting that the same proportionality property exploited by our method was also discussed in an intensity-based recurrent event model20 and in competing risk models with missing or uncertain cause of failure.21–24 A semiparametric framework for the estimation of otherwise has never been explored. It is widely anticipated that the semiparametric estimation will be more robust if the underlying unknown function is indeed time-dependent, which is quite likely in practice with time to event data.
The estimation procedure in this paper is tailored for the proportional rates model. It may not be feasible for a nonproportional rates model since the ratio of the rate functions may not be log-linearly correlated with covariates. It would be of interest to develop a more general approach for different types of rate models. One possibility is to derive the probability of missingness and then inversely weigh the estimating equations for unbiased estimation. Along these lines, one may also utilize the nonparametric estimation for the rate function to construct a doubly robust estimator, providing additional protection against the model misspecification.
Supplemental Material
sj-pdf-1-smm-10.1177_09622802211023975 - Supplemental material for Semiparametric estimation of the proportional rates model for recurrent events data with missing event category
Supplemental material, sj-pdf-1-smm-10.1177_09622802211023975 for Semiparametric estimation of the proportional rates model for recurrent events data with missing event category by Feng-Chang Lin, Jianwen Cai, Jason P Fine, Elisabeth P Dellon and Charles R Esther in Statistical Methods in Medical Research
Footnotes
Acknowledgements
The authors would like to thank the Cystic Fibrosis Foundation for the use of CFFPR data to conduct this study. Additionally,the authors would like to thank the patients,care providers,and clinic coordinators at CF Centers throughout the United States for their contributions to the CFFPR. The authors also like to thank Dr Huichuan Lai for her thoughtful comments on an early draft of the paper.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: National Center for Advancing Translational Sciences (NCATS),National Institutes of Health,through Grant Award Number UL1TR002489,and by the National Cancer Institute (NCI) through Grant Award Number P01CA142538.
ORCID iD
Feng-Chang Lin
References
1.
HenryRLMellisCMPetrovicL.Mucoid Pseudomonas aeruginosa is a marker of poor survival in cystic fibrosis. Pediatr Pulmonol1992;
12: 158–161.
2.
KonstanMWMorganWJButlerSM, et al.
Risk factors for rate of decline in forced expiratory volume in one second in children and adolescents with cystic fibrosis. J Pediatr2007;
151: 134–139.
3.
LiZKosorokMRFarrellPM, et al.
Longitudinal development of mucoid Pseudomonas aeruginosa infection and lung disease progression in children with cystic fibrosis. JAMA2005;
293: 581–588.
4.
ChenXWangQCaiJ, et al.
Semiparametric additive marginal regression models for multiple type recurrent events. Lifetime Data Anal2012;
18: 504–527.
5.
SchaubelDECaiJ.Rate/mean regression for multiple-sequence recurrent event data with missing event category. Scand J Stat2006;
33: 191–207.
6.
CaiJSchaubelD.Marginal means/rates models for multiple type recurrent event data. Lifetime Data Anal2004;
10: 121–138.
7.
YePZhaoXSunL, et al.
A semiparametric additive rates model for multivariate recurrent events with missing event categories. Comput Stat Data Anal2015;
89: 39–50.
8.
LinFCCaiJFineJP, et al.
Nonparametric estimation of the mean function for recurrent event data with missing event category. Biometrika2013;
100: 727–740.
9.
CarrollRJFanJGijbelsI, et al.
Generalized partially linear single-index models. J Am Stat Assoc1997;
92: 477–489.
WangLLiuXLiangH, et al.
Estimation and variable selection for generalized additive partial linear models. Ann Stat2011;
39: 1827–1851.
12.
LittleRJARubinDB.Statistical analysis with missing data.
New York, NY:
Wiley, 2002.
13.
HastieTTibshiraniRFriedmanJ.The elements of statistical learning: data mining, inference, and prediction. 2nd ed.
New York, NY:
Springer, 2009.
14.
PollardD.Empirical processes: theory and applications.
Hayward, CA:
Institute of Mathematical Statistics, 1990.
15.
van der VaartAWWellnerJA.Weak convergence and empirical processes.
New York, NY:
Springer, 1996.
16.
Cystic Fibrosis Foundation. CF Foundation Patient Registry annual data report, 2017.
17.
RiordanJRRommensJMKeremBS, et al.Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science1989;
245: 1066–1073.
18.
LaiHJChengYChoH, et al.
Association between initial disease presentation, lung disease outcomes, and survival in patients with cystic fibrosis. Am J Epidemiol2004;
159: 537–546.
19.
FolkessonAJelsbakLYangL, et al.
Adaptation of Pseudomonas aeruginosa to the cystic fibrosis airway: an evolutionary perspective. Nat Rev Microbiol2012;
10: 841–851.
20.
ChenBECookRJ.The analysis of multivariate recurrent events with partially missing event types. Stat Med2009;
24: 671–691.
21.
CraiuRVDuchesneT.Inference based on the EM algorithm for the competing risks model with masked causes of failure. Biometrika2004;
91: 543–558.
22.
LuKTsiatisAA.Multiple imputation methods for estimating regression coefficients in the competing risks model with missing cause of failure. Biometrics2001;
57: 1191–1197.
23.
LuWLiangY.Analysis of competing risks data with missing cause of failure under additive hazards model. Stat Sin2008;
18: 219–234.
24.
Racine-PoonAHHoelDG.Nonparametric estimation of the survival function when cause of death is uncertain. Biometrics1984;
40: 1151–1158.
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.