Sage Journals: Discover world-class research

Abstract

Proportional rates models are frequently used for the analysis of recurrent event data with multiple event categories. When some of the event categories are missing, a conventional approach is to either exclude the missing data for a complete-case analysis or employ a parametric model for the missing event type. It is well known that the complete-case analysis is inconsistent when the missingness depends on covariates, and the parametric approach may incur bias when the model is misspecified. In this paper, we aim to provide a more robust approach using a rate proportion method for the imputation of missing event types. We show that the log-odds of the event type can be written as a semiparametric generalized linear model, facilitating a theoretically justified estimation framework. Comprehensive simulation studies were conducted demonstrating the improved performance of the semiparametric method over parametric procedures. Multiple types of Pseudomonas aeruginosa infections of young cystic fibrosis patients were analyzed to demonstrate the feasibility of our proposed approach.

Keywords

Cystic fibrosis generalized partially linear model polynomial spline rate proportion weighted estimating equation

1 Introduction

Recurrent event data with multiple categories frequently arise in medical science and population health studies. Different causes of hospitalizations, multiple strains of bacteria infections, and various types of treatment failures all belong to such data type. Taking cystic fibrosis (CF) as an example, recurrent Pseudomonas aeruginosa (PA) infections are commonly observed in patients with CF. PA infection includes mucoid and nonmucoid strains. Without appropriate treatment, the recurrent infections with mucoid strains often become persistent and chronic, causing increased CF mortality and morbidity.^1–3 As another example, patients who received renal transplants may have different types of recurrent infections.⁴ End-stage renal disease patients who received continuous ambulatory peritoneal dialysis may have multiple types of treatment failures that make the patient switch to other dialysis methods.⁵

When modeling this kind of recurrent event data, a proportional rates model that is conditional only on the current value of covariates is commonly used.⁶ Let $N_{i j}^{*} (t)$ denote the number of recurrent events up to time t for subject i ( $i = 1, \dots, n$ ) and event category j ( $j = 1, \dots, J$ ). Let $Z_{i j} (t)$ denote the column vector of covariates which are possibly time-varying. A proportional rates model proposed by Cai and Schaubel⁶ is defined by $E {d N_{i j}^{*} (t) | Z_{i j} (t)} = \exp {β_{0}^{T} Z_{i j} (t)} d μ_{0 j} (t)$ (1)where β₀ is the regression coefficient and $d μ_{0 j} (t)$ is the baseline rate function for the jth type of the recurrent event. Although β₀ is not indexed by event category j, the model is flexible enough to accommodate covariate effects that are specific to the event type in each individual model. For example, if there are two types of recurrent events and one uses q₁- and q₂-column vector of covariates, ${\tilde{Z}}_{i 1}$ and ${\tilde{Z}}_{i 2}$ , for the first and second type of recurrent events, respectively, one can define $Z_{i 1} = {({\tilde{Z}}_{i 1}^{T}, 0_{q_{2}}^{T})}^{T}$ and $Z_{i 2} = {(0_{q_{1}}^{T}, {\tilde{Z}}_{i 2}^{T})}^{T}$ to specify the individual models, where $0_{q_{j}}$ is a q_j-column vector of zeros. Accordingly, one may define $β_{0} = {(β_{1}^{T}, β_{2}^{T})}^{T}$ , where $β_{j} = {(β_{j 1}, \dots, β_{j q_{j}})}^{T}$ for j = 1, 2.

Let $Y_{i} (t) = I (C_{i} \geq t)$ indicate whether subject i with censoring time C_i is under observation at time t, where $t \in [0, τ]$ , and τ is the end of follow-up time. Let $N_{i j} (t) = Y_{i} (t) N_{i j}^{*} (t)$ denote the observed number of events up to time t. When the event category is fully observed, Cai and Schaubel⁶ showed that the coefficient β₀ in (1) could be consistently estimated by the estimating equations $U^{n} (β) = \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {Z_{i j} (t) - {\bar{Z}}_{j} (t; β)} d N_{i j} (t) = 0$ (2)where ${\bar{Z}}_{j} (t; β) = S_{j}^{(1)} (t; β) / S_{j}^{(0)} (t; β)$ with $S_{j}^{(d)} (t; β) = n^{- 1} \sum_{i = 1}^{n} Y_{i} (t) Z_{i j} {(t)}^{\otimes d} \exp {β^{T} Z_{i j} (t)}$ for d = 0, 1, where $a^{\otimes 0} = 1, a^{\otimes 1} = a$ , and $a^{\otimes 2} = a a^{T}$ for a column vector a.

However, when the event category is possibly missing, the estimating equation (2) is not feasible since the quantity $d N_{i j} (t)$ is not always observable. A naive approach, which uses completely observed data that include only events with known type, can be valid if the event category is missing completely at random, but may give biased results if the missingness depends on the covariates. Schaubel and Cai⁵ suggested that one rewrite $d N_{i j} (t)$ as $d N_{i j} (t) = R_{i} (t) d N_{i j} (t) + {1 - R_{i} (t)} δ_{i j} (t) d N_{i \cdot} (t)$ (3)where $R_{i} (t)$ indicates whether the event category is observed, $δ_{i j} (t)$ indicates whether the event category is type j, and $d N_{i \cdot} (t) = \sum_{j = 1}^{J} d N_{i j} (t)$ indicates the total number of events at time t. Note that $d N_{i \cdot} (t)$ equals 0 or 1 since they assume events with different types do not occur simultaneously. They further suggested that one replace the unknown quantity $δ_{i j} (t)$ with a consistent estimator for $p_{i j} (t)$ , where $p_{i j} (t) = E {δ_{i j} (t) | d N_{i \cdot} (t) = 1, Z_{i j} (t)}$

One may consider parametric, multinomial logit models for estimation of $p_{i j} (t)$ .^5,7 However, the association between the covariates and δ_ij may not be correctly specified, which may lead to inconsistent estimation. This motivates us to develop a more robust method that weakens the impact of model misspecification.

In this paper, we extend the proportional rates method previously developed for estimation of $p_{i j} (t)$ ⁸ using a semiparametric approach that exploits a special form of the rate proportion of event type j to the overall rate function. Interestingly, under the proportional rates model (1), the ratio of two rate proportions can be expressed as $\log {p_{i j} (t) / p_{i J} (t)} = β_{0}^{T} X_{i j} (t) + η_{0 j} (t)$ (4)where $X_{i j} (t) = Z_{i j} (t) - Z_{i J} (t)$ and $η_{0 j} (t) = \log {d μ_{0 j} (t) / d μ_{0 J} (t)}$ . In fact, model (4) can be viewed as a generalized partially linear model. Under certain regularity conditions, one can estimate β₀ and $η_{0 j}$ simultaneously via semiparametric regression techniques such as local polynomials,⁹ generalized additive models,¹⁰ and polynomial spline functions.¹¹

When there is no missing data in the event type, one can estimate $p_{i j} (t)$ using all of the data based on model (4). With missing event types, however, one can only estimate $p_{i j}^{c} (t)$ using completely observed data, where $p_{i j}^{c} (t) = E {δ_{i j} (t) | d N_{i \cdot} (t) = 1, R_{i} (t) = 1, Z_{i j} (t)}$

Letting $π_{i j} (t | Z_{i j}) = E {R_{i} (t) | d N_{i j} (t) = 1, Z_{i j} (t)}$ denote the probability of non-missingness given that the event type j occurs, one can show $\log {p_{i j}^{c} (t) / p_{i J}^{c} (t)} = β_{0}^{T} X_{i j} (t) + η_{0 j} (t) + κ_{j} (t | Z_{i j})$ (5)where $κ_{j} (t | Z_{i j}) = \log {π_{i j} (t | Z_{i j}) / π_{i J} (t | Z_{i J})}$ is the log-ratio of non-missingness for two event types. Since model (5) is time-varying, estimation is complicated. A common assumption in the previous literature is that $π_{i j} (t | Z_{i j})$ does not depend on j and $κ_{j} (t | Z_{i j}) = 0$ for each j.^5,7,8 This assumption corresponds to missing at random (MAR) assumption when the missingness does not depend on unobserved information.¹² In this paper, we adopt the same assumption and assume $κ_{j} (t | Z_{i j}) = 0$ .

Theoretical challenge remains when the semiparametric estimator of $p_{i j} (t)$ is substituted for the unknown $δ_{i j} (t)$ in the estimating equations for β. Specifically, it is not clear if one can still obtain a $n^{1 / 2}$ convergence rate in the estimation of β since the convergence rate of $p_{i j} (t)$ estimation is generally slower than $n^{1 / 2}$ with a semiparametric approach. We will show that, under mild regularity conditions, our estimator of β converges at a $n^{1 / 2}$ rate to a normal distribution with variance that may be consistently estimated using a simple plug-in formula.

The remaining sections are organized as follows. In Section 2, we exploit a cubic B-spline function for the estimation of $p_{i j} (t)$ and propose general estimating equations for the regression coefficient β₀ and baseline mean function $μ_{0 j} (t)$ in model (1). Consistency and large sample normality of the estimators are shown in Section 3. Finite-sample performances evaluated by comprehensive simulation experiments are studied in Section 4. A real-data analysis on multiple types of PA infections in the United States 2016 Cystic Fibrosis Foundation Patient Registry is presented in Section 5. Conclusions and discussions on future research are presented in Section 6.

2 Estimation method

Assume that $η_{0 j} (t)$ can be approximated by a cubic B-spline function ${\tilde{η}}_{0 j} (t; ξ_{j}) = ξ_{j 0} + \sum_{k = 1}^{m + 3} ξ_{j k} b_{k} (t)$ where $b_{k} (t)$ $(k = 1, \dots, m + 3)$ are basis functions, m is the number of interior knots, and $ξ_{j} = {(ξ_{j 0}, \dots, ξ_{j (m + 3)})}^{T}$ is a vector of spline coefficients. Let $θ_{0} = {(β_{0}^{T}, ξ_{0}^{T})}^{T}$ , where $ξ_{0} = {(ξ_{1}^{T}, \dots, ξ_{J - 1}^{T})}^{T}$ . One can estimate θ₀ by maximizing an approximate log-likelihood function $ℓ (θ) = \sum_{i = 1}^{n} \int_{0}^{τ} ℓ_{i} (t; β, ξ) R_{i} (t) d N_{i \cdot} (t)$ (6)where $ℓ_{i} (t; β, ξ) = \sum_{j = 1}^{J} δ_{i j} (t) m_{i j}^{c} (t) - \log [\sum_{j = 1}^{J} \exp {m_{i j}^{c} (t)}]$ with $m_{i j}^{c} (t; θ) = β^{T} X_{i j} (t) + ξ^{T} B_{j} (t)$ , where $B_{j} (t)$ is a $(m + 4) \times (J - 1)$ column vector with $b (t) = {(1, b_{1} (t), \dots, b_{m + 3} (t))}^{T}$ in the jth block for $j = 1, \dots, J - 1$ , and $B_{J} (t) = 0$ for all t. The number of interior knots m can be selected by Akaike information criteria (AIC) that minimizes $- 2 ℓ (θ)$ plus two times the number of parameters in θ. However, other approaches such as generalized cross-validation that approximates the leave-one-out cross-validation may also be considered.¹³

Letting $\tilde{θ} = {({\tilde{β}}^{T}, {\tilde{ξ}}^{T})}^{T}$ denote the maximizer of (6), one can estimate $p_{i j} (t)$ by $p_{i j} (t; \tilde{θ}) = \frac{\exp {{\tilde{β}}^{T} X_{i j} (t) + {\tilde{ξ}}^{T} B_{j} (t)}}{\sum_{ℓ = 1}^{J} \exp {{\tilde{β}}^{T} X_{i ℓ} (t) + {\tilde{ξ}}^{T} B_{ℓ} (t)}}$

By solving the estimating equations $U^{r} (β) = \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {Z_{i j} (t) - {\bar{Z}}_{j} (t; β)} d N_{i j}^{r} (t; \tilde{θ}) = 0$ (7)where $d N_{i j}^{r} (t; θ) = R_{i} (t) d N_{i j} (t) + {1 - R_{i} (t)} p_{i j} (t; θ) d N_{i \cdot} (t)$ , one can obtain our proposed estimator ${\hat{β}}^{r}$ for β₀. With β replaced by ${\hat{β}}^{r}$ in the estimating equation $\sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} [d N_{i j} (t) - Y_{i} (t) \exp {β^{T} Z_{i j} (t)} d μ_{0 j} (t)] = 0$ one can obtain an empirical estimator ${\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ})$ for the baseline mean function $μ_{0 j} (t)$ , where ${\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ}) = n^{- 1} \sum_{i = 1}^{n} \int_{0}^{t} S_{j}^{(0)} {(t; {\hat{β}}^{r})}^{- 1} d N_{i j}^{r} (t; \tilde{θ})$ (8)

Note that, although β is denoted the same in the log-rate ratio model (4) and proportional rates model (1), the β in the model (4) may not be fully identifiable. The covariate with β in model (4) is $X_{i j} (t) = Z_{i j} (t) - Z_{i J} (t)$ , which is the difference between $Z_{i j} (t)$ and $Z_{i J} (t)$ . If a covariate, for example, age is included and has the common effect on the rate function for all event categories, then the corresponding X_ij equals 0, and consequently, the corresponding component in β is not identifiable. Another situation is when a covariate is included in rate model for all event categories, but the effects are different for different event category. In this situation, what is estimable in β in model (4) is the difference between the effects of that covariate for different categories and not the effects themselves. Taking J = 2 for example, one can write $Z_{i 1} = {({\tilde{Z}}_{i}, 0)}^{T}$ and $Z_{i 2} = {(0, {\tilde{Z}}_{i})}^{T}$ , where ${\tilde{Z}}_{i}$ is the covariate in the rates models for both event categories, for example, gender, and $β = {(β_{1}, β_{2})}^{T}$ , where β_j is the effect of ${\tilde{Z}}_{i}$ on rate function for event category j for j = 1, 2. The difference between $Z_{i 1}$ and $Z_{i 2}$ is $X_{i 1} (t) = {({\tilde{Z}}_{i}, - {\tilde{Z}}_{i})}^{T}$ , and one can write $β^{T} X_{i 1} (t) = (β_{1} - β_{2}) {\tilde{Z}}_{i}$ . From this expression, we can see that only the contrast $β_{1} - β_{2}$ can be estimated from model (4), not β₁ and β₂ individually. Including some same set of covariates in the rate models for some event categories is common in practice. Therefore, it is not feasible to use the log-likelihood function (6) to estimate β in general. However, even though β in model (4) is not identifiable, our proposed method can still work because $p_{i j} (t)$ can still be consistently estimated using function (6).

Also note that the proportional rates model (1) with a baseline rate function specific for each event type is quite general. One may restrict the model with the baseline rate function to be proportional to a reference category J, meaning $d μ_{0 j} (t) = γ_{0 j} d μ_{0} (t)$ , where $d μ_{0} (t)$ is the baseline rate function of the reference category. The model can be written as $E {d N_{i j}^{*} (t) | Z_{i j} (t), △_{j}} = \exp {β_{0}^{T} Z_{i j} (t) + γ_{0}^{T} △_{j}} d μ_{0} (t)$ (9)where $△_{j}$ is a column vector of 1 in the jth element and 0 otherwise with $γ_{0} = {(γ_{01}, \dots, γ_{0 (J - 1)})}^{T}$ as the corresponding coefficient. One can show that the formula (5) becomes $\log {p_{i j}^{c} (t) / p_{i J}^{c} (t)} = β_{0}^{T} X_{i j} (t) + γ_{0 j} + κ_{j} (t | Z_{i j})$ (10)and estimate β₀ and $γ_{0 j}$ using completely observed data via parametric estimating equations $\sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} X_{i j} (t) {δ_{i j} (t) - p_{i j}^{c} (t; β, γ)} R_{i} (t) d N_{i \cdot} (t) = 0$ (11)where $p_{i j}^{c} (t; β, γ) = \frac{\exp {β^{T} X_{i j} (t) + γ_{0 j}}}{\sum_{ℓ = 1}^{J} \exp {β^{T} X_{i ℓ} (t) + γ_{0 ℓ}}}$ assuming $κ_{j} (t | Z_{i j}) = 0$ for $j = 1, \dots, J$ .

With ${\hat{β}}^{c}$ and ${\hat{γ}}^{c}$ solving equation (11), one can obtain a more efficient estimator for β₀ and γ₀ using all of the events by replacing the unknown quantity $δ_{i j} (t)$ with $p_{i j}^{c} (t; {\hat{β}}^{c}, {\hat{γ}}^{c})$ . This yields the estimating equations $U^{r} (β, γ) = \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {W_{i j} (t) - \bar{W} (t; β, γ)} d N_{i j}^{r} (t; {\hat{β}}^{c}, {\hat{γ}}^{c}) = 0$ (12)where $W_{i j} (t) = {(Z_{i j}^{T} (t), △_{j}^{T})}^{T}, \bar{W} (t; β, γ) = S^{(1)} (t; β, γ) / S^{(0)} (t; β, γ)$ with $S^{(d)} (t; β, γ) = n^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{J} Y_{i} (t) W_{i j} {(t)}^{\otimes d} \exp {β^{T} Z_{i j} (t) + γ^{T} △_{j}}$

for d = 0, 1, and $d N_{i j}^{r} (t; β, γ) = R_{i} (t) d N_{i j} (t) + {1 - R_{i} (t)} p_{i j} (t; β, γ) d N_{i \cdot} (t)$

By comparing models (5) and (10), one can conduct a statistical test for the proportionality of the baseline rate functions by testing if $η_{0 j} (t)$ is constant for all t, i.e., testing the null hypothesis $H_{0} : η_{0 j} (t) = γ_{0 j}$ for $t \geq 0$ . Using our approach, the null hypothesis is equivalent to $H_{0} : ξ_{j 1} = \dots = ξ_{j (m + 3)} = 0$ for each j, which can be tested via a Wald-type test procedure. Under a more restricted model (9), the fully parametric model (10) for $p_{i j} (t)$ is somewhat different from the one proposed by Schaubel and Cai.⁵ The fully parametric model for $p_{i j} (t)$ in Schaubel and Cai⁵ includes more covariates than model (10), such as time of event occurrence t and number of previous events $N_{i \cdot} (t -)$ .

3 Asymptotic theory

Large sample properties of the semiparametric estimators ${\hat{β}}^{r}$ and ${\hat{μ}}_{0 j}^{r}$ using model (5) will be derived in this section. The developments are more challenging than those in Schaubel and Cai,⁵ which covers only the more restrictive parametric model (10). We first state our notations. Let $p_{i j}^{c} (t; θ) = \exp {m_{i j}^{c} (t; θ)} / \sum_{ℓ = 1}^{J} \exp {m_{i ℓ}^{c} (t; θ)}$ and let ${\dot{p}}_{i j}^{c} (t; θ) = \partial p_{i j}^{c} (t; θ) / \partial θ = p_{i j}^{c} (t; θ) {{\tilde{X}}_{i j} (t) - \sum_{ℓ = 1}^{J} {\tilde{X}}_{i ℓ} (t) p_{i ℓ}^{c} (t; θ)}$ where ${\tilde{X}}_{i j} (t) = {(X_{i j} {(t)}^{T}, B_{j} {(t)}^{T})}^{T}$ . The score function of $n^{- 1} ℓ (θ)$ can be written as $U (θ) = n^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{J} U_{i j} (θ)$ , where $U_{i j} (θ) = \int_{0}^{τ} {\tilde{X}}_{i j} (t) {δ_{i j} (t) - p_{i j}^{c} (t; θ)} R_{i} (t) d N_{i \cdot} (t)$ and the negative Hessian matrix of $n^{- 1} ℓ (θ)$ can be written as $H (θ) = n^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {\tilde{X}}_{i j} (t) {\dot{p}}_{i j}^{c} (t; θ) R_{i} (t) d N_{i \cdot} (t)$

Regularity conditions, especially for the number of interior knots, are outlined here. These conditions are required for the proof of the large sample properties of our estimators.

Variables ${N_{i j} (\cdot), Y_{i j} (\cdot), Z_{i j} (\cdot)}_{j = 1}^{J}$ $(i = 1, \dots, n)$ are independent and identically distributed.

The distribution of censoring time C_i satisfies $P (C_{i} \geq τ) > 0$ for each i.

The sample path of the covariates satisfies $| Z_{i j ℓ} (0) | + \int_{0}^{τ} | d Z_{i j ℓ} (t) | < c_{Z} < \infty$ for every $ℓ$ , where $Z_{i j ℓ}$ is the $ℓ$ th element of the covariate Z_ij.

The limiting matrices $Ω (β_{0})$ and $H (θ_{0})$ are positive-definite.

The baseline rate functions $d μ_{0 j} (\cdot), j = 1, \dots, J$ are bounded away from zero and infinity on $[0, τ]$ .

The second derivative of $η_{0 j} (\cdot)$ exists and satisfies Lipschitz condition of order ϵ on $[0, τ]$ for $j = 1, \dots, J$ for some $ϵ \in (0, 1]$ .

The number of interior knots satisfies $n^{1 / (4 + 2 ϵ)} < m < n^{1 / 4}$ .

Note that Conditions (a)–(e) are regularity conditions for recurrent event processes, outlined in Cai and Schaubel.⁶ The smoothness condition in (f) is similar to the condition (C1) in Wang et al.¹¹ and enables estimation of $η_{0 j} (t)$ using spline functions, with Condition (g) describing the number of parameters used in the spline functions relative to the sample size.

According to Wang et al.,¹¹ the convergence rate of the estimator of β₀ in model (5) is $n^{1 / 2}$ under Conditions (c)–(g), while the convergence rate of the nonparametric estimator of $η_{0 j} (t)$ is slower than $n^{1 / 2}$ . This makes that the convergence rate of the estimator for $p_{i j} (t)$ is slower than $n^{1 / 2}$ . However, we can show that the convergence rate of our estimator for the regression parameter β₀ is $n^{1 / 2}$ . The following theorem describes the large sample theory of our estimator. The detailed proof is given in Appendix 1. Theorem 1

Under Conditions (a)–(g), the estimator ${\hat{β}}^{r}$ is a consistent estimator of β₀ and $n^{1 / 2} ({\hat{β}}^{r} - β_{0})$ converges in distribution to a normal variable with mean 0 and variance Σ, which can be consistently estimated by $\hat{Ω} {({\hat{β}}^{r})}^{- 1} \hat{Φ} ({\hat{β}}^{r}) \hat{Ω} {({\hat{β}}^{r})}^{- 1}$ , where $\hat{Ω} (β) = n^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {S_{j}^{(2)} (t; β) / S_{j}^{(0)} (t; β) - {\bar{Z}}_{j} {(t; β)}^{\otimes 2}} d N_{i j}^{r} (t; \tilde{θ})$

and $\hat{Φ} (β) = n^{- 1} \sum_{i = 1}^{n} {\hat{Ψ}}_{i} {(β, \tilde{θ})}^{\otimes 2}$ with $\begin{array}{l} {\hat{Ψ}}_{i} (β, θ) = \sum_{j = 1}^{J} \int_{0}^{τ} {Z_{i j} (t) - {\bar{Z}}_{j} (t; β)} d {\hat{M}}_{i j}^{r} (t; β, θ) + \hat{Γ} (β, θ) H {(θ)}^{- 1} U_{i j} (θ), \\ d {\hat{M}}_{i j}^{r} (t; β, θ) = d N_{i j}^{r} (t; θ) - Y_{i} (t) \exp {β^{T} Z_{i j} (t)} d {\hat{μ}}_{0 j}^{r} (t; β, θ), \end{array}$ and $\hat{Γ} (β, θ) = n^{- 1} \sum_{i = 1}^{n} \sum_{j = 1}^{J} \int_{0}^{τ} {Z_{i j} (t) - {\bar{Z}}_{j} (t; β)} ρ_{i j} (t; θ) {\tilde{X}}_{i j} {(t)}^{T} {1 - R_{i} (t)} d N_{i \cdot} (t)$ where $ρ_{i j} (t; θ) = p_{i j} (t; θ) {1 - p_{i j} (t; θ)}$ .

The consistency of ${\hat{β}}^{r}$ can be proved via consistency of $\tilde{θ}$ and conventional convex theories. Large sample normality can be established via an approximation to $n^{1 / 2} ({\hat{β}}^{r} - β_{0})$ by a summation of independent and identical random vectors, as shown in Appendix 1. Note that the variation of ${\hat{β}}^{r}$ is larger than that for ${\hat{β}}^{n}$ which solves equation (2) assuming that there are no missing data, since the estimation for $p_{i j} (t)$ creates additional uncertainty when the event type is missing. This additional variation can be seen in the second term of ${\hat{Ψ}}_{i} (β, θ)$ . Empirical studies show that the efficiency loss compared to the estimator with no missing data may be rather small.

Let $A_{j} (t; β, θ) = - \int_{0}^{t} {\bar{Z}}_{j} (s; β) d {\hat{μ}}_{0 j} (s; β, θ),$ and $D_{j} (t; β, θ) = n^{- 1} \sum_{i = 1}^{n} \int_{0}^{t} S_{j}^{(0)} {(s; β)}^{- 1} ρ_{i j} (s; θ) {\tilde{X}}_{i j} (s) {1 - R_{i} (s)} d N_{i \cdot} (s)$

The following theorem describes the limiting properties of the baseline mean function estimator ${\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ})$ for $μ_{0 j} (t)$ in model (1).

Theorem 2

Under the same conditions of Theorem 1, the baseline mean function estimator ${\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ})$ is uniformly consistent for $μ_{0 j} (t), t \in [0, τ]$ , and $n^{1 / 2} {{\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ}) - μ_{0 j} (t)}$ converges weakly to a Gaussian process with mean 0 and covariance function $V_{j} (s, t), s, t \in [0, τ]$ , which can be consistently estimated by

{\hat{V}}_{j}^{r} (s, t) = n^{- 1} \sum_{i = 1}^{n} {\hat{ϕ}}_{i j} (s; {\hat{β}}^{r}, \tilde{θ}) {\hat{ϕ}}_{i j} (t; {\hat{β}}^{r}, \tilde{θ})

(13)

where $\begin{array}{l} {\hat{ϕ}}_{i j} (t; β, θ) = A_{j} {(t; β, θ)}^{T} \hat{Ω} {(β)}^{- 1} {\hat{Ψ}}_{i} (β, θ) \\ + D_{j} (t; β, θ) H {(θ)}^{- 1} \sum_{j = 1}^{J} U_{i j} (θ) + \int_{0}^{t} S_{j}^{(0)} {(s; β)}^{- 1} d {\hat{M}}_{i j}^{r} (s; β, θ) \end{array}$

The proof begins by decomposing ${\hat{ω}}_{j} (t) = {\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ}) - μ_{0 j} (t)$ as ${\hat{ω}}_{j}^{(1)} (t) + {\hat{ω}}_{j}^{(2)} (t)$ , where ${\hat{ω}}_{j}^{(1)} (t) = {\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ}) - {\hat{μ}}_{0 j}^{r} (t; β_{0}, θ_{0})$ and ${\hat{ω}}_{j}^{(2)} (t) = {\hat{μ}}_{0 j}^{r} (t; β_{0}, θ_{0}) - μ_{0 j} (t)$ . The uniform consistency of ${\hat{μ}}_{0 j}^{r} (t; {\hat{β}}^{r}, \tilde{θ})$ can be proved by showing that both $\sup_{t \in [0, τ]} | {\hat{ω}}_{j}^{(1)} (t) |$ and $\sup_{t \in [0, τ]} | {\hat{ω}}_{j}^{(2)} (t) |$ converge to 0. The uniform convergence of ${\hat{ω}}_{j}^{(2)} (t)$ can be proved using a law of large numbers for empirical processes and uniform convergence of ${\tilde{η}}_{0 j}$ to $η_{0 j}$ . The uniform convergence of ${\hat{ω}}_{j}^{(1)} (t)$ involves some additional assumptions. The details of the proof are provided in Appendix 1. The proof of weak convergence, which establishes tightness and convergence to finite-dimensional distributions, follows the standard tools in Pollard¹⁴ and van der Vaart and Wellner;¹⁵ see Appendix 1 for details.

4 Simulation study

In this section, we demonstrate the feasibility of our proposed method via comprehensive simulations. We first examine a scenario when the proportional baseline rate model (9) holds; therefore, the general model (1) also holds. We then investigate a scenario when the baseline rate functions are nonproportional, in that model (1) holds, but not model (9). For subject i, two types of recurrent events were simulated from two intensity functions sharing the same latent variable G_i, which was sampled from Gamma $(1 / α, α)$ with $E (G_{i}) = 1$ and var $(G_{i}) = α$ . In the first scenario, the intensity functions are assumed $λ_{i 1} (t) = G_{i} r_{01} t \exp (β_{1} {\tilde{Z}}_{i})$ and $λ_{i 2} (t) = G_{i} r_{02} t \exp (β_{2} {\tilde{Z}}_{i})$ with constants r₀₁ and r₀₂, while, in the second scenario, the intensity functions are $λ_{i 1} (t) = G_{i} r_{01} \exp (β_{1} {\tilde{Z}}_{i})$ and $λ_{i 2} (t) = G_{i} r_{02} h_{0} (t) \exp (β_{2} {\tilde{Z}}_{i})$ , where $h_{0} (t) = \exp {- \sin (t / 3) - 3 \cos (3 t)}$ . We let α = 0, 0.5, or 1 for different dependencies between two types of recurrent events, where α = 0 indicates two types of recurrent events are independent. We set $r_{01} = 0.125$ and $r_{02} = 0.0625$ in the first scenario and $r_{01} = r_{02} = 0.125$ in the second scenario. We let $β_{1} = 0$ or $\log (2)$ and $β_{2} = 0$ . The covariate ${\tilde{Z}}_{i}$ was randomly drawn from a Bernoulli distribution with probability 0.5. The censoring time was generated uniformly between 0 and 5 for each subject. In summary, there are 1.2–1.4 total number of events on average in these simulation scenarios, with a 1:2 or 1:3 ratio of type 1 events to type 2 events. The maximum number of events in a subject ranges from 6 to 11 events on average among the scenarios.

One can show that the rate functions can be expressed as $E {d N_{i j}^{*} (t) | Z_{i j} (t)} = \exp (β_{0}^{T} Z_{i j}) d μ_{0 j} (t)$ , where $β_{0} = {(β_{1}, β_{2})}^{T}, Z_{i j} = {(I (j = 1) {\tilde{Z}}_{i}, I (j = 2) {\tilde{Z}}_{i})}^{T}, d μ_{01} (t) = r_{01} t$ and $d μ_{02} (t) = r_{02} t$ in the first scenario, and $d μ_{01} (t) = r_{01}$ and $d μ_{02} (t) = r_{02} h_{0} (t)$ in the second scenario. Note that the weighted estimating equations method in Schaubel and Cai⁵ is unbiased in the first scenario if one uses ${\tilde{Z}}_{i}$ as the covariate in the logistic regression model for $p_{i 1} (t)$ . However, it is quite evident that the model is misspecified if one uses the same model in the second scenario.

We assumed that the probability of having a missing category was given by $1 - π_{i} (t) = {[1 + \exp {- ϵ^{'} z_{i} (t)}]}^{- 1}$ where $z_{i} (t) = (1, t, N_{i \cdot} (t -), {\tilde{Z}}_{i})'$ . We let $ϵ = (ϵ_{0}, ϵ_{t}, ϵ_{n}, ϵ_{z})'$ , with $ϵ_{t} = - 0.15, ϵ_{n} = 0.1$ , and $ϵ_{z} = 0, \log (2)$ , where $ϵ_{z} = 0$ indicated that the missingness depends on covariates and the missingness assumption is MAR. Various values of ϵ₀ were given to control the percentages of events with missing categories, denoted by $M_{p}$ . Here, we assume that the missingness does not depend on the event type, i.e., $π_{i 1} = π_{i 2} = π_{i}$ and $κ_{1} = 0$ .

Table 1 shows the simulation results for β₁ under 1,000 repetitions of sample size n = 200. We present the results of our proposed estimator ${\hat{β}}_{1}^{r}$ and the weighted estimating equations method ${\hat{β}}_{1}^{w}$ , where $z_{i} (t)$ was used as a covariate in the logistic regression model for $p_{i 1} (t)$ to derive ${\hat{β}}_{1}^{w}$ . We also present the results of the estimator ${\hat{β}}_{1}^{n}$ assuming that there is no missing event category. This approach is generally not feasible in practice but provides the best possible results in the ideal situation. Note that our proposed estimator is obtained under a more general model (1). Later, we will show that our estimator endures little efficiency loss even when the underlying model has proportional baseline rates. We report the average of biases in our replicated estimates, empirical standard deviation σ₁, and the relative mean-squared error to our proposed method, denoted by $e_{r}^{x} = m^{x} / m^{r}$ , where $m^{x} = {({\hat{β}}_{1}^{x} - β_{1})}^{2} + {(σ_{1}^{x})}^{2}, x = n, r, w$ . The result shows that our estimator provides comparable estimates when the weighted estimating equations method correctly specifies the model in the first scenario. The efficiency loss is minimal, as the relative mean-squared errors are all close to 1. When the model is misspecified by the weighted estimating equations method in the second scenario, the estimator has a relatively larger variation, especially when more events with missing type are present in the data. Our estimator on the other hand provides a more robust approach with significant efficiency improvements. Overall, our proposed estimator does not lose much efficiency compared to the ideal solution, while outperforming the current existing parametric estimator.

Table 1.

Simulation results are reported based on scenario 1, where data were generated from model (9), and scenario 2, where data were generated from model (1).

					${\hat{β}}_{1} - β_{1}$			σ ₁			Ratio
Scenario	β ₁	α	ϵ_z	$M_{p}$ (%)	${\hat{β}}_{1}^{n}$	${\hat{β}}_{1}^{r}$	${\hat{β}}_{1}^{w}$	${\hat{β}}_{1}^{n}$	${\hat{β}}_{1}^{r}$	${\hat{β}}_{1}^{w}$	$e_{r}^{n}$	$e_{r}^{w}$
1	0.69	0.5	0	10	0.008	0.009	0.010	0.230	0.233	0.233	0.97	1.00
				20		0.011	0.011		0.237	0.236	0.95	0.99
				30		0.010	0.011		0.242	0.241	0.90	0.99
			0.69	10		0.010	0.010		0.234	0.234	0.97	1.00
				20		0.011	0.011		0.238	0.237	0.93	0.99
				30		0.011	0.011		0.241	0.240	0.91	0.99
		1.0	0	10	0.010	0.011	0.011	0.257	0.260	0.260	0.98	1.00
				20		0.012	0.012		0.260	0.260	0.98	1.00
				30		0.012	0.012		0.263	0.262	0.96	0.99
			0.69	10		0.012	0.012		0.260	0.259	0.98	1.00
				20		0.012	0.012		0.261	0.260	0.97	0.99
				30		0.013	0.013		0.261	0.260	0.97	0.99
2	0	0	0	20	–0.001	0.001	–0.003	0.249	0.267	0.279	0.87	1.09
				30		0.003	–0.007		0.280	0.300	0.80	1.15
				40		0.006	–0.005		0.294	0.322	0.72	1.20
			0.69	20		0.001	–0.005		0.265	0.280	0.88	1.11
				30		0.005	–0.005		0.278	0.297	0.81	1.14
				40		–0.006	–0.008		0.297	0.323	0.71	1.19
		0.5	0	20	–0.010	–0.015	–0.018	0.287	0.307	0.314	0.88	1.05
				30		–0.013	–0.020		0.321	0.339	0.80	1.12
				40		–0.013	–0.027		0.325	0.346	0.78	1.14
			0.69	20		–0.017	–0.021		0.307	0.316	0.87	1.06
				30		–0.015	–0.024		0.315	0.330	0.83	1.10
				40		–0.016	–0.022		0.333	0.357	0.75	1.15
3	0.69	0	0	20	0.002	0.008	0.007	0.222	0.238	0.245	0.87	1.06
				30		0.006	0.007		0.251	0.263	0.78	1.10
				40		0.005	0.006		0.266	0.281	0.70	1.12
			0.69	20		0.005	0.005		0.234	0.244	0.90	1.08
				30		0.007	0.008		0.247	0.258	0.81	1.09
				40		0.002	0.007		0.257	0.270	0.75	1.11
		0.5	0	20	–0.001	–0.003	–0.003	0.248	0.264	0.273	0.89	1.07
				30		–0.004	0.000		0.270	0.287	0.85	1.13
				40		–0.004	–0.001		0.283	0.302	0.77	1.14
			0.69	20		–0.003	–0.004		0.259	0.269	0.92	1.07
				30		–0.002	0.000		0.270	0.282	0.85	1.10
				40		–0.004	–0.004		0.274	0.287	0.82	1.10

Table 2 demonstrates that the distribution of our estimator can be well approximated by a normal distribution in finite samples. We show the results based on model (1) in the second scenario when n = 200 and 400. The results based on model (9) in the first scenario are similar; hence omitted here. As one can see, the average of our standard error estimates, denoted by ${\bar{σ}}_{1}$ and ${\bar{σ}}_{2}$ , is close to the empirical standard deviation, σ₁ and σ₂, respectively, and the coverage rate $C_{p}$ for β₁ based on the 95% confidence interval is close to the nominal level. Meanwhile, the size of the Wald-type test $C_{0}$ for $β_{2} = 0$ is close to the given significance level at 0.05.

Table 2.

Simulation results of parameter estimations from our proposed method for model (1).

					${\hat{β}}_{1}^{r}$				${\hat{β}}_{2}^{r}$
$(β_{1}, β_{2})$	α	ϵ_z	$M_{p}$ (%)	n	Bias	σ ₁	${\bar{σ}}_{1}$	$C_{p}$	Bias	σ ₂	${\bar{σ}}_{2}$	$C_{0}$
(0,0)	0	0	20	200	0.001	0.267	0.272	0.961	–0.006	0.149	0.151	0.035
				400	–0.002	0.185	0.191	0.959	–0.005	0.105	0.106	0.041
			30	200	0.003	0.280	0.287	0.961	–0.007	0.152	0.154	0.040
				400	–0.002	0.190	0.199	0.960	–0.005	0.107	0.108	0.045
			40	200	0.006	0.294	0.294	0.957	–0.007	0.153	0.155	0.038
				400	0.003	0.198	0.207	0.961	–0.006	0.109	0.109	0.049
		0.69	20	200	0.001	0.265	0.272	0.967	–0.006	0.149	0.151	0.041
				400	–0.003	0.184	0.191	0.964	–0.004	0.106	0.106	0.048
			30	200	0.005	0.278	0.282	0.961	–0.008	0.150	0.153	0.037
				400	–0.001	0.190	0.198	0.955	–0.005	0.107	0.108	0.046
			40	200	–0.006	0.297	0.295	0.950	–0.005	0.153	0.155	0.041
				400	–0.006	0.200	0.207	0.958	–0.004	0.109	0.110	0.050
	0.5	0	20	200	–0.015	0.307	0.297	0.948	0.000	0.195	0.188	0.067
				400	–0.004	0.212	0.207	0.944	–0.004	0.134	0.133	0.060
			30	200	–0.013	0.321	0.306	0.947	0.000	0.196	0.190	0.062
				400	–0.004	0.222	0.215	0.941	–0.004	0.135	0.134	0.057
			40	200	–0.013	0.325	0.317	0.951	–0.001	0.199	0.192	0.056
				400	–0.005	0.228	0.222	0.950	–0.004	0.136	0.136	0.057
		0.69	20	200	–0.017	0.307	0.297	0.946	0.000	0.194	0.188	0.062
				400	–0.005	0.213	0.207	0.948	–0.004	0.133	0.133	0.056
			30	200	–0.015	0.315	0.306	0.952	–0.001	0.197	0.190	0.064
				400	–0.003	0.220	0.214	0.945	–0.005	0.135	0.134	0.056
			40	200	–0.016	0.333	0.318	0.950	–0.001	0.198	0.192	0.064
				400	–0.010	0.234	0.223	0.950	–0.003	0.138	0.136	0.059
(0.69,0)	0	0	20	200	0.008	0.238	0.234	0.958	–0.006	0.158	0.152	0.060
				400	0.002	0.163	0.164	0.948	–0.001	0.110	0.107	0.063
			30	200	0.006	0.251	0.246	0.957	–0.005	0.164	0.157	0.065
				400	0.003	0.170	0.170	0.952	–0.001	0.111	0.109	0.053
			40	200	0.005	0.266	0.251	0.947	–0.004	0.167	0.158	0.068
				400	0.004	0.177	0.177	0.950	–0.002	0.113	0.111	0.059
		0.69	20	200	0.005	0.234	0.232	0.955	–0.006	0.159	0.152	0.066
				400	0.000	0.160	0.163	0.954	–0.001	0.110	0.107	0.061
			30	200	0.007	0.247	0.239	0.948	–0.007	0.163	0.155	0.064
				400	0.002	0.168	0.168	0.950	–0.002	0.112	0.109	0.056
			40	200	0.002	0.257	0.248	0.954	–0.005	0.169	0.159	0.067
				400	0.002	0.173	0.174	0.956	–0.001	0.115	0.112	0.060
	0.5	0	20	200	–0.003	0.264	0.260	0.952	0.004	0.201	0.189	0.071
				400	–0.004	0.178	0.183	0.956	0.001	0.136	0.133	0.058
			30	200	–0.004	0.270	0.268	0.950	0.004	0.203	0.191	0.068
				400	–0.006	0.183	0.189	0.961	0.001	0.138	0.135	0.055
			40	200	–0.004	0.283	0.277	0.946	0.004	0.204	0.193	0.064
				400	–0.005	0.191	0.195	0.960	0.001	0.140	0.137	0.049
		0.69	20	200	–0.003	0.259	0.259	0.952	0.004	0.201	0.189	0.070
				400	–0.005	0.176	0.182	0.964	0.001	0.137	0.134	0.060
			30	200	–0.002	0.270	0.265	0.955	0.003	0.201	0.191	0.068
				400	–0.005	0.180	0.187	0.962	0.000	0.138	0.135	0.066
			40	200	–0.004	0.274	0.272	0.952	0.003	0.205	0.193	0.063
				400	–0.004	0.184	0.191	0.962	–0.001	0.140	0.137	0.055

As seen in Tables 1 and 2, our estimator is robust to the MAR assumption when $ϵ_{z} \neq 0$ . Both point and variance estimation are consistent. In fact, the efficiency is slightly better compared to the estimator when $ϵ_{z} = 0$ . We also set different values of κ₁ to examine the performance of our estimation method under the missingness not at random assumption. However, since the simulation results are similar, we do not report the results here.

5 CF registry data

CF is one of the most common life-shortening, autosomal recessive genetic disorders, affecting about 30,000 individuals in the United States.¹⁶ It is caused by mutations in the gene encoding the CF transmembrane conductance regulator.¹⁷ Chronic lung infection and associated inflammation lead to significant morbidity in CF, with respiratory failure the leading cause of mortality. PA, one of the major virulent pathogens in CF patients, is a well-known risk factor for CF lung disease progression and survival. Several baseline risk factors for PA acquisition were examined in Lai et al.¹⁸ Meconium ileus, late CF diagnosis through signs and symptoms, severe CF genotypes, and female gender are associated with a higher risk of acquiring PA. However, most of the PA cases examined in the study were initial infections, which may be transient and less predictive of negative outcomes. On the other hand, mucoid PA, which is thought to develop after recurrent infections, is likely more critical to a patient’s lung disease progression.¹⁹ Therefore, regression modeling for different PA types, i.e., mucoid and nonmucoid, is important, since the baseline risk factors may differentially impact different infection types in various manners.

In this section, we extended the analysis in Lai et al.¹⁸ to multiple event types using the United States 2016 CF Foundation Patient Registry (CFFPR), in which baseline characteristics, such as genotype, phenotype, and other prognosis factors, are recorded upon enrollment. The CFFPR documents the diagnosis and follow-up of 29,887 individuals with CF in the registry. We aim to model the nonmucoid and mucoid PA occurrence rates in association with three baseline risk factors, which include (1) gender, (2) genotype, categorized based on the most common mutation: F508del homozygous, F508del heterozygous, and neither or unknown, and (3) method of diagnosis, categorized in four groups described elsewhere:¹⁸ newborn screening, meconium ileus, family history without symptom, and symptom and sign. Medication use for chronic PA infections and study site/center is included in the model for adjustment of possible confounding effects.

In the 2016 registry, we identified 14,888 patients who were born after 1997 and had complete baseline risk data in 188 accredited CF centers. In summary, half of these patients were male, 47% were F508del homozygous, 39% were F508del heterozygous, 47% were diagnosed by newborn screening, 31% were diagnosed by emerging symptoms and signs, and 19% and 3% were diagnosed by meconium ileus and family history, respectively. In the follow-up visits, there were 27,288 nonmucoid and 6,323 mucoid PA infections, in addition to 5,445 culture positives for both nonmucoid and mucoid types at the same visit. Since our method assumes two kinds of events cannot occur simultaneously, we treated those visits with both infections as the third type of recurrent event. Meanwhile, there were 5,392 culture positives in PA but with unknown status, which is in a high frequency of missing event type (12% of the total events). Addressing the missing information is highly desirable.

Table 3 shows the estimation results by the complete-case analysis and our proposed rate proportion method, assuming that the missingness does not depend on the event category, i.e., $κ_{1} = κ_{2} = 0$ . We used AIC to choose the number of interior knots in the B-spline function in model (5). Up to five interior knots were examined, the minimum value of AIC was achieved by using a cubic function without any interior knots. We report rate ratio, $\exp (β)$ , and its 95% confidence interval (95% CI) with Wald-type test p-value with respect to a reference group. We also report the p-value, indicated with a dagger, for the overall comparison among levels in genotype and diagnostic method.

Table 3.

Summary table for the complete-case analysis and rate proportion method.

	Complete-case			Rate proportion
Covariates	$\exp (β)$	95% CI	p-value	$\exp (β)$	95% CI	p-value
	Nonmucoid PA infection
Female	1.09	1.03, 1.15	0.001	1.08	1.01, 1.15	0.017
Genotype F508del			<0.001^a			<0.001^a
Homozygous	1.00	–	–	1.00	–	–
Heterozygous	0.89	0.83, 0.94	<0.001	0.89	0.84, 0.94	<0.001
Neither or unknown	0.89	0.77, 1.03	0.121	0.90	0.83, 0.97	0.008
Diagnostic method			0.079^a			0.025 $†$
Newborn screening	1.00	–	–	1.00	–	–
Meconium ileus	1.15	0.99, 1.34	0.069	1.10	1.01, 1.21	0.037
Family history	0.99	0.77, 1.27	0.946	0.94	0.80, 1.10	0.445
Symptom	1.13	0.93, 1.35	0.212	1.07	0.98, 1.16	0.127
Medication	1.85	1.64, 2.09	<0.001	1.88	1.68, 2.09	<0.001
	Mucoid PA infection
Female	1.18	1.01, 1.38	0.038	1.16	1.08, 1.24	<0.001
Genotype F508del			0.298^a			0.089 $†$
Homozygous	1.00	–	–	1.00	–	–
Heterozygous	0.95	0.81, 1.12	0.547	0.95	0.86, 1.06	0.374
Neither or unknown	1.20	0.91, 1.59	0.199	1.21	1.00, 1.46	0.052
Diagnostic method			0.004^a			<0.001 $†$
Newborn screening	1.00	–	–	1.00	–	–
Meconium ileus	1.18	0.91, 1.54	0.212	1.13	0.99, 1.29	0.063
Family history	1.45	0.96, 2.21	0.079	1.38	1.02, 1.87	0.035
Symptom	1.59	1.17, 2.15	0.003	1.51	1.30, 1.75	<0.001
Medication	3.06	2.57, 3.66	<0.001	2.98	2.71, 3.28	<0.001
	Both PA infection
Female	1.22	1.03, 1.44	0.018	1.21	1.12, 1.31	<0.001
Genotype F508del			0.287^a			0.217 $†$
Homozygous	1.00	–	–	1.00	–	–
Heterozygous	0.92	0.78, 1.09	0.341	0.93	0.81, 1.06	0.258
Neither or unknown	1.11	0.87, 1.42	0.388	1.13	0.90, 1.42	0.294
Diagnostic method			<0.001^a			<0.001^a
Newborn screening	1.00	–	–	1.00	–	–
Meconium ileus	1.62	1.24, 2.11	<0.001	1.54	1.32, 1.80	<0.001
Family history	1.91	1.11, 3.28	0.019	1.78	1.23, 2.58	0.002
Symptom	1.99	1.47, 2.70	<0.001	1.88	1.68, 2.11	<0.001
Medication	2.47	1.98, 3.08	<0.001	2.39	2.13, 2.68	<0.001

Refer to the overall comparison among levels in genotype and diagnostic method.

As one can see, in the nonmucoid PA acquisition, both methods agree that female gender and heterozygous F508del genotype are significantly associated with the infection rate. While female has around 8% higher infection rate than male, patients with heterozygous F508del genotype have 11% lower infection rate than those with homozygous F508del genotype. However, the two methods have different test results in neither F508del nor unknown genotype and diagnosed by meconium ileus. While both methods have similar rate ratios, our proposed method, with more data points involved, has a narrower confidence interval and significant test result, which leads one to conclude that patients with a mild genotype (heterozygous F508del, neither F508del genotype, or unknown) have 10% lower nonmucoid PA infection rate than those with a more severe genotype (homozygous F508del), and patients diagnosed by with meconium ileus have 10% higher infection rate than those diagnosed by newborn screening. The finding is consistent with Lai et al.¹⁸ and other scientific reports. The disparity between the complete-case analysis and our method can be considered as an evidence against the missingness completely at random assumption. In fact, based on a multiple logistic regression model for the likelihood of missingness, the probability of missing event type is significantly associated with age, gender, genotype, and frequency of previous events. The results demonstrate that missing event type is more likely to occur in younger patients, in female patients, in patients with mild genotypes, and in patients with more prior PA infections.

Two methods otherwise have similar results in the mucoid PA acquisition and in both the types of infection in the same culture. One difference is that patients diagnosed by family history without symptoms are statistically significant in the mucoid PA infection using our proposed method, but not significant using the complete-case analysis.

The baseline mean function estimation for the three types of infections by our proposed estimator is shown in Figure 1. The testing for the proportionality of the baseline rate functions based on the Wald-type test is significant with p-value <0.0001, suggesting that model (9) is not a good fit to our data. Hence, we only report the results for the model (1). We also assess the assumption of equal probability of missingness by assuming that the log-ratio of the missingness probability $κ_{j} (t | Z_{i j})$ in model (5) is possibly non-zero. Here, we implemented different values of κ₁ and κ₂, ranging from –1.5 to 1.5, to explore the sensitivity of the parameter estimation when the missingness is not at random. The result in the supplementary material shows that our estimation is quite robust to the violation of the MAR assumption, as the changes in the point and variance estimates of the regression coefficients are minimal, even when κ₁ and κ₂ are large.

Figure 1.

The baseline mean function estimation by our proposed method is shown in log-scale for nonmucoid (solid), mucoid (dash), and both (dot) PA infections.

6 Conclusion and discussion

It is worth noting that the same proportionality property exploited by our method was also discussed in an intensity-based recurrent event model²⁰ and in competing risk models with missing or uncertain cause of failure.^21–24 A semiparametric framework for the estimation of $p_{i j} (t)$ otherwise has never been explored. It is widely anticipated that the semiparametric estimation will be more robust if the underlying unknown function is indeed time-dependent, which is quite likely in practice with time to event data.

The estimation procedure in this paper is tailored for the proportional rates model. It may not be feasible for a nonproportional rates model since the ratio of the rate functions may not be log-linearly correlated with covariates. It would be of interest to develop a more general approach for different types of rate models. One possibility is to derive the probability of missingness and then inversely weigh the estimating equations for unbiased estimation. Along these lines, one may also utilize the nonparametric estimation for the rate function to construct a doubly robust estimator, providing additional protection against the model misspecification.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802211023975 - Supplemental material for Semiparametric estimation of the proportional rates model for recurrent events data with missing event category

Supplemental material, sj-pdf-1-smm-10.1177_09622802211023975 for Semiparametric estimation of the proportional rates model for recurrent events data with missing event category by Feng-Chang Lin, Jianwen Cai, Jason P Fine, Elisabeth P Dellon and Charles R Esther in Statistical Methods in Medical Research

Footnotes

Acknowledgements

The authors would like to thank the Cystic Fibrosis Foundation for the use of CFFPR data to conduct this study. Additionally,the authors would like to thank the patients,care providers,and clinic coordinators at CF Centers throughout the United States for their contributions to the CFFPR. The authors also like to thank Dr Huichuan Lai for her thoughtful comments on an early draft of the paper.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research,authorship,and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research,authorship,and/or publication of this article: National Center for Advancing Translational Sciences (NCATS),National Institutes of Health,through Grant Award Number UL1TR002489,and by the National Cancer Institute (NCI) through Grant Award Number P01CA142538.

ORCID iD

Feng-Chang Lin

References

Henry

Mellis

Petrovic

Mucoid Pseudomonas aeruginosa is a marker of poor survival in cystic fibrosis. Pediatr Pulmonol 1992; 12: 158–161.

Konstan

Morgan

Butler

, et al. Risk factors for rate of decline in forced expiratory volume in one second in children and adolescents with cystic fibrosis. J Pediatr 2007; 151: 134–139.

Kosorok

Farrell

, et al. Longitudinal development of mucoid Pseudomonas aeruginosa infection and lung disease progression in children with cystic fibrosis. JAMA 2005; 293: 581–588.

Chen

Wang

Cai

, et al. Semiparametric additive marginal regression models for multiple type recurrent events. Lifetime Data Anal 2012; 18: 504–527.

Schaubel

Cai

Rate/mean regression for multiple-sequence recurrent event data with missing event category. Scand J Stat 2006; 33: 191–207.

Cai

Schaubel

Marginal means/rates models for multiple type recurrent event data. Lifetime Data Anal 2004; 10: 121–138.

Zhao

Sun

, et al. A semiparametric additive rates model for multivariate recurrent events with missing event categories. Comput Stat Data Anal 2015; 89: 39–50.

Lin

Cai

Fine

, et al. Nonparametric estimation of the mean function for recurrent event data with missing event category. Biometrika 2013; 100: 727–740.

Carroll

Fan

Gijbels

, et al. Generalized partially linear single-index models. J Am Stat Assoc 1997; 92: 477–489.

10.

Hastie

Tibshirani

Generalized additive models. London, UK: Chapman & Hall, 1990.

11.

Wang

Liu

Liang

, et al. Estimation and variable selection for generalized additive partial linear models. Ann Stat 2011; 39: 1827–1851.

12.

Little

RJA

Rubin

DB.

Statistical analysis with missing data. New York, NY: Wiley, 2002.

13.

Hastie

Tibshirani

Friedman

The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York, NY: Springer, 2009.

14.

Pollard

Empirical processes: theory and applications. Hayward, CA: Institute of Mathematical Statistics, 1990.

15.

van der Vaart

Wellner

JA.

Weak convergence and empirical processes. New York, NY: Springer, 1996.

16.

Cystic Fibrosis Foundation. CF Foundation Patient Registry annual data report, 2017.

17.

Riordan

Rommens

Kerem

, et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 1989; 245: 1066–1073.

18.

Lai

Cheng

Cho

, et al. Association between initial disease presentation, lung disease outcomes, and survival in patients with cystic fibrosis. Am J Epidemiol 2004; 159: 537–546.

19.

Folkesson

Jelsbak

Yang

, et al. Adaptation of Pseudomonas aeruginosa to the cystic fibrosis airway: an evolutionary perspective. Nat Rev Microbiol 2012; 10: 841–851.

20.

Chen

Cook

RJ.

The analysis of multivariate recurrent events with partially missing event types. Stat Med 2009; 24: 671–691.

21.

Craiu

Duchesne

Inference based on the EM algorithm for the competing risks model with masked causes of failure. Biometrika 2004; 91: 543–558.

22.

Tsiatis

AA.

Multiple imputation methods for estimating regression coefficients in the competing risks model with missing cause of failure. Biometrics 2001; 57: 1191–1197.

23.

Liang

Analysis of competing risks data with missing cause of failure under additive hazards model. Stat Sin 2008; 18: 219–234.

24.

Racine-Poon

Hoel

DG.

Nonparametric estimation of the survival function when cause of death is uncertain. Biometrics 1984; 40: 1151–1158.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.16 MB