
Abstract

A rating scale is a set of categories designed to obtain information about a quantitative or a qualitative attribute. Item response theory (IRT) proposes that a probability function over a single latent variable represents the overall attribute evolution that the scale is designed to assess. Here we utilize an information theory approach to IRT to analyze rating scale data. The proposed IRT analyses, based on surprisal, offer new tools for assessing raters, rated items, and the whole rating scale. The information transformation from probability to surprisal is a new lens from which to view choice data and is an important augmentation of probability-based IRT. It also offers new graphical tools to measure the amount of information captured by an item in an additive metric, and to measure covariation among items using mutual information. The proposed methodology is illustrated using two scales from real clinical data and the proposed approach is compared with analyses made with the commonly used parametric IRT graded response model. Practical implications of the proposed methodology are provided.

Keywords

surprisal, information manifold, scope, scale information, score index, entropy, mutual entropy

1. Introduction

Self-reported rating scales are a mainstay for communicating people’s experiences to their medical teams. They capture certain aspects of the participant’s experience that cannot be measured by clinical or performance-based measures and are therefore essential in evaluating new therapies and gauging disease status. As clinical and research tools, scales must be critically evaluated for their accuracy and internal consistency, and their analysis should therefore provide a wide range of graphical and numerical outputs that can lead to improved items and in-depth assessments of what the scale is designed to measure.

The usual practice is to compute the sum score. The sum score $y_{j}$ is the sum of the weights $α_{im}$ for the chosen options, which can be expressed as $y_{j} = \sum_{i = 1}^{n} \sum_{m = 1}^{M_{i}} α_{im} U_{jim}$ , where $j = 1, \dots, N$ is the participant index, $i = 1, \dots, n$ is the item index, and $m = 1, \dots, M_{i}$ is the option index. A particular choice is recorded by $U_{jim}$ , which is 1 if option $m$ is chosen for item $i$ by participant $j$ , and 0 otherwise.
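As a minimal sketch of the sum score formula, the following Python fragment stores each choice as the index of the selected option, which is equivalent to the indicator $U_{jim}$ . The function name, the weights, and the toy responses are illustrative assumptions and are not taken from the study data.

```python
def sum_scores(choices, weights):
    """Sum score y_j for every participant.

    choices[j][i] is the 0-based index of the option chosen by participant j
    on item i (equivalent to U_jim = 1 for that option and 0 otherwise).
    weights[i][m] is the weight alpha_im of option m of item i.
    """
    return [sum(weights[i][m] for i, m in enumerate(row)) for row in choices]


# Made-up example: 3 participants, 2 items, 4 options weighted 0..3.
weights = [[0, 1, 2, 3], [0, 1, 2, 3]]
choices = [[0, 3], [1, 1], [2, 3]]
scores = sum_scores(choices, weights)
print(scores)  # [3, 2, 5]
```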

A disadvantage of sum scores is that all items contribute equally to the overall score and thus implicitly to the underlying trait the test aims to measure. Instead of sum scores, item response theory (IRT) models, with the exception of the Rasch model, aim to produce more sophisticated scores that are less subject to various biases and have higher accuracy (see Wiberg et al., 2019). IRT models let the multinomial response probability $P_{im} (θ)$ of choice option $m$ within item $i$ vary over a continuous latent variable $θ$ . Three popular parametric IRT models for rating scale data are the graded response model (GRM; Samejima, 1969), partial credit model (Masters, 1982), and the heterogeneous GRM (Masters, 1982). These models define probability curves in terms of the exponential of the two-parameter linear transform $a_{im} (θ - b_{im})$ for curve $m$ on item $i$ , where $θ$ varies over the entire real line. Parameters $a_{im}$ and $b_{im}$ allow for variations in slope and location of the option response functions and hence represent discrimination and difficulty, respectively. The nominal model of Bock (1972) allows the two option parameters to be estimated, but the partial credit model fixes $a_{im}$ to one and the GRM has a single slope value $a_{i}$ for each item.

A nonparametric IRT model for polytomously scored items was introduced by Molenaar (1997), followed by other nonparametric/semiparametric approaches. For example, Emons (2008) studied the effectiveness of generalizations of nonparametric person-fit statistics to polytomous item response data. Further, Stochl et al. (2012) studied the Mokken scale with polytomous rating scale health data coded as 1–2–3–4, but recoded them into 0–0–1–1, that is, making the items binary before analyzing the data. Falk and Cai (2016) compared the monotonic polynomial method to nonparametric and semiparametric alternatives.

After reviewing these nonparametric and semiparametric developments in modeling polytomous item responses, we now turn to the semiparametric approach that forms the basis of our current study. We build on earlier work which employs data smoothing methods developed in Ramsay and Silverman (2005) and Ramsay et al. (2009). Previous research within this area has mainly focused on the analysis of data from multiple-choice tests (Li et al., 2019; Ramsay & Wiberg, 2017; Ramsay et al., 2020a; Wiberg et al., 2019). Recently, Wallmark et al. (2023) proposed to model scores with information theory. They concluded that their proposed method was a reasonable alternative to the generalized partial credit model. Although Ramsay et al. (2020) examined rating scales, our research differs from that study in several important ways. First, we compare the proposed approach with the GRM, which has not been done before. This is important as the GRM is typically used in these situations. Second, we utilize the scope metric, as described in Section 2.2, to compare the efficiency of the models fitted to datasets from different depression questionnaires. The metric properties of the scope allow for direct comparisons that cannot be done with any other metrics. Third, we introduce mutual entropy as a measure of the amount of information shared by two items. Mutual entropy is particularly useful for evaluating the common IRT assumption of local independence, which states that the item responses are independent given $θ$ . Fourth, we propose new graphical tools associated with the above points and explain how to interpret them; the code will be made available on GitHub and serve as a vignette for similar analyses.

The rest of the article is structured as follows: The Methods part comprises Section 2.1, in which the GRM is briefly described as it is used for comparison in the empirical study. In Section 2.2, we develop some useful tools for surprisal-based IRT, including scope and the proposal of two- or three-dimensional plots that display transitions between different phases as the information collected moves from zero to the total scale amount. In Section 2.3, mutual entropy is introduced. Section 2.4 discusses more technical details of the estimation process; readers may skip this part if they prefer. The third and fourth sections contain an empirical illustration using real clinical data on two scales and compare the results for the scales and the items under the proposed methodology with those of the GRM. The article ends with a discussion and some concluding remarks, including how the proposed methodology can be used in practice.

2. Methods

2.1 The Graded Response Model

The parametric IRT model GRM (Samejima, 1969) can be used to model items that are scored in more than two ordered polytomous categories. These include rating scales and Likert scales. Using the indices defined above, the GRM defines the probability of a randomly chosen participant with ability $θ$ choosing each option of item $i$ as

$p_{i, m} (θ) = \begin{cases} 1 - \frac{1}{1 + \exp (- a_{i} (θ - b_{i, m + 1}))}, & m = 1 \\ \frac{1}{1 + \exp (- a_{i} (θ - b_{i, m}))} - \frac{1}{1 + \exp (- a_{i} (θ - b_{i, m + 1}))}, & 2 \leq m < M_{i} \\ \frac{1}{1 + \exp (- a_{i} (θ - b_{i, m}))}, & m = M_{i} \end{cases},$ (1)

where $b_{i, m}$ are the item threshold parameters and $a_{i}$ is the item discrimination parameter.
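Equation 1 can be sketched in a few lines of Python by differencing cumulative logistic curves. This is only an illustration under assumed parameter values; the function names and the numbers passed to them are hypothetical and do not come from the fitted models.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def grm_probs(theta, a, thresholds):
    """Option probabilities of one item under Equation 1.

    thresholds holds the increasing values b_{i,2}, ..., b_{i,M_i},
    so the item has len(thresholds) + 1 options.
    """
    # Cumulative probabilities of choosing option m or higher,
    # padded with 1 (lowest option) and 0 (beyond the highest option).
    cum = [1.0] + [logistic(a * (theta - b)) for b in thresholds] + [0.0]
    return [cum[m] - cum[m + 1] for m in range(len(cum) - 1)]


# Hypothetical four-option item evaluated at theta = 0.
probs = grm_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.2])
print([round(p, 3) for p in probs])
```

Differencing the padded cumulative vector reproduces the three cases of Equation 1 and guarantees that the option probabilities sum to one.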

2.2 Surprisal and Probability IRT

Information theory was originally defined by Shannon (1948), and two recent resources are Cover and Thomas (2006) and Stone (2022). In information theory, the terms information content, self-information, surprisal, and Shannon information are used interchangeably for a basic quantity derived from the probability of a particular outcome of a random variable. Here we will use the term surprisal, introduced by Tribus (1961), and denote it by $S$ or $s$ . An important simple transformation of probability $P$ to surprisal is

$S = - \log (P) \quad \text{or} \quad P = \exp (- S)$ (2)

within contexts where no probability will be exactly 0. The surprisal transformation is used in the definitions of the log-odds transform, negative log-likelihood, divergence (Kullback, 1959), and in the theory of choice in mathematical psychology (Luce, 1959). Surprisal and probability can thus be seen as two different ways of looking at an option choice. But surprisal has an important advantage: it is a measure that has what Stevens (1946) called a ratio scale, for which we prefer the term additive scale; that is, it has (a) a special meaning for zero, (b) nonnegativity, and (c) the property that the ratio of differences means the same thing at every location. The use of surprisal instead of probability also makes computations faster.

As seen in Equation 2, surprisal is defined as the negative logarithmic transformation of probability without a particular base and can be interpreted as quantifying the level of “surprise” of a particular event. Surprisal and probability are linked together, and by using either of them, we are viewing choice through two complementary lenses, each having its own interpretation. To explain surprisal further, assume a roulette wheel with $M$ pockets, where $M > 1$ is an arbitrary positive integer. The probability that a ball ends up in a fixed pocket $n$ times in a row is $P = {(1 / M)}^{n}$ , and the corresponding surprisal with base $M$ is $S_{M} (P) = - \log_{M} (P) = n$ . Therefore, for an arbitrary positive probability $P$ , the corresponding $S_{M} (P)$ , which is a real number rather than an integer, represents the expected number of times over a sequence of trials that a ball ends up in the fixed pocket, given probability $P$ . Surprisal $S_{M} (P)$ with log base $M$ is therefore a counting number, and a surprisal value of zero means certainty, that is, $P = 1$ . Surprisal values have a natural unit, called $M$ -bit, since $S_{M} (1 / M) = - \log_{M} (1 / M) = 1$ . For a multinomial probability vector $P$ and corresponding surprisal vector $S$ of length $M$ , the use of log-base $M$ is handy when vectors of different lengths need to be compared. A base 2 surprisal $S_{2}$ and a base $M$ surprisal $S_{M}$ are related by $S_{M} = (\log_{M} 2) S_{2}$ ; for example, for a rating scale with three options, $S_{3} = (\log_{3} 2) S_{2} = 0.63 S_{2}$ . To simplify notation, from this point on, we drop the subscript on $S$ and assume that the log base is the length $M$ of the multinomial vector.
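The roulette example and the change of base can be checked numerically. The short Python sketch below uses helper names of our own choosing (`surprisal`, `probability`) and an arbitrary probability of 0.2 for the base conversion.

```python
import math


def surprisal(p, base):
    """Surprisal of probability p in units of `base`-bits."""
    return -math.log(p) / math.log(base)


def probability(s, base):
    """Inverse transformation from surprisal back to probability."""
    return base ** (-s)


M = 3
n = 4
# A ball landing in the same pocket of an M-pocket wheel n times in a row.
s_roulette = surprisal((1.0 / M) ** n, M)

# Change of base: S_3 = (log_3 2) * S_2 for the same probability.
s2 = surprisal(0.2, 2)
s3 = surprisal(0.2, 3)
factor = math.log(2) / math.log(3)

print(round(s_roulette, 6), round(factor, 2))  # 4.0 0.63
```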

For the more general situation, where probability is a function of a latent variable $θ$ ,

$S (θ) = - \log_{M} P (θ) \quad \text{and} \quad P (θ) = M^{- S (θ)} .$ (3)

For the methods introduced here, the role of $θ$ is as a continuous index of the positions of points along the surprisal curve, rather than that of a fixed independent variable in a function. Any smooth one-to-one transformation $h (θ)$ of $θ$ is also an index of the curve since we can always transform surprisal as $S^{*} [h (θ)] = S (θ)$ so that a surprisal value remains unchanged. We refer to $θ$ as a score index rather than as a latent variable to emphasize that its value is infinitely transformable. A convenient choice of score index is the percentage rank of the sum score, which is therefore uniformly distributed over $[0, 100]$ .

For simplicity and illustration purposes, the discussion below focuses on $M = 3$ ; higher-dimensional cases generalize easily. Figures 1a and 2a show option probability curves and option surprisal curves of two items, where the score index is 21 equally spaced numbers between 0 and 100. We propose here to use the values of the three probability/surprisal curves as coordinates $(p_{1} (θ), p_{2} (θ), p_{3} (θ))$ or $(s_{1} (θ), s_{2} (θ), s_{3} (θ))$ to generate a three-dimensional curve, called the item probability curve (Figure 1b and c) or the item surprisal curve (Figure 2b and c). Such curves can be used to compare items.

Figure 1.

Example of option probability curves of two items (a), and their corresponding item probability curves in the probability manifold (b, c): two views of the probability manifold.

Figure 2.

Example of option surprisal curves of two items (a), and their corresponding item surprisal curves in the surprisal manifold (b, c): two views of the surprisal manifold.

From Figure 1a, it can be seen that item A, with three active options, is more informative than item B, where only two options are active and option 1 is the most popular one. In the 3D plot (Figure 1b and c), because the probabilities of the three options of each item must satisfy the conditions

$P_{i, 1} + P_{i, 2} + P_{i, 3} = 1, 0 < P_{i, 1}, P_{i, 2}, P_{i, 3} < 1,$ (4)

the set of nonsingular probability vectors $P$ of length three defines a 2D subspace within the 3D space, which is a flat equilateral triangle, and the two item probability curves have to lie within this subspace. Here, the subspace is a manifold, which we denote as $M_{P}$ . The term manifold is used for smooth, low-dimensional structures contained within a higher-dimensional space. Probability is not an additive scale because the ratio of differences in different locations varies. Therefore, in the probability manifold, points on the item probability curves are no longer equally spaced, but cluster around where the probabilities are either 0 or 1.

Figure 2b and c display two views of the curved surprisal manifold that is defined by the below condition:

$P_{i, 1} + P_{i, 2} + P_{i, 3} = M^{- S_{i, 1}} + M^{- S_{i, 2}} + M^{- S_{i, 3}} = 1, S_{i, 1}, S_{i, 2}, S_{i, 3} > 0$ (5)

The manifold is in the positive part of the three-dimensional space, is bounded below by 0, is unbounded above, and has the shape of the interior of a soup bowl. The point on the surface nearest to the origin $(0, 0, 0)$ has all three probabilities equal to $1 / 3$ , a position that information theory calls maximum entropy. Note that surprisal is an additive scale, which can be confirmed by the two 3D item surprisal curves, where the points in each curve remain equally spaced. Note also that a single viewing angle cannot fully convey a 3D plot; in practice, when performing the analysis on a computer, the 3D plot can be viewed from different angles.

The slope of surprisal, $d s_{im} (θ) / d θ$ , plays a central role in information-based IRT. Positions along the space curve shown in Figure 2 can be measured by arc length, computed by integrating the total slope of item surprisal curve $i$ :

$d_{i}^{S} (θ) = \int_{t = 0}^{θ} \sqrt{\sum_{m}^{M} {(\frac{d s_{i m} (t)}{d t})}^{2}} d t,$ (6)

where the variable of integration $t$ is a score index value. Because this integral depends only on the surprisal values of each of the $M$ curves, the arc length $d_{i}^{S} (θ)$ that it computes is invariant over smooth one-to-one transformations of the score index $θ$ . The longer the item surprisal curve, the more revealing and informative the choices for that item are, and we use the term item scope as a description of the power of the item that it represents.
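The invariance of arc length under reindexing can be illustrated numerically. The sketch below approximates Equation 6 by the length of a fine polyline through points on a made-up three-option surprisal curve; the softmax form of the curve, its coefficients, and the warping function $h (u) = 100 (u / 100)^{2}$ are assumptions chosen only for illustration.

```python
import math


def surprisal_curve(theta, M=3):
    """Hypothetical three-option item: softmax probabilities as M-bit surprisals."""
    z = [0.0, 0.06 * (theta - 40.0), 0.12 * (theta - 60.0)]
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [-math.log(v / total) / math.log(M) for v in e]


def arc_length(curve, index_values):
    """Polyline approximation to the arc length integral in Equation 6."""
    pts = [curve(t) for t in index_values]
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))


n = 2000
grid = [100.0 * k / n for k in range(n + 1)]
# The same curve visited through a warped, still one-to-one, score index.
warped = [100.0 * (u / 100.0) ** 2 for u in grid]

scope_original = arc_length(surprisal_curve, grid)
scope_warped = arc_length(surprisal_curve, warped)
print(round(scope_original, 3), round(scope_warped, 3))
```

Both calls trace the same curve in surprisal space and differ only in how the points are spaced along it, so the two approximations agree up to discretization error.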

For the whole scale, a scale surprisal curve of dimensionality $\sum_{i}^{n} M_{i}$ is also defined for the $n$ item surprisal curves $d_{i}^{I} (θ)$ jointly varying within the space. Arc length distance $d_{I}^{S} (θ)$ along the curve is defined by the indefinite integral:

$d_{I}^{S} (θ) = \int_{t = 0}^{θ} \sqrt{\sum_{i}^{n} \sum_{m}^{M} {(\frac{d s_{i m} (t)}{d t})}^{2}} d t,$ (7)

It measures the amount of information in $M$ -bits provided by the sequence of $n$ choices; we call this the scale scope. Because of the $M$ -bit unit, distances on the scale scope have an inherent meaning, as opposed to the arbitrary $θ$ scores. For nonspecialists, it is also easier to understand something that starts from zero. Figure 7 shows the scale scopes for the empirical illustration.

Since surprisal curves do not change with modifications of the score indexing system, arc length along this curve is the ideal representation of the amount of information collected by each item or the whole scale. Using arc length, two researchers working with different score index configurations can compare their results directly, and two equal length intervals are always comparable.
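When scales with different numbers of options are compared, scope values can be put on a common unit. Because every surprisal coordinate rescales by the same factor under a change of log base, so does arc length. The sketch below assumes that all items of a scale share the same number of options $M$ ; the function name and the scope values are placeholders, not results.

```python
import math


def scope_to_two_bits(scope, M):
    """Convert a scope measured in M-bits to 2-bits, using S_2 = (log_2 M) * S_M."""
    return scope * math.log2(M)


# Placeholder scope values for a four-option and a five-option scale.
print(scope_to_two_bits(1.0, 4))            # 2.0
print(round(scope_to_two_bits(1.0, 5), 3))  # 2.322
```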

2.3 Mutual Entropy: A Measure of Inter-Item Dependence

The intimate relationship between information theory and probability theory is especially evident in the core concept of the entropy, $\bar{S}$ , of a nonsingular multinomial vector $P$ . The entropy ${\bar{S}}_{i} (θ)$ of multinomial vector $P_{i} (θ)$ is

${\bar{S}}_{i} (θ) = - \sum_{m}^{M} p_{im} (θ) \log p_{im} (θ) = \sum_{m}^{M} p_{im} (θ) s_{im} (θ) .$ (8)

We use the notation $\bar{S}$ for entropy because it is the weighted mean or expectation of option surprisal values. Entropy is often described as a measure of disorder because the maximum value of $\bar{S}$ for a fixed $M$ is attained when $p_{m} = 1 / M, m = 1, \dots, M$ , and entropy approaches zero when a single probability approaches 1. Entropy is the amount of information required to exactly predict an option choice given choice probabilities in $P_{i}$ .
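A small numerical check of these properties, written as a Python sketch with the log base set to the vector length $M$ as assumed earlier in this section (the function name and the probability vectors are our own):

```python
import math


def entropy(p, base=None):
    """Entropy of a multinomial vector p as the probability-weighted mean surprisal."""
    base = len(p) if base is None else base
    return sum(-q * math.log(q) / math.log(base) for q in p if q > 0)


uniform = [1.0 / 3.0] * 3      # maximum entropy: exactly one 3-bit
peaked = [0.98, 0.01, 0.01]    # one option nearly certain: entropy near zero

print(round(entropy(uniform), 6), round(entropy(peaked), 3))
```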

The joint entropy of two multinomial vectors $P_{i} (θ)$ and $P_{ℓ} (θ)$ of lengths $M_{i}$ and $M_{ℓ}$ , respectively, and their corresponding surprisal vectors $S_{i} (θ)$ and $S_{ℓ} (θ)$ is

${\bar{S}}_{i ℓ} (θ) = \sum_{m}^{M_{i}} \sum_{n}^{M_{ℓ}} p_{mn}^{i ℓ} (θ) s_{mn}^{i ℓ} (θ),$ (9)

where $p_{mn}^{i ℓ} (θ)$ and $s_{mn}^{i ℓ} (θ)$ are the joint probability and surprisal of two option choices $m$ and $n$ within items $i$ and $ℓ$ , respectively.

The mutual entropy of the two vectors is

$R_{i ℓ}^{2} (θ) = {\bar{S}}_{i} (θ) + {\bar{S}}_{ℓ} (θ) - {\bar{S}}_{i ℓ} (θ) .$ (10)

Note that the right side of this relation has the structure of the negative log transform of a squared correlation. Mutual entropy $R_{i ℓ}^{2} (θ)$ is small when the choices in two items are independent for a given $θ$ , but is close to ${\bar{S}}_{i} (θ)$ and ${\bar{S}}_{ℓ} (θ)$ when the responses of two items are highly related, even after conditioning on $θ$ . It is therefore a scalar measure of the joint variation between two multinomial item choices and a multinomial analogue of the covariance between two vectors. If a model fits the data well, one would expect $R_{i ℓ}^{2} (θ)$ to be small for all items, as the joint variation is captured by $θ$ .

Because of this, $R_{i ℓ}^{2} (θ)$ provides a way to check the local independence assumption in IRT, that the item responses are independent given $θ$ . Most other local independence tests are based on residual correlations, such as the $LD$ and $Q_{3}$ statistics implemented in the mirt package (Chalmers, 2012).
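Equation 10 can be evaluated directly from a joint probability table of two items. The following Python sketch uses base 2 as a common unit, which is an assumption made here because the two items may have different numbers of options; the two toy tables are invented to show the independent and the fully dependent extremes.

```python
import math


def entropy(probs, base=2.0):
    return sum(-p * math.log(p) / math.log(base) for p in probs if p > 0)


def mutual_entropy(joint, base=2.0):
    """Equation 10 for a joint table, joint[m][n] = P(option m on item i, option n on item l)."""
    row_marginal = [sum(r) for r in joint]
    col_marginal = [sum(c) for c in zip(*joint)]
    cells = [p for r in joint for p in r]
    return entropy(row_marginal, base) + entropy(col_marginal, base) - entropy(cells, base)


# Independent choices: the joint table is the product of its marginals.
independent = [[0.5 * 0.2, 0.5 * 0.8],
               [0.5 * 0.2, 0.5 * 0.8]]
# Fully dependent choices: the second choice repeats the first.
identical = [[0.5, 0.0],
             [0.0, 0.5]]

print(round(mutual_entropy(independent), 6), round(mutual_entropy(identical), 6))
```

In the independent table the mutual entropy vanishes, while in the fully dependent table it equals the entropy of either item.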

2.4 The Estimation Cycle

The proposed methodology uses an estimation cycle, whose technical details are described in this section.

2.4.1 Estimating the Scale Information Manifold Given Score Indices

We initialize the optimization by converting sum scores to rank percentages. The large number of tied sum scores inevitable in the use of integer-valued scores is removed by adding random values less than 0.5 to the scores in order to break up any possible dependencies in the data.
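One way to sketch this initialization in Python is shown below. It is not the TestGardener implementation; the function name, the uniform jitter on $[0, 0.5)$ , the fixed seed, and the rescaling of ranks to $[0, 100]$ are assumptions made for illustration, and at least two participants are required.

```python
import random


def percent_rank_index(sum_scores, seed=0):
    """Initial score index: percentage ranks of jittered sum scores."""
    rng = random.Random(seed)
    # Jitter below 0.5 breaks ties without reordering distinct integer scores.
    jittered = [y + 0.5 * rng.random() for y in sum_scores]
    order = sorted(range(len(jittered)), key=jittered.__getitem__)
    N = len(jittered)
    theta = [0.0] * N
    for rank, j in enumerate(order):
        theta[j] = 100.0 * rank / (N - 1)
    return theta


theta0 = percent_rank_index([3, 3, 0, 7, 3])
print(theta0)
```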

The first step in a cycle is to compute a smooth density estimate of the current score index values in order to construct bin boundaries for the current values of $θ$ that have roughly equal frequencies. The number of bins $n_{b}$ to use depends on the size of $N$ . In the R package TestGardener (Ramsay & Li, 2021), used in the later empirical illustration, the default numbers of bins are $n_{b} \approx N / 25$ for $N < 500$ , $n_{b} \approx N / 50$ for $500 \leq N < 10{,}000$ , and $n_{b} \approx 50$ for $N \geq 10{,}000$ . Within an item, bin proportions are computed for each option, and these proportions are then converted to surprisal values, except for zero proportions, which are replaced by a maximum surprisal value. Then the $n_{b}$ surprisal values for each of the $M_{i}$ choices are smoothed using a version of the least squares fitted smoothing splines that is adapted to the surprisal structure. Using seven basis functions of order five has provided enough flexibility in the surprisal curves to track the data without over-fitting them. The smallest possible number of spline basis functions is two, for which the order must also be two. After surprisal curves are estimated, the total arc length of each item surprisal curve is computed.
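The binning rule and the conversion of bin proportions to surprisal values might be sketched as follows. The rounding to the nearest integer and the cap `max_surprisal=4.0` are assumptions, since the text gives only approximate bin counts and no specific maximum value; the spline smoothing step is not shown.

```python
import math


def default_n_bins(N):
    """Approximate default number of bins as quoted in the text (rounding is assumed)."""
    if N < 500:
        return round(N / 25)
    if N < 10000:
        return round(N / 50)
    return 50


def bin_surprisals(counts, max_surprisal=4.0):
    """M-bit surprisal of each option within one bin.

    counts[m] is the number of respondents in the bin who chose option m.
    Empty cells receive max_surprisal, a hypothetical cap.
    """
    M, total = len(counts), sum(counts)
    return [max_surprisal if c == 0 else -math.log(c / total) / math.log(M)
            for c in counts]


print(default_n_bins(810))  # 16
print([round(s, 3) for s in bin_surprisals([10, 0, 40, 0])])
```

For $N = 810$ the rule gives 16 bins, which agrees with the 16 bins mentioned in the captions of Figures 4 and 5.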

Note that the “data” in this process, the binned surprisal values, vary with the choice of number of basis functions, since they are effectively adjusted to provide the best fits to the surprisal curves. This is because the surprisal curve values and participant score index values are jointly optimized. In this sense, the optimization process is more like canonical correlation analysis than principal component analysis.

2.4.2 Estimating Score Index Values Given the Scale Manifold

The negative log likelihood objective function $F_{j}$ for rater $j$ in terms of surprisal with respect to $θ$ is

$F_{j} (θ) = \sum_{i}^{n} \sum_{m}^{M_{i}} U_{jim} s_{im} (θ) .$ (11)

Since the indicator value $U_{jim}$ is zero except for the value one for the chosen option, $F_{j} (θ)$ is simply the sum of the surprisal values of the chosen option curves at score index $θ$ . An optimal value of $θ$ is the location at which $F_{j}$ attains its lowest value and its derivative $F_{j}^{'} (θ) = 0$ , that is, the score index value $θ$ at which

$\frac{d F_{j} (θ)}{d θ} = \sum_{i}^{n} \sum_{m}^{M_{i}} U_{jim} \frac{d s_{im} (θ)}{d θ} = 0 .$ (12)

A positive slope value pushes the current value of $θ$ downward, and a negative slope value pushes it upward. At the optimal $θ$ , the sums of the positive and negative slope values cancel each other out. It is therefore the slope value that carries the information about performance for tests and intensity for scales. It follows that slope values that are small in size provide only a vague definition of a rater’s $θ$ value, but high slope values define the location of $θ$ with much more precision.

Chosen-option surprisal derivatives act in this way as multiplicative weights. But choice data where there are multiple local minima are not rare, and any scale scoring method should include a screen for such cases, and then bring these to the attention of the scale scorer and perhaps also the rater.
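The scoring step and the screen for multiple minima can be sketched with a simple grid search over Equation 11. The surprisal curves below come from a made-up softmax model rather than from fitted splines, and all names and slope values are hypothetical. This toy model yields a convex $F_{j}$ , so it can never produce more than one minimum; the screen is included only to show where such a check would sit.

```python
import math


def item_surprisal(theta, slopes):
    """Hypothetical item: softmax option curves, returned as M-bit surprisals."""
    M = len(slopes)
    z = [b * (theta - 50.0) / 10.0 for b in slopes]
    top = max(z)
    log_total = top + math.log(sum(math.exp(v - top) for v in z))
    return [-(v - log_total) / math.log(M) for v in z]


def fit_theta(chosen, items, grid):
    """Grid minimizer of F_j plus the list of interior local minima.

    chosen[i] is the 0-based option picked on item i.
    """
    F = [sum(item_surprisal(t, s)[m] for s, m in zip(items, chosen)) for t in grid]
    interior_minima = [grid[k] for k in range(1, len(grid) - 1)
                       if F[k] < F[k - 1] and F[k] < F[k + 1]]
    best = min(range(len(grid)), key=F.__getitem__)
    return grid[best], interior_minima


items = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-0.5, 0.0, 0.5]]
grid = [0.5 * k for k in range(201)]  # score index values 0, 0.5, ..., 100

theta_mid, minima_mid = fit_theta([1, 1, 1], items, grid)
theta_top, minima_top = fit_theta([2, 2, 2], items, grid)
print(theta_mid, minima_mid, theta_top, minima_top)
```

A rater who always picks the middle option is placed at the center of the index, whereas a rater who always picks the highest option is pushed to the upper boundary, where no interior minimum exists. More than one entry in the list of interior minima would flag a response pattern for review.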

When cycling between surprisal and $θ$ estimation, in our experience, ten to twenty cycles are usually sufficient to define optimal surprisal and score index values. Spline smoothing of surprisal data to yield surprisal curves requires optimization, but the centered bi-linear structure of surprisal means that the response surface is only mildly nonquadratic, and the optimization usually terminates in two or three iterations. Consequently, the surprisal curve estimation process is very fast, on the order of a few seconds for thousands of respondents, using the R package TestGardener (Ramsay & Li, 2021). Note that, in the empirical illustration, models fitted with our proposed approach are referred to as TestGardener (TG) models, as they are obtained from the TestGardener R package.

3. Empirical Illustration

3.1 Two Depression Scales for the Same Patient Cohort

This study was a secondary analysis of de-identified data from the “Defining the Burden and Managing the Effects of Immune-mediated Inflammatory Disease” (IMID) study. The sample includes participants (N = 810) with one of the five diagnoses: inflammatory bowel disease, multiple sclerosis, rheumatoid arthritis, and depression or anxiety disorders. For more details about the IMID study, see the published study protocol (Marrie et al., 2018). All data were collected with Research Ethics Board approval and participant consent.

Two depression scales, Patient Health Questionnaire (PHQ-9; Kroenke et al., 2001) and PROMIS Emotional Distress: Depression (PROMIS-D; Schalet et al., 2016), were analyzed for the empirical illustration. PHQ-9 is a nine-item scale that scores each item from “0” (not at all) to “3” (nearly every day), that is, 36 options in total and a sum score range of 0 to 27. PROMIS-D is an eight-item scale that scores each item from “1” (Never) to “5” (Always), that is, 40 options in total and a sum score range of 8 to 40. As suggested by the name, PROMIS-D focuses on participants’ emotional distress, and each item starts with “I felt ….” PHQ-9, in contrast, covers nine DSM-IV criteria of clinically significant behavioral or psychological syndrome or pattern in individuals with depression.

Figure 3 shows the distributions of the sum score for these two scales and the associated depression categories using commonly used cut-values. Sum scores of both depression scales were right-skewed and more so in PROMIS-D. Depression categories based on these two scales were not always consistent.

Figure 3.

The frequencies of the observed sum score values of the two depression scales and frequencies of the associated depression categories. Panels (a), (b) show the sum score distributions of the PHQ-9 and PROMIS-D tests, respectively. Colors indicate the associated depression categories using commonly used cut-values; see legends. Panels (c), (d) show the relationship between PHQ-9 and PROMIS-D classified depression categories.

3.2 Statistical Analyses

We examined how well the data fitted the model using both surprisal and probability curves with both the proposed approach and the GRM. Further, we examined the scope of the items, which in this situation is the intensity of depression captured by the choices made within an item. We also calculated the mutual entropies of the items.

The results of the TG models shown in Section 4 were obtained using the R package TestGardener (Ramsay & Li, 2021), which also has a web-based version introduced in Li et al. (2019). The 3D plots were generated using the R package plotly (Sievert, 2020). The proposed approach was compared with the GRM using the R package mirt (Chalmers, 2012). Code for this analysis is available on GitHub: https://github.com/JuanLiOHRI/Manitoba-IMID. For access to the data, please contact the PIs of the IMID study.

4. Results

4.1 Comparing Data Fits at Two Resolutions and With GRM Generated Curves

Figures 4 and 5 present the surprisal (a) and probability (b) curves estimated for items in PHQ-9 and PROMIS-D, respectively. Within each panel, the fitted curves of the GRM (left), the 2-basis TG model (middle), and the 4-basis TG model (right) are compared. The 2-basis functions are tilted straight lines fitted to surprisal values over the score indices to allow better comparisons with the GRM (note that test information, i.e., arc length, is used as the x-axis here, so the 2-basis surprisal curves do not appear to be straight). The 2-basis results are closer to what would be seen in an analysis using the nominal model of Bock (1972) or the GRM.

Figure 4.

The option surprisal (a) and probability (b) curves for items in PHQ-9. The dots indicate the values within the 16 bins, which the surprisal curves are optimized to fit. The bold titles of each column indicate the associated model and its log-likelihood value for model fitting. The scope values in the titles of each plot are the lengths of the corresponding item surprisal curves and indicate the amount of information provided by choices within this item. The vertical dashed lines in each panel are at the five marker percentages.

Figure 5.

The option surprisal (a) and probability (b) curves for items in PROMIS-D. The dots indicate the values within the 16 bins, which the surprisal curves are optimized to fit. The bold titles of each column indicate the associated model and its log-likelihood value for model fitting. The scope values in the titles of each plot are the lengths of the corresponding item surprisal curves and indicate the amount of information provided by choices within this item. The vertical dashed lines in each panel are at the five marker percentages.

The 4-spline cubic basis functions are our preferred choice after considerable experimentation with other levels of resolution. The resulting surprisal curves have much more flexibility and are consequently able to fit the binned data values more closely, which is confirmed by their log-likelihood values being the largest compared with their GRM and 2-basis counterparts. Their slopes are higher and, therefore, able to delineate transitions from one choice to another more effectively. The five marker percentiles 5, 25, 50, 75, and 95 indicate the percentages of rater positions. We see that the 4-basis percentiles are located at quite different locations from those for the 2-basis model.

4.2 Understanding Items in These Two Depression Scales

Compared with items in PROMIS-D (Figure 5), PHQ-9 items (Figure 4) had larger variation in their scope values, meaning that certain items were less informative than others. In particular, item 9, “Thoughts that you would be better off dead or of hurting yourself in some way,” had the shortest scope (i.e., was the least informative) under both the 2-basis and 4-basis models, which was as expected: for this severe item, 84% of participants chose level 0 and only 6% chose level 2 or 3. Similarly, PHQ-9 item 8, “Moving or speaking so slowly that other people could have noticed? Or the opposite—being so fidgety or restless that you have been moving around a lot more than usual,” is another less informative item.

4.3 Item Surprisal Curve and Scale Surprisal Curve

The proposed 3D item surprisal curve, as in Figure 2, can be a useful visual tool for comparing items and scales. Unfortunately, it is not available for items with more than three options. Another useful plot shows item or scale scope values versus score indices, as in Figures 6 and 7, respectively. It is clear that for each item, the 4-basis curve is more complex and therefore longer than its 2-basis counterpart. This shows that the more flexible spline basis can result in a very large gain in the information provided by the choices in an item and explains again why the 4-spline cubic basis functions are our preferred choice. Once again, PHQ-9 items 8 and 9 stood out as the least informative items. For the interscale comparison, these two depression scales had similar scale scope values, with PROMIS-D being slightly more informative.

Figure 6.

The item surprisal curves. Panels (a, c) are for PHQ-9 items and panels (b, d) are for PROMIS-D items. Information/scope values in panels (a, b) are on the original 4-bits or 5-bits based on number of options of each scale. In panels (c, d), scope values were transformed to 2-bit for interscale comparison.

Figure 7.

The scale surprisal curves. Panels (a, c) are for PHQ-9 and panels (b, d) are for PROMIS-D. Information/scope values in panels (a, b) are on the original 4-bits or 5-bits based on number of options of each scale. In panels (c, d), scope values were transformed to 2-bit for inter-scale comparison.

4.4 Mutual Entropy Between Items

Here, we provide an example of how mutual entropy $R_{i ℓ}^{2} (θ)$ , as introduced in Section 2.3, can be used to check the local independence assumption in IRT. Note that $R_{i ℓ}^{2} (θ)$ is calculated for a given value of $θ$ , but can be approximated by grouping respondents with similar $θ$ estimates. Figure 8 shows the resulting mutual entropy matrices for each item pair after dividing the ordered $θ$ scores of the participants into five equally sized groups. For comparison, the first plot in each row shows the overall mutual entropy for the entire cohort. The diagonal numbers indicate the amount of entropy associated with the items by themselves, and the off-diagonal numbers indicate mutual entropies for pairs of different items.

Figure 8.

Mutual entropy between items in PHQ-9 and PROMIS-D. The first panel in each row shows the overall mutual entropy within the entire cohort, and the remaining panels show the local mutual entropy over participants within the indicated percentile ranges, using the corresponding score indices estimated with the 4-basis TG model.

The overall mutual entropy (first panel of each row) can be interpreted similarly to correlation. Recall that the items in PROMIS-D are all about emotional distress, whereas PHQ-9 covers multiple domains. Thus, it was not surprising to see that PROMIS-D items had higher overall mutual entropy values than PHQ-9 items. The local mutual entropy for participants within a given percentile range can be interpreted similarly to residual correlation. These values were much smaller than the corresponding overall mutual entropies, indicating local independence.

Because the off-diagonal mutual entropies are small in all $θ$ -grouped plots in Figure 8, the item responses appear to be close to independent given $θ$ . As expected, in the overall plots, a relatively large portion of the item entropies (the diagonals) is mutual (the off-diagonals), since the items are meant to measure the same underlying trait. However, when the participants are grouped by $θ$ as in the other plots, the off-diagonals are, for the most part, relatively small compared to the diagonals. This suggests that the $θ$ scores from each model capture most of the original item dependencies.

Note that when all participants within a given range chose the same option for an item (e.g., PHQ-9 item 9 in the 0–20% panel, and PROMIS-D item 4 in the 20%–40% panel), the self-entropy of that item and every mutual entropy involving it are 0. Of course, dividing the participants into smaller groups would result not only in even smaller mutual entropies between items but also in more uncertainty, as there would be fewer participants in each group.
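The grouping procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard Shannon mutual information $H(X) + H(Y) - H(X, Y)$ in base 2, which may differ in normalization from the $R_{i ℓ}^{2} (θ)$ of Section 2.4.1, and the function names, the 0-based option coding, and the `n_groups` argument are hypothetical.

```python
import numpy as np

def entropy_bits(counts):
    """Shannon entropy (base 2) of a table of counts; empty cells contribute 0."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information_matrix(responses, n_options):
    """responses: (N, n) integer array, options coded 0..n_options-1.

    Diagonal entries are item self-entropies; off-diagonal entries are
    pairwise mutual informations H(i) + H(l) - H(i, l).
    """
    n_items = responses.shape[1]
    out = np.zeros((n_items, n_items))
    for i in range(n_items):
        for l in range(i, n_items):
            # Cross-tabulate the two items.
            joint = np.zeros((n_options, n_options))
            np.add.at(joint, (responses[:, i], responses[:, l]), 1)
            h_i = entropy_bits(joint.sum(axis=1))
            h_l = entropy_bits(joint.sum(axis=0))
            out[i, l] = out[l, i] = h_i + h_l - entropy_bits(joint)
    return out

def grouped_matrices(responses, theta, n_options, n_groups=5):
    """Split participants into equally sized groups by ordered theta
    and compute one matrix per group."""
    order = np.argsort(theta)
    return [mutual_information_matrix(responses[g], n_options)
            for g in np.array_split(order, n_groups)]

# Synthetic data only, to show the call pattern (not the clinical data).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 100, size=500)
responses = rng.integers(0, 4, size=(500, 9))
overall = mutual_information_matrix(responses, 4)
local = grouped_matrices(responses, theta, 4)
```

In this sketch, an item on which every participant in a group chooses the same option has a self-entropy of 0, and all of its mutual informations with other items are 0 as well, which matches the behavior noted for PHQ-9 item 9 and PROMIS-D item 4.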

5. Discussion and Conclusions

Probability and surprisal/information are two lenses through which we observe two different aspects of a single entity. Probability displays how often something happens at each $θ$ level, and the surprisal scope measures how large the information content is. In this article, we have illustrated that rating scales also measure, via the scope, the quality and power of the set of choices that constitute an item by the lengths of their information curves. As was shown in the empirical illustration, two different rating scales can also be compared through their scopes. Scope can also be computed over an interval of interest, such as the subinterval containing the top 25% of raters identified in the item and scale information curves. This should, however, be examined further in the future.

The item characteristic curves in our example were placed on the arc length scale to facilitate comparisons. Arc length is an objective measure that allows direct comparisons, and it has been used previously with rating scales by Ramsay et al. (2020), although without a comparison with the GRM. Arc length was also used by Wallmark et al. (2023), who took an approach similar to the one proposed here but with data that generate partial credits. The overall conclusion from our comparison with the GRM is in line with the conclusions of Wallmark et al. (2023): the proposed approach allows for more flexibility in the curves, and arc length is a useful tool when comparing models from different frameworks.
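To make the arc length idea concrete, the following Python sketch computes the cumulative arc length of an item surprisal curve evaluated on a grid of $θ$ values. It is a numerical illustration under stated assumptions, not the TestGardener algorithm: surprisal is taken to be $-\log_{M} P$ for an item with $M$ options, the curve is approximated by straight segments between grid points, and the function names and the logistic example item are hypothetical.

```python
import numpy as np

def surprisal(prob, n_options):
    """Surprisal -log_M(P), with M = n_options (assumed definition)."""
    return -np.log(prob) / np.log(n_options)

def arc_length(curves):
    """Cumulative arc length along a curve sampled on a grid.

    curves: (n_grid, M) array, one column per option.
    Returns an array of length n_grid starting at 0; the last value
    approximates the total length (the scope) of the curve.
    """
    steps = np.linalg.norm(np.diff(curves, axis=0), axis=1)
    return np.concatenate([[0.0], np.cumsum(steps)])

# Hypothetical two-option item with logistic option probabilities.
theta = np.linspace(-3.0, 3.0, 201)
p1 = 1.0 / (1.0 + np.exp(-theta))
prob = np.column_stack([1.0 - p1, p1])
scope_along_theta = arc_length(surprisal(prob, 2))
```

A finer grid gives a closer approximation to the true length. A more flexible curve that bends more between the same endpoints has a larger final value, which is the sense in which a longer curve reflects more information.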

A potential advantage of our model is that it allows for modeling very different kinds of data, and here we illustrated it for two different scales. Another advantage is that it uses the information from all response options, which is feasible now that computers are fast. More research is needed, however, to examine the proposed approach’s full potential. Information can be redundant in the sense that some of the choices for two items essentially describe the same thing. The proposed concept of mutual entropy can identify redundancy over the entire scale using plots such as the leftmost ones in Figure 8, and it is therefore a useful measure in addition to correlational measures. At the same time, mutual entropy can be used to test for local independence and model fit by grouping respondents over subintervals along the information curve, as also shown in Figure 8.

Another advantage of the proposed approach is that information, from a mathematical perspective, is a measure of structure in choice data. Confidence in the depression scales for patients depends directly on the face value of the item texts and their choices, on the weights placed on the choices, and on the sum scores that initialize the analyses applied to these choice data. A long item information curve is better than a short one, and the same holds for the entire scale. This was shown in the empirical illustration where we compared two scales. Another potential advantage is that the proposed approach allows us to model the data even when the data do not fit a parametric IRT model. This topic should, however, be examined in more depth in the future.

In this study, the parameters of the GRM were estimated using marginal maximum likelihood (MML; Bock & Aitkin, 1981) with a normally distributed $θ$ scale implicitly assumed. Our proposed estimation algorithm, as described in Section 2.4, can be viewed as a version of joint maximum likelihood (Baker & Kim, 2004, Chapter 4), in which the abilities and item curves are estimated repeatedly over a number of iterations and no distributional form is assumed for the index scale. To explore whether relaxing the distributional assumption on the index scale, as done here, yields a large improvement in model fit, an interesting topic for future studies would be to compare the proposed algorithm to MML with, for example, a uniformly distributed index and the same spline flexibility.

Our proposed approach has some limitations. One is that we only examined $θ$ in a single dimension. For future research, one can ask whether the use of a single dimension, $θ$ , makes sense. Or is it possible that depression within a hospital arises from more than one source, so that distinguishing between causes of depression can benefit its treatment? To address this question, we need to allow for higher-dimensional score indices. The mathematical framework that we propose in Section 2.2 is relatively easy to extend to score index vectors $θ$ of length two and higher, and this is something we plan to pursue in the future. We also gave only one empirical illustration with two different scales; future work should examine several different scales and continue to compare the approach with the GRM.

Footnotes

Acknowledgements

The data used to illustrate the application of the methodology were obtained from a study funded by the Canadian Institutes of Health Research (THC-135234), Crohn’s and Colitis Canada, and the Waugh Family Chair in Multiple Sclerosis (to RAM). Dr Bernstein is supported in part by the Bingham Chair in Gastroenterology. Members of the CIHR Team in Defining the Burden and Managing the Effects of Psychiatric Comorbidity in Chronic Immunoinflammatory Disease are Ruth Ann Marrie, James M. Bolton, Jitender Sareen, John R. Walker, Scott B. Patten, Alexander Singer, Lisa M. Lix, Carol A. Hitchon, Renée El-Gabalawy, Alan Katz, John D. Fisk, Charles N. Bernstein, Lesley Graff, Lindsay Berrigan, Ryan Zarychanski, Christine Peschken, and James Marriott. The authors acknowledge the use of Shared Health facilities during data collection. This study was a secondary analysis of de-identified data from the “Defining the Burden and Managing the Effects of Immune-mediated Inflammatory Disease” (IMID) study. These clinical data are not publicly accessible. To access the data, please contact the cohort PIs: Ruth Ann Marrie (RuthAnn.Marrie@dal.ca) and Charles Bernstein (Charles.Bernstein@umanitoba.ca). Data may be made available by the PIs to qualified investigators with the appropriate ethical approvals and data use agreements.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The methodology research was funded by the Swedish Research Council 2022-02046 and the Swedish Wallenberg grant MMW 2019.0129 with PI Marie Wiberg.

ORCID iDs

Joakim Wallmark

Marie Wiberg

Authors

JAMES O. RAMSAY is a Professor Emeritus in the Department of Psychology and an associate member of the Department of Mathematics and Statistics at McGill University, 2748 Howe St., Ottawa, Ontario, K2B 6W9, Canada. E-mail: james.ramsay@mcgill.ca. His research interests are psychometrics, functional data analysis, dynamic systems identification, and spatial data analysis.

JUAN LI is a Senior Clinical Research Associate at the Neuroscience Program and Clinical Epidemiology Program, Ottawa Hospital Research Institute, 451 Smyth Road #1442, Ottawa Ontario K1H 8M5, Canada. E-mail: juli@ohri.ca. Her research interests include predictive modelling, machine learning, psychometrics, and Parkinson’s disease.

CHARLES N. BERNSTEIN is a Distinguished Professor of Medicine and Bingham Chair in Gastroenterology at the Max Rady College of Medicine, Rady Faculty of Health Sciences and Director of the University of Manitoba IBD Clinical and Research Centre, 804-715 McDermot Avenue, Winnipeg, Manitoba, Canada, R3E3P4. E-mail: charles.bernstein@umanitoba. His main research interests are translational and epidemiological studies into the burden, cause, and clinical outcomes of inflammatory bowel disease.

RUTH ANN MARRIE is a Professor of Medicine at Dalhousie University, Canada. E-mail: ruthann.marrie@dal.ca. Her research interests include multiple sclerosis and comorbidity.

JOAKIM WALLMARK is a researcher at the Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, SE-90187 Umeå, Sweden. E-mail: joakim.wallmark@umu.se. His research interests are statistical modelling, machine learning, statistical software and psychometrics.

MARIE WIBERG is a professor at the Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, SE-901 87 Umeå, Sweden. E-mail: marie.wiberg@umu.se. Her research interests include statistical modeling and psychometrics, especially test equating and parametric and nonparametric item response theory.

References

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. CRC Press; Taylor & Francis Group.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more latent categories. Psychometrika, 37, 29–51.

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459.

Chalmers, R. P. (2012). MIRT: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29.

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience.

Emons, W. H. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32(3), 224–247.

Falk, C. F., & Cai (2016). Maximum marginal likelihood estimation of a monotonic polynomial generalized partial credit model with applications to multiple group analysis. Psychometrika, 81(2), 434–460.

Kroenke, Spitzer, R. L., & Williams, J. B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.

Kullback (1959). Information theory and statistics. Wiley.

Ramsay, J. O., & Wiberg (2019). TestGardener: A program for optimal scoring and graphical analysis. In Wiberg, Culpepper, Janssen, González, & Molenaar (Eds.), Quantitative psychology. IMPS 2017 (Vol. 265). Springer Proceedings in Mathematics and Statistics. Springer.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. Wiley.

Marrie, R. A., Graff, Walker, J. R., Fisk, J. D., Patten, S. B., Hitchon, C. A., Lix, L. M., Bolton, Sareen, Katz, Berrigan, L. I., Marriott, J. J., Singer, El-Gabalawy, Peschken, C. A., Zarychanski, & Bernstein, C. N. (2018). Effects of psychiatric comorbidity in immune-mediated inflammatory disease: Protocol for a prospective study. JMIR Research Protocols, 7(1), e15.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.

Molenaar, I. W. (1997). Nonparametric models for polytomous responses. In van der Linden, W. J., & Hambleton, R. K. (Eds.), Handbook of modern item response theory (pp. 369–380). Springer.

Ramsay, J. O., Hooker, & Graves (2009). Functional data analysis with R and Matlab. Springer.

Ramsay, J. O. (2021). TestGardener: Optimal analysis of test and rating scale data (R package version 2.0.1). https://CRAN.R-project.org/package=TestGardener

Ramsay, J. O., & Silverman, B. W. (2005). Functional data analysis. Springer.

Ramsay, J. O., & Wiberg (2020a). Full information optimal scoring. Journal of Educational and Behavioral Statistics, 45, 297–315.

Ramsay, J. O., & Wiberg (2020b). Better rating scale scores with information-based psychometrics. Psych, 2(4), 347–369.

Ramsay, J. O., & Wiberg (2017). A strategy to replace sum scoring. Journal of Educational and Behavioral Statistics, 42, 282–307.

Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34, 1–97.

Schalet, B. D., Pilkonis, P. A., Dodds, Johnston, K. L., Yount, Riley, & Cella (2016). Clinical validity of PROMIS depression, anxiety, and anger across diverse clinical samples. Journal of Clinical Epidemiology, 73, 119–127.

Shannon (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.

Sievert (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 107, 677–680.

Stochl, Jones, P. B., & Croudace, T. J. (2012). Mokken scale analysis of mental health and well-being questionnaire item responses: A non-parametric IRT method in empirical research for applied health researchers. BMC Medical Research Methodology, 12(1), 1–16.

Stone, J. V. (2022). Information theory: A tutorial introduction (2nd ed.). Sebtel Press.

Tribus (1961). Thermodynamics and thermostatistics: An introduction to energy, information and states of matter, with engineering applications. D. van Nostrand.

Wallmark, Ramsay, J. O., & Wiberg (2023). Analyzing polytomous test data: A comparison between an information-based IRT model and the generalized partial credit model. Journal of Educational and Behavioral Statistics, 49(5), 753–759. http://doi.org/10.3102/10769986231207879

Wiberg & Ramsay, J. O. (2019). Optimal scores: An alternative to parametric item response theory and sum scores. Psychometrika, 84, 310–322.