Abstract
1. Introduction
Self-reported rating scales are a mainstay for communicating people’s experiences to their medical teams. They capture aspects of a participant’s experience that cannot be measured by clinical or performance-based measures and are therefore essential in evaluating new therapies and gauging disease status. As clinical and research tools, scales must be critically evaluated for their accuracy and internal consistency, and the evaluation should provide a wide range of graphical and numerical outputs that can lead to improved items and in-depth assessments of what the scale is designed to measure.
The usual practice is to compute the
A disadvantage of sum scores is that all items contribute equally to the overall score and thus implicitly to the underlying trait the test aims to measure. Instead of sum scores, item response theory (IRT) models, with the exception of the Rasch model, aim to produce more sophisticated scores that are less subject to various biases and have higher accuracy (see Wiberg et al., 2019). IRT models use a multinomial response vector function
A nonparametric IRT model for polytomously scored items was introduced by Molenaar (1997), followed by other nonparametric/semiparametric approaches. For example, Emons (2008) studied the effectiveness of generalizations of nonparametric person-fit statistics to polytomous item response data. Further, Stochl et al. (2012) studied the Mokken scale with polytomous rating scale health data coded as 1–2–3–4, but recoded them into 0–0–1–1, that is, making the items binary before analyzing the data. Falk and Cai (2016) compared the monotonic polynomial method to nonparametric and semiparametric alternatives.
Having reviewed these nonparametric and semiparametric developments in modeling polytomous item responses, we now turn to the semiparametric approach that forms the basis of our current study. We build on earlier work that employs data smoothing methods developed in Ramsay and Silverman (2005) and Ramsay et al. (2009). Previous research in this area has mainly focused on the analysis of data from multiple choice tests (Li et al., 2019; Ramsay & Wiberg, 2017; Ramsay et al., 2020a; Wiberg et al., 2019). Recently, Wallmark et al. (2023) proposed to model scores with information theory and concluded that their method was a reasonable alternative to the generalized partial credit model. Although Ramsay et al. (2020) examined rating scales, our research differs from that study in several important ways. First, we compare the proposed approach with the GRM, which has not been done before. This is important because the GRM is typically used in these situations. Second, we utilize the scope metric, as described in Section 2.2, to compare the efficiency of the models fitted to datasets from different depression questionnaires. The metric properties of the scope allow for direct comparisons that cannot be done with any other metrics. Third, we introduce mutual entropy as a measure of the amount of information shared by two items. Mutual entropy is particularly useful for evaluating the common IRT assumption of local independence, which states that the item responses are independent given
The rest of the article is structured as follows: The Methods part comprises Section 2.1, in which the GRM is briefly described as it is used for comparison in the empirical study. In Section 2.2, we develop some useful tools for surprisal-based IRT, including
2. Methods
2.1 The Graded Response Model
The GRM (Samejima, 1969) is a parametric IRT model for items that are scored in more than two ordered categories, which includes rating scales and Likert scales. Using the indices defined above, the GRM defines the probability of a randomly chosen participant with ability
where
2.2 Surprisal and Probability IRT
Information theory was originally defined by Shannon (1948); two recent resources are Cover and Thomas (2006) and Stone (2022). In information theory, the terms information content, self-information, surprisal, and Shannon information are used interchangeably for a basic quantity derived from the probability of a particular event of a random variable. Here we will use the term
within contexts where no probability will be exactly 0. The surprisal transformation is used in the definitions of the log-odds transform, negative log-likelihood, divergence (Kullback, 1959), and in the theory of choice in mathematical psychology (Luce, 1959). Surprisal and probability can thus be seen as two different ways of looking at an option choice. But surprisal has an important advantage; it is a measure that has what Stevens (1946) called a
As seen in Equation 2, surprisal is defined as the negative logarithmic transformation of probability without a particular base and can be interpreted as quantifying the level of “surprise” of a particular event. Surprisal and probability are linked together, and by using either of them, we are viewing choice from two complementary lenses, each having its own interpretation. In order to further explain surprisal, assume a roulette wheel with
For the more general situation, where probability is a function of a latent variable
For the methods introduced here, the role of
For simplicity and illustration purposes, the discussion below will focus on

Example of option probability curves of two items (a), and their corresponding item probability curves in the probability manifold (b, c): two views of the probability manifold.

Example of option surprisal curves of two items (a), and their corresponding item surprisal curves in the surprisal manifold (b, c): two views of the surprisal manifold.
From Figure 1a, it can be seen that item A, with three active options, is more informative than item B, where only two options are active and option 1 is the most popular one. In the 3D plots (Figure 1b and c), the probabilities of the three options of each item must satisfy the following conditions:
the set of three nonsingular probability vectors
Figure 2b and c display two views of the curved surprisal manifold that is defined by the following condition:
The manifold is in the positive part of the three-dimensional space, is bounded below by 0, is unbounded above, and has the shape of the interior of a soup bowl. The point in the surface nearest to point
The slope of surprisal,
where the variable of integration
For the whole scale, a
It measures the amount of information in
Since surprisal curves do not change with modifications of the score indexing system, arc length along this curve is the ideal representation of the amount of information collected by each item or the whole scale. Using arc length, two researchers working with different score index configurations can compare their results directly, and two equal length intervals are always comparable.
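As a rough numerical illustration of this invariance, the sketch below approximates the arc length of an item surprisal curve on a grid and repeats the calculation after a monotone re-indexing of the score scale. The three-option item is hypothetical (a softmax of linear functions), the logarithm base is assumed to equal the number of options, and the function names are ours rather than those of any package.

```python
import numpy as np

def surprisal(prob, base):
    """Negative logarithm of probability in the given base (M-bits when base = M)."""
    return -np.log(prob) / np.log(base)

def arc_length(curves):
    """Polygonal approximation to the length of a curve in M-dimensional space.

    curves has shape (n_grid, M): one row per grid point, one column per option.
    """
    steps = np.diff(curves, axis=0)              # increments between grid points
    return float(np.sum(np.sqrt(np.sum(steps ** 2, axis=1))))

def hypothetical_item(theta):
    """Made-up three-option item: softmax of linear functions of theta."""
    logits = np.column_stack(
        [3.0 - 0.06 * theta, np.zeros_like(theta), 0.06 * theta - 3.0]
    )
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

# Length of the item surprisal curve under the original score index.
theta = np.linspace(0.0, 100.0, 1001)
length_original = arc_length(surprisal(hypothetical_item(theta), base=3))

# A monotone re-indexing visits the same points of the curve at a different
# pace, so the length agrees up to discretization error.
theta_warped = 100.0 * (theta / 100.0) ** 2
length_warped = arc_length(surprisal(hypothetical_item(theta_warped), base=3))
```

The two lengths differ only through the grid approximation, which is the sense in which researchers using different score index configurations can compare arc lengths directly.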
2.3 Mutual Entropy: A Measure of Inter-Item Dependence
The intimate relationship between information theory and probability theory is especially evident in the core concept of the
We use the notation
The
where
The
Note that the right side of this relation has the structure of the negative log transform of a squared correlation. Mutual entropy
Because of this,
2.4 The Estimation Cycle
The proposed methodology uses an estimation cycle, whose technical details are described in this section.
2.4.1 Estimating the Scale Information Manifold Given Score Indices
We initialize the optimization by converting sum scores to rank percentages. The large number of tied sum scores inevitable in the use of integer-valued scores is removed by adding random values less than 0.5 to the scores in order to break up any possible dependencies in the data.
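The initialization can be sketched as follows. This is a minimal illustration under our own assumptions: the jitter is drawn uniformly from [0, 0.5), and ranks are divided by N + 1 so that the percentages stay strictly inside (0, 100). Neither detail is specified above, and the function name is hypothetical.

```python
import numpy as np

def initial_score_index(sum_scores, rng):
    """Jitter tied sum scores, then convert their ranks to percentages."""
    scores = np.asarray(sum_scores, dtype=float)
    # Assumption: uniform jitter in [0, 0.5). Integer scores differ by at
    # least 1, so the jitter breaks ties without reordering distinct scores.
    jittered = scores + rng.uniform(0.0, 0.5, size=scores.shape)
    ranks = np.argsort(np.argsort(jittered)) + 1     # ranks 1..N
    # Assumption: divide by N + 1 to keep percentages away from 0 and 100.
    return 100.0 * ranks / (scores.size + 1)

example_scores = [3, 3, 0, 27, 10, 3]                # three tied sum scores
example_index = initial_score_index(example_scores, np.random.default_rng(0))
```

After this step every participant has a distinct starting value, and participants with different sum scores keep their original order.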
The first step in a cycle is to compute a smooth density estimate of the current score index values in order to construct bin boundaries for the current values of
Note that the “data” in this process, the binned surprisal values, vary with the choice of the number of basis functions, since they are effectively adjusted to provide the best fits to the surprisal curves. This is because the surprisal curve values and participant score index values are jointly optimized. In this sense, the optimization process is more like canonical correlation analysis than principal component analysis.
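To make the notion of binned surprisal values concrete, the sketch below bins participants by their current score index and converts the within-bin option proportions of one item to surprisal. It simplifies the procedure in two ways that are our assumptions: bin boundaries come from empirical quantiles rather than a smooth density estimate, and a small pseudo-count keeps every proportion away from 0. The simulated data and the function name are hypothetical.

```python
import numpy as np

def binned_surprisal(score_index, choices, n_options, n_bins=16, pseudo_count=0.5):
    """Surprisal (base n_options) of option proportions within score-index bins."""
    score_index = np.asarray(score_index, dtype=float)
    choices = np.asarray(choices, dtype=int)             # options coded 0..n_options-1
    # Assumption: equal-count bins from empirical quantiles.
    edges = np.quantile(score_index, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.searchsorted(edges[1:-1], score_index, side="right")   # 0..n_bins-1
    counts = np.zeros((n_bins, n_options))
    np.add.at(counts, (bins, choices), 1.0)              # option counts per bin
    # Assumption: a pseudo-count prevents proportions of exactly 0.
    smoothed = counts + pseudo_count
    props = smoothed / smoothed.sum(axis=1, keepdims=True)
    centers = np.array([score_index[bins == b].mean() for b in range(n_bins)])
    return centers, -np.log(props) / np.log(n_options)

rng = np.random.default_rng(1)
score_index = rng.uniform(0.0, 100.0, size=800)
# Hypothetical four-option item: higher score index values favour higher options.
noisy_level = np.floor(score_index / 25.0 + rng.normal(0.0, 0.7, size=800))
choices = np.clip(noisy_level, 0, 3).astype(int)
centers, S = binned_surprisal(score_index, choices, n_options=4)
```

Each row of `S` holds the surprisal values for one bin; raising the base to the power of minus these values and summing across options returns 1, which is the constraint that places the values on the surprisal manifold.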
2.4.2 Estimating Score Index Values Given the Scale Manifold
The negative log likelihood objective function
Since the indicator value
A positive slope value pushes the current value of
Chosen-option surprisal derivatives act in this way as multiplicative weights. But choice data where there are multiple local minima are not rare, and any scale scoring method should include a screen for such cases, and then bring these to the attention of the scale scorer and perhaps also the rater.
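A screen of this kind can be sketched on a grid. The code below treats the negative log likelihood of one participant as the sum of the surprisal curves of the options that participant chose and flags every grid point that is lower than its neighbours. The two chosen-option curves are invented so that the items pull the score index in opposite directions; they are not estimates from any dataset, and the function names are ours.

```python
import numpy as np

def neg_log_lik(chosen_surprisal_curves):
    """Sum over items of the surprisal curve of the chosen option.

    chosen_surprisal_curves has shape (n_items, n_grid).
    """
    return np.sum(chosen_surprisal_curves, axis=0)

def local_minima(values):
    """Indices of grid points strictly lower than both neighbours.

    Endpoints are compared with their single neighbour only.
    """
    v = np.asarray(values, dtype=float)
    padded = np.concatenate(([np.inf], v, [np.inf]))
    is_min = (padded[1:-1] < padded[:-2]) & (padded[1:-1] < padded[2:])
    return np.flatnonzero(is_min)

theta = np.linspace(0.0, 100.0, 1001)
# Hypothetical chosen-option surprisal curves for two items: one choice is
# least surprising near 20, the other near 80.
item_1 = 0.2 + 2.0 * (1.0 - np.exp(-((theta - 20.0) / 12.0) ** 2))
item_2 = 0.2 + 2.0 * (1.0 - np.exp(-((theta - 80.0) / 12.0) ** 2))

objective = neg_log_lik(np.vstack([item_1, item_2]))
minima = local_minima(objective)
needs_review = minima.size > 1       # flag for the scale scorer
```

With these invented curves the objective has one minimum near 20 and another near 80, so the participant would be flagged. A grid search is only one possible screen; it is used here because it does not depend on the starting value of an optimizer.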
When cycling between surprisal and
3. Empirical Illustration
3.1 Two Depression Scales for the Same Patient Cohort
This study was a secondary analysis of de-identified data from the “Defining the Burden and Managing the Effects of Immune-mediated Inflammatory Disease” (IMID) study. The sample includes participants (
Two depression scales, the Patient Health Questionnaire (PHQ-9; Kroenke et al., 2001) and PROMIS Emotional Distress: Depression (PROMIS-D; Schalet et al., 2016), were analyzed for the empirical illustration. PHQ-9 is a nine-item scale that scores each item from “0” (not at all) to “3” (nearly every day), that is, 36 options in total with a sum score range of 0 to 27. PROMIS-D is an eight-item scale that scores each item from “1” (Never) to “5” (Always), that is, 40 options in total with a sum score range of 8 to 40. As suggested by the name, PROMIS-D focuses on participants’ emotional distress, and each item starts with “I felt ….” While PHQ-9 covers nine
Figure 3 shows the distributions of the sum score for these two scales and the associated depression categories using commonly used cut-values. Sum scores of both depression scales were right-skewed and more so in PROMIS-D. Depression categories based on these two scales were not always consistent.

The frequencies of the observed sum score values of the two depression scales and frequencies of the associated depression categories. Panels (a) and (b) show the sum score distributions of the PHQ-9 and PROMIS-D scales, respectively. Colors indicate the associated depression categories using commonly used cut-values (see legends). Panels (c) and (d) show the relationship between the depression categories assigned by PHQ-9 and PROMIS-D.
3.2 Statistical Analyses
We examined how well the data fitted the model using both surprisal and probability curves with both the proposed approach and the GRM. Further, we examined the scope of the items, which in this situation is the intensity of depression captured by the choices made within an item. We also calculated the mutual entropies of the items.
The results of the TG models shown in Section 4 were obtained using the R package
4. Results
4.1 Comparing Data Fits at Two Resolutions and With GRM Generated Curves
Figures 4 and 5 present the surprisal (a) and probability (b) curves estimated for items in PHQ-9 and PROMIS-D, respectively. Within each panel, the fitted curves of the GRM (left), the 2-basis TG model (middle), and the 4-basis TG model (right) are compared. The 2-basis functions are tilted straight lines fitted to the surprisal values over the score indices to allow better comparisons with the GRM (note here test information (i.e., arc length) is used as the

The option surprisal (a) and probability (b) curves for items in PHQ-9. The dots indicate the values within the 16 bins, which the surprisal curves are optimized to fit. The bold title of each column indicates the associated model and its log-likelihood value. The scope values in the titles of each plot are the lengths of the corresponding item surprisal curves and indicate the amount of information provided by choices within this item. The vertical dashed lines in each panel are at the five marker percentages.

The option surprisal (a) and probability (b) curves for items in PROMIS-D. The dots indicate the values within the 16 bins, which the surprisal curves are optimized to fit. The bold title of each column indicates the associated model and its log-likelihood value. The scope values in the titles of each plot are the lengths of the corresponding item surprisal curves and indicate the amount of information provided by choices within this item. The vertical dashed lines in each panel are at the five marker percentages.
The 4-spline cubic basis functions are our preferred choice after considerable experimentation with other levels of resolution. The resulting surprisal curves have much more flexibility and consequently fit the binned data values more closely, which is confirmed by their log-likelihood values being the largest when compared with their GRM and 2-basis counterparts. Their slopes are higher and are therefore able to delineate transitions from one choice to another more effectively. The five marker percentiles 5, 25, 50, 75, and 95 indicate the percentages of rater positions. The 4-basis percentiles are located at quite different positions from those for the 2-basis model.
4.2 Understanding Items in These Two Depression Scales
Compared with items in PROMIS-D (Figure 5), PHQ-9 items (Figure 4) had larger variation in their scope values, meaning that certain items were less informative than the others. In particular, item 9
4.3 Item Surprisal Curve and Scale Surprisal Curve
The proposed 3D item surprisal curve, as in Figure 2, can be a useful visual tool to compare items and scales. Unfortunately, it is not available for items with more than three options. Another useful plot shows item or scale scope values versus score indices, as in Figures 6 and 7, respectively. For each item, the 4-basis curve is more complex and therefore longer than its 2-basis counterpart. This shows that the more flexible spline basis can result in a very large gain in the information provided by the choices in an item and explains again why the 4-spline cubic basis functions are our preferred choice. Once again, PHQ-9 items 8 and 9 stood out as the least informative items. For the interscale comparison, these two depression scales had similar scale scope values, with PROMIS-D being slightly more informative.

The item surprisal curves. Panels (a, c) are for PHQ-9 items and panels (b, d) are for PROMIS-D items. Information/scope values in panels (a, b) are in the original 4-bit or 5-bit units, according to the number of options of each scale. In panels (c, d), scope values were transformed to 2-bit units for interscale comparison.

The scale surprisal curves. Panels (a, c) are for PHQ-9 and panels (b, d) are for PROMIS-D. Information/scope values in panels (a, b) are in the original 4-bit or 5-bit units, according to the number of options of each scale. In panels (c, d), scope values were transformed to 2-bit units for interscale comparison.
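The change of units used for the interscale comparison follows from the change-of-base rule for logarithms. A minimal sketch, assuming the original values use a logarithm base equal to the number of options per item (4 for PHQ-9 and 5 for PROMIS-D); the function name is ours.

```python
import math

def to_2_bits(value_in_m_bits, n_options):
    """Convert surprisal or scope from M-bits (log base M) to 2-bits.

    Because -log_M(P) = -log_2(P) / log_2(M), multiplying an M-bit value
    by log_2(M) expresses it in 2-bits.
    """
    return value_in_m_bits * math.log2(n_options)

one_4_bit = to_2_bits(1.0, 4)     # one 4-bit equals two 2-bits
one_5_bit = to_2_bits(1.0, 5)     # one 5-bit equals log2(5) 2-bits
```

Because the conversion is a constant rescaling, it changes the units of the curves but not their shapes, and values from the two scales become directly comparable once both are in 2-bits.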
4.4 Mutual Entropy Between Items
Here, we provide an example of how mutual entropy

Mutual entropy between items in PHQ-9 and PROMIS-D. The first panel in each row shows the overall mutual entropy within the entire cohort, and the remaining panels show the local mutual entropy over participants within certain percentile ranges, using the corresponding score indices estimated with the 4-basis TG model.
The overall mutual entropy (first panel of each row) can be interpreted similarly to correlation. Recall that items in PROMIS-D are all about emotional distress and PHQ-9 covers multiple domains. Thus, it was not surprising to see that PROMIS-D items had higher overall mutual entropy values than PHQ-9 items. The local mutual entropy for participants within a certain percentile range could be interpreted similarly to residual correlation, where the values were much smaller than the corresponding overall mutual entropy, indicating local independence.
In this case, since the off-diagonal mutual entropies are small for all
Note that when all participants within a certain range chose the same option for an item (e.g., PHQ-9 item 9 in the 0–20% panel, and PROMIS-D item 4 in the 20%–40% panel), the corresponding self-entropy and the mutual entropies involving that item will be 0. Of course, grouping the test takers into smaller groups would result not only in even smaller mutual entropies between items but also in more uncertainty, as there would be fewer participants in each group.
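The zero values for constant items can be reproduced with a small calculation. The sketch below uses the standard Shannon mutual information, H(X) + H(Y) − H(X, Y), computed from the joint frequency table of two items; we assume this corresponds to the mutual entropy of Section 2.3, and the function names and toy response vectors are hypothetical.

```python
import numpy as np

def entropy(probs, base=2.0):
    """Shannon entropy of a probability array, skipping zero cells."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(base))

def mutual_information(x, y, n_options, base=2.0):
    """H(X) + H(Y) - H(X, Y) from the joint table of two items' choices."""
    x = np.asarray(x, dtype=int)                 # options coded 0..n_options-1
    y = np.asarray(y, dtype=int)
    joint = np.zeros((n_options, n_options))
    np.add.at(joint, (x, y), 1.0)                # cross-tabulate the choices
    joint /= joint.sum()
    h_x = entropy(joint.sum(axis=1), base)
    h_y = entropy(joint.sum(axis=0), base)
    return h_x + h_y - entropy(joint.ravel(), base)

varied = np.tile(np.arange(4), 50)               # all four options used equally
constant = np.zeros(200, dtype=int)              # everyone chose the same option

mi_with_itself = mutual_information(varied, varied, n_options=4)
mi_with_constant = mutual_information(varied, constant, n_options=4)
```

An item on which every participant in the group chose the same option has zero entropy, so it cannot share information with any other item, whereas an item paired with itself returns its own entropy.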
5. Discussion and Conclusions
Probability and surprisal/information are two lenses through which we observe two different aspects of a single entity. Probability displays how often something happens at each
The item characteristic curves in our example were placed on the arc length scale in order to facilitate comparisons. Arc length is an objective measure that allows direct comparisons. It was used previously with rating scales by Ramsay et al. (2020), although without a comparison with the GRM. Arc length was also used by Wallmark et al. (2023) with an approach similar to the one we propose here, but with data that generate partial credits. The overall conclusion from our comparison with the GRM is in line with the conclusions of Wallmark et al. (2023): the proposed approach allows for more flexibility in the curves, and arc length is a useful tool when comparing models from different frameworks.
A potential advantage of our model is that it allows for modeling very different kinds of data; here, we illustrated it for two different scales. Another advantage is the use of information from all response options, which is feasible now that computers are fast, although more research is needed to examine the proposed approach’s full potential. Information can be redundant in the sense that some of the choices for two items are essentially describing the same thing. The proposed concept of
Another advantage of the proposed approach is that information, from a mathematical perspective, is a measure of structure in choice data. Confidence in the depression scales for patients depends directly on the face value of the choices of item texts, and the weights placed on the choices and the sum scores that initialize the analyses that apply to these choice data. A long item information curve is better than a short one, and the same holds for the entire scale. This was shown in the empirical illustration where we compared two scales. Another potential advantage is that the proposed approach allows us to model the data even when the data do not fit a parametric IRT model. This topic should, however, be examined in more depth in the future.
In this study, the parameters of the GRM were estimated using marginal maximum likelihood (MML; Bock & Aitkin, 1981) with a normally distributed
Our proposed approach had some limitations. One is that we only examined
Acknowledgements
The data used to illustrate the application of the methodology were obtained from a study funded by the Canadian Institutes of Health Research (THC-135234), Crohn’s and Colitis Canada, and the Waugh Family Chair in Multiple Sclerosis (to RAM). Dr Bernstein is supported in part by the Bingham Chair in Gastroenterology. Members of the CIHR Team in Defining the Burden and Managing the Effects of Psychiatric Comorbidity in Chronic Immunoinflammatory Disease are Ruth Ann Marrie, James M. Bolton, Jitender Sareen, John R. Walker, Scott B. Patten, Alexander Singer, Lisa M. Lix, Carol A. Hitchon, Renée El-Gabalawy, Alan Katz, John D. Fisk, Charles N. Bernstein, Lesley Graff, Lindsay Berrigan, Ryan Zarychanski, Christine Peschken, and James Marriott. The authors acknowledge the use of Shared Health facilities during data collection. This study was a secondary analysis of de-identified data from the “Defining the Burden and Managing the Effects of Immune-mediated Inflammatory Disease” (IMID) study. These clinical data are not publicly accessible. To access the data, please contact the cohort PIs: Ruth Ann Marrie (
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The methodology research was funded by the Swedish Research Council 2022-02046 and the Swedish Wallenberg grant MMW 2019.0129 with PI Marie Wiberg.
Authors
JAMES O. RAMSAY is a Professor Emeritus in the Department of Psychology and an associate member of the department of mathematics and statistics at McGill University, 2748 Howe St., Ottawa, Ontario, K2B 6W9, Canada. E-mail:
JUAN LI is a Senior Clinical Research Associate at the Neuroscience Program and Clinical Epidemiology Program, Ottawa Hospital Research Institute, 451 Smyth Road #1442, Ottawa Ontario K1H 8M5, Canada. E-mail:
CHARLES N. BERNSTEIN is a Distinguished Professor of Medicine and Bingham Chair in Gastroenterology at the Max Rady College of Medicine, Rady Faculty of Health Sciences and Director of the University of Manitoba IBD Clinical and Research Centre, 804-715 McDermot Avenue, Winnipeg, Manitoba, Canada, R3E3P4. E-mail:
RUTH ANN MARRIE is a Professor of Medicine at Dalhousie University, Canada. E-mail:
JOAKIM WALLMARK is a researcher at the Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, SE-90187 Umeå, Sweden. E-mail:
MARIE WIBERG is a professor at the Department of Statistics, Umeå School of Business, Economics and Statistics, Umeå University, SE-901 87 Umeå, Sweden, E-mail:
