Abstract
To read an original article critically, the reader must be (a) an expert in the field, (b) an expert in research methods, and (c) an expert in statistics. Attaining such expertise could take years, even decades. Nevertheless, readers can learn from issue-specific teaching. Toward this goal, this article presents a single table (Table 1) containing several curiosities. These curiosities are identified and explained, providing the reader with useful learning points.
The Study
Table 1 contains data from an actual study that examined the effects of three behavioral interventions (BEH 1, 2, and 3) on anxiety scores in undergraduate students preparing for their final examinations. The interventions were taught at baseline, and students were instructed to practice them daily. A rater who was blind to treatment assignment assessed anxiety at baseline and at 1-week and 2-week follow-up. The study details and data have been redacted to ensure anonymity.
Results
The main effect for Groups was statistically significant. The main effect for Time was statistically significant. The Group × Time interaction was also statistically significant. (As noted above, the exact statistics have been redacted.)
Readers who need help understanding main and interaction effects may refer to an earlier article in this column.1
Curiosities
Uninformed and trusting readers would conclude from these data that BEH3 was the best intervention for pre-examination anxiety in undergraduate students and that BEH3 may therefore be recommended for this purpose. However, a closer look at Table 1 identifies three curiosities. First, despite the statistical significance identified, there was very little real difference in scores among the three groups at the three time points; that is, what was statistically significant was not clinically significant. Second, in all groups and at all time points, the standard deviations (SDs) were very narrow, indicating that there was little dispersion of scores around the means. Finally, even if all three treatments were ineffective, there should at least have been a noticeable placebo effect; none was evident at either Week 1 or Week 2.
Table 1. Anxiety Ratings at Baseline, Week 1, and Week 2 in Students Exposed to Different Behavioral Interventions (BEH 1, 2, and 3).*
*Data presented in cells are mean (standard deviation). Statistical inferences are presented in the text.
Explanations
The rater responsible for the assessments had no prior experience with the use of the rating instrument or with evaluating pre-examination anxiety. As a result, the interview responses were interpreted with little discrimination, and the scores assigned were much the same in all subjects, in all groups, and at all time points, resulting in both similar means and negligible dispersion of scores around those means. Because SDs represent the 'noise' around the mean, and because the mean is the 'signal' of interest, narrow SDs make it easy for a signal to be detected.2 This is why statistical significance emerged despite the very small differences between groups and between time intervals. And if all subjects are rated in more or less the same way at all time points, neither the true intervention effect nor the placebo effect has an opportunity to emerge.
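To make the signal-to-noise point concrete, the following minimal Python sketch runs the same independent-samples t-test on the same clinically trivial mean difference under two noise conditions. The group means, SDs, and sample size are invented for illustration; they are not the redacted study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30  # hypothetical subjects per group

# The same trivial 0.5-point mean difference under two noise conditions.
for sd, label in [(0.4, "narrow SDs"), (4.0, "realistic SDs")]:
    group_a = rng.normal(loc=20.0, scale=sd, size=n)
    group_b = rng.normal(loc=19.5, scale=sd, size=n)
    t, p = stats.ttest_ind(group_a, group_b)
    print(f"{label}: t = {t:.2f}, p = {p:.4f}")

# With narrow SDs, the clinically trivial difference is typically
# "statistically significant"; with realistic SDs, it usually is not.
```

The exercise shows that statistical significance is a statement about signal relative to noise, not about the clinical importance of the signal.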
As an aside, it is possible that the statistical significance represents a true advantage for BEH3. However, given the curiosities in Table 1 already explained, and the unexpected 'worsening' in BEH1 and BEH2 at Week 1, it is also reasonable to consider that these differences arose from chance variations in ratings that were statistically significant only because the SDs were narrow. That is, the signal identified may have been spurious.
Action Points
Individuals who rate subjects in research should be experienced not only with the context of the study but also with the rating instruments intended for use in that context. Otherwise, they will rate with little skill, as shown in Table 1, and they may rate in progressively different ways with the passage of time, as they gain experience with the field and with the research instruments. A good rule of thumb is that raters should have been in the field for at least a year; ideally, they should hold a qualification in the field. They should also have had sufficient prior experience with the use of the rating instruments.
Where possible, individuals who perform ratings should also be trained using standardized videos, and training should continue until their ratings match well against the rating standards for those videos. For example, on the 17-item Hamilton Rating Scale for Depression, ideally, the rater should not deviate from the rating standard by more than one point on any individual item or by more than three points on the total scale score.
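A calibration criterion like the one just described can be checked mechanically. The sketch below applies the one-point-per-item and three-point-total thresholds from the text to a trainee's ratings of one standardized video; the 17-item ratings shown are hypothetical.

```python
# Compare a trainee's ratings of a standardized video against the
# gold-standard ratings: no item may deviate by more than 1 point,
# and the total score may not deviate by more than 3 points.

def passes_calibration(trainee, standard, item_tol=1, total_tol=3):
    item_ok = all(abs(t - s) <= item_tol for t, s in zip(trainee, standard))
    total_ok = abs(sum(trainee) - sum(standard)) <= total_tol
    return item_ok and total_ok

# Hypothetical 17-item ratings (e.g., for the Hamilton scale).
standard = [2, 1, 0, 2, 1, 1, 0, 2, 1, 0, 1, 0, 1, 1, 0, 1, 2]
trainee  = [2, 1, 1, 2, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 2]

print("Pass" if passes_calibration(trainee, standard) else "Retrain")
```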
If the study runs for a long time, periodic rater recalibration against standardized videos is desirable, because rater drift in the pattern of rating may occur over time. Finally, in an ideal world, subjects are interviewed in a standardized fashion by trained and experienced interviewers, the interviews are videotaped, and the videos are then presented in random order for assessment by an experienced rater who was not involved with the interviews. This procedure helps remove the Rosenthal component of the placebo effect.3
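Periodic recalibration can likewise be monitored with a simple log. In this sketch, a rater re-rates the same standardized video at intervals, and drift is flagged when the total-score deviation exceeds the three-point tolerance mentioned earlier; the monthly totals are hypothetical.

```python
# Hypothetical recalibration log: the rater re-rates the same
# standardized video each month; flag drift when the total score
# deviates from the gold standard by more than 3 points.

standard_total = 16
monthly_totals = [16, 17, 15, 18, 19, 20]  # hypothetical re-rating totals

for month, total in enumerate(monthly_totals, start=1):
    deviation = total - standard_total
    flag = "DRIFT - recalibrate" if abs(deviation) > 3 else "ok"
    print(f"month {month}: total = {total} (deviation {deviation:+d}) {flag}")
```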
Additional Note
One further known issue concerns assessments in research. When raters do not have an academic stake in the study, such as when they are merely employed to do the ratings, they may cut corners. As an example, they may ask a friend in another research project to do one or more ratings so that they get time off for personal purposes. This is not acceptable practice: when different people perform ratings, their different styles of evaluation and scoring increase the dispersion of scores (i.e., statistical noise is introduced), making statistical significance harder to attain should the outcome be true in the population. The use of different raters is permissible only if inter-rater reliability exercises have been conducted and the reliability is high.
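For continuous scale scores, inter-rater reliability is conventionally quantified with an intraclass correlation coefficient. The sketch below computes a two-way random-effects, absolute-agreement ICC(2,1) from first principles; the paired ratings are hypothetical, and in practice a vetted statistics package would be preferable.

```python
import numpy as np

def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    ratings: (n_subjects, k_raters) array of scale scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical anxiety scores from two raters rating the same 6 subjects.
scores = [[18, 17], [22, 21], [15, 16], [20, 20], [25, 23], [19, 18]]
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")  # values near 1 = high agreement
```

Only when such an exercise yields a high coefficient can ratings from different raters reasonably be pooled.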
