Abstract
To read an original article critically, the reader must be (a) an expert in the field, (b) an expert in research methods, and (c) an expert in statistics. Attaining such expertise could take years, even decades. Nevertheless, readers can learn from issue-specific teaching. Toward this goal, this article presents a single table (Table 1) containing several curiosities. These curiosities are identified and explained, providing the reader with useful learning points.
The Study
Table 1 contains data from an actual study that examined the effects of three behavioral interventions (BEH 1, 2, and 3) on anxiety scores in undergraduate students preparing for their final examinations. The interventions were taught at baseline, and students were instructed to practice them daily. A rater who was blind to treatment assignment assessed anxiety at baseline and at 1-week and 2-week follow-up. The study details and data have been redacted to ensure anonymity.
Results
The main effect for Groups was statistically significant. The main effect for Time was statistically significant. The Group × Time interaction was also statistically significant. (As noted above, the exact statistics have been redacted.)
Readers who need help understanding main and interaction effects may refer to an earlier article in this column.1
Curiosities
Uninformed and trusting readers would conclude from these data that BEH3 was the best intervention for pre-examination anxiety in undergraduate students and that BEH3 may therefore be recommended for this purpose. However, a closer look at Table 1 identifies three curiosities. First, despite the statistical significance identified, there was very little real difference in scores among the three groups at the three time points; that is, what was statistically significant was not clinically significant. Second, in all groups and at all time points, the standard deviations (SDs) were very narrow, indicating that there was little dispersion of scores around the means. Finally, even if all three treatments were ineffective, there should at least have been a noticeable placebo effect; none was evident at either Week 1 or Week 2.
Table 1. Anxiety Ratings at Baseline, Week 1, and Week 2 in Students Exposed to Different Behavioral Interventions (BEH 1, 2, and 3).*
*Data presented in cells are mean (standard deviation). Statistical inferences are presented in the text.
Explanations
The rater responsible for the assessments had no prior experience with the use of the rating instrument or with evaluating pre-examination anxiety. As a result, the interview responses were interpreted with little discrimination, and the scores assigned were much the same in all subjects, in all groups, and at all time points, resulting in both similar means and negligible dispersion of scores around those means. Because SDs represent the 'noise' around the mean, and because the mean is the 'signal' of interest, narrow SDs make it easy for a signal to be detected.2 This is why statistical significance emerged despite the very small differences between groups and between time intervals. And if all subjects are rated in more or less the same way at all time points, neither the true intervention effect nor the placebo effect has an opportunity to emerge.
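To make the signal-to-noise point concrete, the following minimal Python sketch runs the same independent-samples t-test on the same clinically trivial mean difference under two noise conditions. The group means, SDs, and sample size are invented for illustration; they are not the redacted study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30  # hypothetical subjects per group

# The same trivial 0.5-point mean difference under two noise conditions.
for sd, label in [(0.4, "narrow SDs"), (4.0, "realistic SDs")]:
    group_a = rng.normal(loc=20.0, scale=sd, size=n)
    group_b = rng.normal(loc=19.5, scale=sd, size=n)
    t, p = stats.ttest_ind(group_a, group_b)
    print(f"{label}: t = {t:.2f}, p = {p:.4f}")

# With narrow SDs, the clinically trivial difference is typically
# "statistically significant"; with realistic SDs, it usually is not.
```

The exercise shows that statistical significance is a statement about signal relative to noise, not about the clinical importance of the signal.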
As an aside, it is possible that the statistical significance represents a true advantage for BEH3. However, given the curiosities in Table 1 already explained, and the unexpected 'worsening' in BEH1 and BEH2 at Week 1, it is also reasonable to consider that these differences arose from chance variations in ratings that were statistically significant only because the SDs were narrow. That is, the signal identified may have been spurious.
Action Points
Individuals who rate subjects in research should be experienced not only with the context of the study but also with the rating instruments intended for use in that context. Otherwise, they will rate with little skill, as shown in Table 1, and they may rate in progressively different ways with the passage of time, as they gain experience with the field and with the research instruments. A good rule of thumb is that raters should have been in the field for at least a year; ideally, they should hold a qualification in the field. They should also have had sufficient prior experience with the use of the rating instruments.
Where possible, individuals who perform ratings should also be trained using standardized videos, and training should continue until their ratings match well against the rating standards for those videos. For example, on the 17-item Hamilton Rating Scale for Depression, ideally, the rater should not deviate from the rating standard by more than one point on any individual item or by more than three points on the total scale score.
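A calibration criterion like the one just described can be checked mechanically. The sketch below applies the one-point-per-item and three-point-total thresholds from the text to a trainee's ratings of one standardized video; the 17-item ratings shown are hypothetical.

```python
# Compare a trainee's ratings of a standardized video against the
# gold-standard ratings: no item may deviate by more than 1 point,
# and the total score may not deviate by more than 3 points.

def passes_calibration(trainee, standard, item_tol=1, total_tol=3):
    item_ok = all(abs(t - s) <= item_tol for t, s in zip(trainee, standard))
    total_ok = abs(sum(trainee) - sum(standard)) <= total_tol
    return item_ok and total_ok

# Hypothetical 17-item ratings (e.g., for the Hamilton scale).
standard = [2, 1, 0, 2, 1, 1, 0, 2, 1, 0, 1, 0, 1, 1, 0, 1, 2]
trainee  = [2, 1, 1, 2, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 2]

print("Pass" if passes_calibration(trainee, standard) else "Retrain")
```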
If the study runs for a long time, periodic rater recalibration against standardized videos is desirable, because rater drift in the pattern of rating may occur over time. Finally, in an ideal world, subjects are interviewed in a standardized fashion by trained and experienced interviewers, the interviews are videotaped, and the videos are then presented in random order for assessment by an experienced rater who was not involved with the interviews. This procedure helps remove the Rosenthal component of the placebo effect.3
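Periodic recalibration can likewise be monitored with a simple log. In this sketch, a rater re-rates the same standardized video at intervals, and drift is flagged when the total-score deviation exceeds the three-point tolerance mentioned earlier; the monthly totals are hypothetical.

```python
# Hypothetical recalibration log: the rater re-rates the same
# standardized video each month; flag drift when the total score
# deviates from the gold standard by more than 3 points.

standard_total = 16
monthly_totals = [16, 17, 15, 18, 19, 20]  # hypothetical re-rating totals

for month, total in enumerate(monthly_totals, start=1):
    deviation = total - standard_total
    flag = "DRIFT - recalibrate" if abs(deviation) > 3 else "ok"
    print(f"month {month}: total = {total} (deviation {deviation:+d}) {flag}")
```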
Additional Note
One further known issue concerns assessments in research. When raters do not have an academic stake in the study, such as when they are merely employed to do the ratings, they may cut corners. As an example, they may ask a friend in another research project to do one or more ratings so that they get time off for personal purposes. This is not acceptable practice: when different people perform ratings, their different styles of evaluation and scoring increase the dispersion of scores (i.e., statistical noise is introduced), making statistical significance harder to attain should the outcome be true in the population. The use of different raters is permissible only if inter-rater reliability exercises have been conducted and the reliability is high.
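For continuous scale scores, inter-rater reliability is conventionally quantified with an intraclass correlation coefficient. The sketch below computes a two-way random-effects, absolute-agreement ICC(2,1) from first principles; the paired ratings are hypothetical, and in practice a vetted statistics package would be preferable.

```python
import numpy as np

def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    ratings: (n_subjects, k_raters) array of scale scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical anxiety scores from two raters rating the same 6 subjects.
scores = [[18, 17], [22, 21], [15, 16], [20, 20], [25, 23], [19, 18]]
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")  # values near 1 = high agreement
```

Only when such an exercise yields a high coefficient can ratings from different raters reasonably be pooled.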
