Abstract
Introduction
The Character and Citizenship Education (CCE) curriculum developed by the Singapore Ministry of Education (MOE) aims to develop students into confident people who are discerning in judgment and possess a strong sense of right and wrong (MOE, 2012, 2016, 2020). This supports the MOE's desired outcomes of education, which include qualities embodied within a person who is: (1) confident, (2) a self-directed learner, (3) an active contributor and (4) a concerned citizen. Learning outcome eight (LO8: Reflect on and respond to community, national and global issues, as an informed and responsible citizen) of the CCE curriculum carries a particular charge in supporting the desired outcomes of education. LO8 is elaborated through key stage outcomes, which in part address the values of respect and responsibility and the social awareness domain. Inextricably linked to moral reasoning, the intended key stage outcome is for students to: (1) be able to distinguish right from wrong at the primary level, (2) have moral integrity at the secondary level and (3) have the moral courage to stand up for what is right at the pre-tertiary level (MOE, 2012, 2016, 2020).
In part to achieve LO8, the CCE curriculum intends for students to progress through various levels of moral reasoning based on Kohlberg's stage-based theory of moral development (Kohlberg, 1984) (Figure 1). To do this, curriculum documents have suggested several approaches that teachers can apply (e.g., discussing moral dilemmas using a clarify–sensitise–influence approach and modelling how decisions could be made in the context of these dilemmas). While personalised and desirable, the approaches suggested are considerably resource intensive, given that teachers would have to record their discussions with each student as a form of tracking, without which they might not be conscious of the progress made by each student. In light of this, the Moral Reasoning Questionnaire (MRQ) was developed upon an operational definition of moral reasoning proffered by Lim & Chapman (2021a), for use in Singapore schools on a large-scale basis with students aged between 12 and 18 (grades 7 to 12), after an extensive review of established instruments found concerns with content appropriateness and group administrability (Lim & Chapman, 2021b). Based on critical stages recommended by the

Kohlberg's stage-based theory of moral development.
Rasch Measurement Theory
Despite its widespread application in educational measurement since the 1920s (Kohli et al., 2015), the CTT-based factor analytic approach is subject to several limitations, owing in part to a circular dependency in its conceptualisation. In CTT, person statistics (i.e., observed scores) are inherently item-sample dependent and often assumed to be normally distributed, while item statistics (i.e., difficulty and discrimination) are person-sample dependent (Boone, 2016; Ewing et al., 2005; Kohli et al., 2015). This restricts the applicability of CTT in various important measurement situations (e.g., situations in which different tests must be equated). Further, in situations that involve rating scale data (e.g., ordinal scales such as the Likert scale), approaches grounded in CTT do not provide a basis for exploring the additivity of scores, a critical attribute of any valid measure (Wu & Leung, 2017). These limitations call for an additional validation of the MRQ, and Rasch Measurement Theory (RMT), posited as an elaboration of CTT (Andrich & Marais, 2019), fits the purpose of this study. RMT is a suitable method for extending the validation of the MRQ because Rasch analysis is designed for dichotomous or polytomous response data, makes no distributional assumptions and enables rating scales to be modified, through the identification and treatment of misfitting items, so that an instrument fittingly measures a latent trait (Hendriks et al., 2012; Pallant & Tennant, 2007).
Developed by Georg Rasch in 1957, RMT is distinguished by a table of expected response probabilities reflecting Rasch's view that a person with greater proficiency than another should have a higher probability of solving or endorsing an item; conversely, if an item is more difficult to solve or to endorse than another, then for any person the probability of solving or endorsing the other item is higher (Andrich, 1997; Andrich & Marais, 2019; Bond & Fox, 2015; Rasch, 1960). By first locating and ordering person proficiency and item difficulty on a log-linear scale reflecting degrees of the latent trait (e.g., easy to difficult; most to least endorsable), by incorporating the ordering features of Guttman scaling, and by focusing on probabilistic distributions of examinees' performance at the item level rather than on test-level information, RMT offers an alternative basis for constructing measurements within and beyond education (Pallant & Tennant, 2007; Rasch, 1960).
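These two ordering properties can be illustrated with the dichotomous case of the model, in which the probability of success is exp(β − δ)/(1 + exp(β − δ)) for person proficiency β and item difficulty δ, both in logits. The sketch below is a minimal illustration; the function name is ours and is not part of the MRQ analysis.

```python
import math

def rasch_probability(beta, delta):
    """Probability that a person with proficiency beta (logits) solves
    or endorses an item with difficulty delta (logits), under the
    dichotomous Rasch model."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# Property 1: for a fixed item, the more proficient person has the
# higher probability of success.
assert rasch_probability(1.0, 0.0) > rasch_probability(-1.0, 0.0)

# Property 2: for a fixed person, the easier of two items has the
# higher probability of success.
assert rasch_probability(0.0, -0.5) > rasch_probability(0.0, 1.5)
```

When proficiency equals difficulty (β = δ), the probability is exactly .5, which is how item locations anchor the logit scale.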
Following the ordering of persons and items on the log-linear scale, the psychometric properties of an instrument can be determined; these include the dimensionality of the instrument, fit to the Rasch model (i.e., person and item fit), threshold ordering, differential item functioning (i.e., item bias), local independence and the person-separation index (PSI) (i.e., internal reliability). An important advantage of the Rasch approach derives from the fact that the item parameters obtained do not depend on the characteristics of the persons taking the test, and that the person parameters do not depend on the specific items chosen for a given test (Andrich & Marais, 2019; Bond & Fox, 2015). The parameters produced through Rasch analysis are thus independent of specific sample characteristics; person measures are interpreted with reference to the items defining the purported latent trait, as opposed to CTT, where person measures are interpreted with reference to the sample mean (Ewing et al., 2005). In view of this, the Rasch approach, taken as an elaboration of CTT, addresses the issues with the CTT-based factor analytic approach. Further, Rasch measurement models are tenable as a basis for examining scores obtained through rating scales, because they provide a means by which the hierarchical structure, unidimensionality and additivity of the scores can be evaluated.
It is noteworthy that while various terms have been used for Rasch analysis as applied to dichotomous or polytomous scoring structures, Andrich et al. (2018) suggested that there remains only one model (i.e., the unidimensional Rasch model for ordered categories) involving different types of items; terms such as the Dichotomous Model, the Partial Credit Model and the Rating Scale Model are unnecessary and have misled some to assume that there are different Rasch models.
Moral Reasoning Questionnaire
The 26-item MRQ was developed in accordance with the recommendations made by Lim & Chapman (2021b) in their review of existing moral reasoning measures, whilst considering the intended audience (see Appendix A). Based upon Kohlberg's stage-based theory of moral development (Kohlberg, 1984), the MRQ was initially developed with 30 items, of which four were discarded during the preliminary validation. The MRQ is intended to be delivered online and its items follow a two-tier response format. In tier one, respondents select one of two ‘action’ options after reading a moral dilemma vignette. Tier two, an ordering response format, is then presented based on respondents’ selection in tier one; it requires respondents to rank a set of options in order of importance to themselves, with each option corresponding to a level in Kohlberg's stage-based theory of moral development. Figure 2 presents an example of an item within the MRQ.

Example item with vignette and corresponding options.
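To make the two-tier structure concrete, it can be represented schematically as below. This is a hypothetical sketch: the vignette text, option labels and field names are illustrative and are not taken from the MRQ itself.

```python
# Hypothetical representation of one two-tier item. The tier-two
# options shown to a respondent depend on the tier-one action chosen,
# and each tier-two option maps to a level of Kohlberg's theory.
item = {
    "vignette": "A moral dilemma scenario is presented here.",
    "tier_one_actions": ["Action A", "Action B"],
    "tier_two_options": {
        "Action A": {"Option 1": "pre-conventional",
                     "Option 2": "conventional",
                     "Option 3": "post-conventional"},
        "Action B": {"Option 4": "pre-conventional",
                     "Option 5": "conventional",
                     "Option 6": "post-conventional"},
    },
}

# A response records the tier-one choice and the tier-two ranking
# (most to least important) of the options shown for that choice.
response = {"tier_one": "Action A",
            "tier_two_ranking": ["Option 3", "Option 2", "Option 1"]}
```

A design choice worth noting is that because tier two is conditional on tier one, both tiers must be stored together for a response to be scorable.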
Responses to the MRQ are scored based on the scoring matrix presented in Table 1. This scoring matrix had been applied during the preliminary validation and there was no evidence to suggest, based on the CTT factor analytic approach, that it was inappropriate (Lim & Chapman, 2021c) (see Appendix B for results).
Scoring matrix of two-tier items.
Participants
Data for this study were drawn from the responses of participants who took part in the preliminary validation, which was approved by the authors’ respective institutional review boards (IRBs). As required by the IRBs, each participant received a consent form and a participant information sheet specifying that: (1) involvement in the research was voluntary, (2) participants were free to withdraw at any stage without prejudice, with no reason required for withdrawal, and (3) all data would be anonymised and no participant would be identifiable. As all the participants were minors, parental consent was sought before each participant took part in the study. The participants were from three secondary schools (grades 8 to 12, aged between 12 and 18) in Singapore that agreed to support the study following the access permission granted by the MOE in 2015. Of the 670 participants whose parents/guardians agreed to let them participate in this study, 497 were female. The age range of the participants was 12 to 18 years (
The participants from the three participating schools were considered diverse as they represented different educational levels and streams. At the point of data collection, 17.6% (n = 118) of the participants were from secondary one, 27.8% (n = 186) were from secondary two, 28.4% (n = 190) were from secondary three and 26.3% (n = 176) were from secondary four. As to streams, 79.4% (
Rasch analysis
The Rasch analysis was conducted using the Rasch Unidimensional Measurement Model (RUMM2030) software version 5.4 (RUMM Laboratory Pty Ltd, Perth, Australia). The analysis was performed primarily to assess the fit of the data to an unrestricted Rasch model, without assuming a uniform distance between response thresholds. The MRQ and its items were evaluated, using parametric statistical tests, for: (1) threshold ordering and reliability, (2) overall model fit, (3) individual item and person fit, (4) item characteristic curves, (5) local dependency and differential item functioning and (6) dimensionality.
Threshold ordering and reliability
Initial results obtained from the Rasch analysis indicated the presence of disordered thresholds across a number of items in the MRQ, and a significant chi-square statistic, χ2 (270, N = 669) = 517.27,
The presence of disordered thresholds suggested that respondents might not have been able to distinguish between the six response categories presented within the MRQ based on the Rasch model. Given this result, a remediation was undertaken to collapse the response categories into a smaller number, as suggested by Andrich and Marais (2019); the data were re-scored based on this revised scoring matrix and subjected to a second round of Rasch analysis (results reported from here on). The revised categories, presented in Table 2, were premised on the following assumptions: (1) a respondent would score 2 if she or he identified the pre-conventional level as the lowest level of moral judgment; (2) a respondent would score 0 if she or he identified the levels of moral judgment in the order opposite to that of Kohlberg's stage-based theory of moral development; and (3) a respondent would score 1 for all other rank order permutations.
Scoring matrix of two-tier items with revised categories.
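The three assumptions above amount to a simple decision rule, which can be sketched as below. This is a hypothetical illustration assuming a tier-two response is recorded as a ranking of Kohlberg's three levels from most to least important; the labels and function name are ours, not part of the MRQ materials.

```python
# Kohlberg's levels ordered from highest to lowest stage.
KOHLBERG_ORDER = ("post-conventional", "conventional", "pre-conventional")

def revised_score(ranking):
    """Return the revised category score (0, 1 or 2) for a ranking of
    the three levels from most to least important."""
    if ranking == tuple(reversed(KOHLBERG_ORDER)):
        return 0  # ordering exactly opposite to Kohlberg's theory
    if ranking[-1] == "pre-conventional":
        return 2  # pre-conventional identified as the lowest level
    return 1      # all other rank order permutations

# A ranking consistent with Kohlberg's theory scores 2.
assert revised_score(("post-conventional", "conventional",
                      "pre-conventional")) == 2
```

Note that the rule checks the fully reversed ordering first, so assumption (2) takes precedence; all remaining permutations fall to assumption (1) or, failing that, assumption (3).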
In the Rasch analysis performed on the MRQ using the revised categories, items A5, A11, A13 and A20 were removed as disordered thresholds remained evident (i.e., persons at a higher moral reasoning stage showed a lower probability of endorsing a more endorsable category than persons at a lower moral reasoning stage, and vice versa). No other items had disordered thresholds under the revised categories. This agrees with the preliminary validation, which also identified these four items as likely causes of model misfit based on the factor analytic approach. Figure 3 presents the threshold map for the remaining 26 items and Table 3 shows the uncentralised item thresholds, which indicate that all category responses were used consistently as expected.

Threshold map for remaining 26 items.
Uncentralised item thresholds.
Despite the removal of items A5, A11, A13 and A20, the Rasch analysis continued to find adequate reliability estimates (i.e., a
Overall model fit
In terms of overall model fit, the χ2 test for fit to the Rasch model remained significant at χ2 (234,
Individual item and person fit
Examining the individual item and person fit outputs, fit residuals should lie within the range of ± 2.5 for an item to be considered fitting to the Rasch model (Tennant & Conaghan, 2007; Tennant & Pallant, 2006), though Andrich and Marais (2019) stated that ‘there are no absolute criteria for interpreting fit statistics’ (p. 196). In this case, all items (with Bonferroni adjustment) except for A9 (2.71) and A26 (−2.65) were within the threshold and
Fit statistics for MRQ with revised categories.
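The fit residual criterion builds on a basic quantity: the standardized difference between an observed score and its model expectation. The sketch below is a simplified illustration under the polytomous Rasch model with uncentralised thresholds; RUMM2030's reported fit residuals involve further aggregation and transformation of such quantities, and the function names here are ours.

```python
import math

def category_probs(beta, thresholds):
    """Category probabilities for a person at beta logits under the
    polytomous Rasch model, given uncentralised adjacent-category
    thresholds tau_1..tau_m: P(x) is proportional to
    exp(sum over k<=x of (beta - tau_k))."""
    weights = [1.0]
    for tau in thresholds:
        weights.append(weights[-1] * math.exp(beta - tau))
    total = sum(weights)
    return [w / total for w in weights]

def standardized_residual(observed, beta, thresholds):
    """(observed - expected) / sd of the score: the basic building
    block behind item and person fit residuals."""
    probs = category_probs(beta, thresholds)
    expected = sum(x * p for x, p in enumerate(probs))
    variance = sum((x - expected) ** 2 * p for x, p in enumerate(probs))
    return (observed - expected) / math.sqrt(variance)
```

Large positive or negative standardized residuals accumulated over persons are what push an item's fit residual beyond the ± 2.5 range.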
Item characteristic curves
Item characteristic curves (ICC) are reviewed as part of any Rasch analysis (e.g., when the

ICC of item A9.

ICC of item A26.
Though the fit residuals of items A9 and A26 were slightly beyond the ± 2.5 range, their χ2 probabilities with Bonferroni adjustment were not less than .01. In view of this and the modest item misfit, it was concluded that all 26 items measure a common underlying construct (i.e., moral reasoning). With regard to person fit, only 10 respondents (excluding extreme cases) had fit residuals outside the ± 2.5 range, ranging between −3.69 and 3.11. This could indicate anomalies in the score patterns of these respondents, which may have reflected fatigue. As there were no data entry errors and the overall model fit was good, these respondents were not removed.
Local dependency and differential item functioning
Local dependency was investigated based on the residual correlation matrix (Pallant & Tennant, 2007). From the Rasch analysis, the maximum inter-item residual correlation (
Differential item functioning (
To appreciate the extent of the item bias, the ICC with level plots for item A10 of the MRQ (Figure 6) was generated and reviewed. Visually, there is item bias between the secondary one (S1) and secondary four (S4) levels at 0 to 0.5 logits and above 2.5 logits. The uniform

ICC with level plots for item A10.
Dimensionality
Further to the non-significant item-trait interaction χ2 statistic, evidence from the principal components analysis (

The findings presented so far point to a good overall fit to the Rasch model and support the unidimensionality of the MRQ. The person-item threshold distribution, which places student (person) and item location estimates on the same logit scale (Figure 8), shows that the items and thresholds spanned almost the full range of person scores, except for some respondents who scored very high on the MRQ. This could be explained by the MOE's expectation that most secondary school students fall within the conventional to post-conventional levels of moral development.

Person-item threshold distribution.
Further analyses suggested that inferences drawn from the measure would not be confounded by students’ demographic attributes (i.e., gender, school or educational stream). Females did have slightly higher moral reasoning scores, but did not differ significantly from males (

Person-item measure threshold distribution by gender.

Person-item measure threshold distribution by school.

Person-item measure threshold distribution by educational stream.

Person-item measure threshold distribution by level of study.
Discussion
Based on the analyses presented in six areas (i.e., threshold ordering and reliability, overall model fit, individual item and person fit, item characteristic curves, local dependency and differential item functioning, and dimensionality), there was adequate evidence to affirm the unidimensionality, and hence the intended purpose, of the MRQ. The MRQ items were also shown to function as anticipated, based on the fit of the data to the Rasch model.
The Rasch analysis presented item A7 as the least endorsable item on the log-linear scale (i.e., item location = .998), with adjacent category thresholds of −.4 and 2.4 logits (Figure 13). Based on the polytomous Rasch model as expressed by equation (1) (Andrich & Marais, 2019), the probabilities of a student with a proficiency at zero logits scoring 0, 1 and 2 are .38, .57 and .05, respectively; the probabilities of an average student with a mean of the order of 1.624 (Figure 8) scoring 0, 1 and 2 are .23, .65 and .11, respectively. In the same vein, the most endorsable item on the log-linear scale, item A15 (i.e., item location = −.779), had adjacent category thresholds of −1.2 and −.3 (Figure 14). Based on equation (1), the probabilities of a student with a proficiency at zero logits scoring 0, 1 and 2 are .11, .37 and .52, respectively, and those of a student with a mean proficiency of the order of 1.624 are .00, .00 and .94, respectively.

Threshold probability curve of item A7.

Threshold probability curve of item A15.
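As a check on these figures, the category probabilities implied by the published thresholds can be computed directly. The sketch below assumes the standard polytomous Rasch form, in which the probability of score x is proportional to exp(Σk≤x (β − τk)) with uncentralised thresholds τk; the function name is ours. Under this assumption the zero-logit probabilities for item A7 reproduce the values reported above; small discrepancies elsewhere may reflect rounding of the published thresholds and person means.

```python
import math

def category_probs(beta, thresholds):
    """Category probabilities for a person at beta logits under the
    polytomous Rasch model, given uncentralised adjacent-category
    thresholds."""
    weights = [1.0]
    for tau in thresholds:
        weights.append(weights[-1] * math.exp(beta - tau))
    total = sum(weights)
    return [w / total for w in weights]

# Item A7 (thresholds -.4 and 2.4), person at zero logits:
print([round(p, 2) for p in category_probs(0.0, [-0.4, 2.4])])
# -> [0.38, 0.57, 0.05], matching the reported probabilities

# Item A15 (thresholds -1.2 and -.3), person at zero logits;
# the result is close to the reported .11, .37 and .52.
print([round(p, 2) for p in category_probs(0.0, [-1.2, -0.3])])
```

The same function, evaluated across a grid of β values, generates the threshold probability curves shown in Figures 13 and 14.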
Considering these and that the MOE expects most secondary school students to fall within the conventional to post-conventional levels of Kohlberg's stage-based model, it appears that the targeting of the MRQ (Figure 8) could be further refined and subsequent versions of the MRQ could include more ‘difficult to endorse’ items to assess the conventional to post-conventional levels of moral reasoning.
More could also be done to establish the MRQ as a measure of moral reasoning independent of cognition and intelligence, given that various previous studies have reported correlations in the .20 to .50 range between moral judgments and measures of intelligence, aptitude and achievement (Rest, 1979; Thoma & Dong, 2014). The data used in this study were from students in mainstream secondary schools, and hence the MRQ appears to be fit for purpose for Singapore mainstream secondary school students. To ascertain that there is no
Though the Rasch analysis suggested that inferences drawn from the measure would not be confounded by students’ demographic attributes (i.e., gender, school or educational stream), the disproportionate sample by gender and educational stream, owing to availability sampling, could superficially suggest otherwise. Hence, a more representative sample could be invited to participate in subsequent studies. With more data, measurement invariance by educational stream and gender could be affirmed.
Disordered thresholds that were identified through this study called for the MRQ scoring matrix to be revised. Hence, the revised scoring matrix should be used moving forward so that a log-linear person-measure of the MRQ can be established for the meaningful comparison of respondents’ moral reasoning.
Based on the triangulation of evidence across all of these analyses, the present study together with the preliminary validation present the MRQ as an instrument that can be used in Singapore secondary schools to monitor students’ development in the area of moral reasoning. As an accessible instrument with sound psychometric properties founded upon both CTT and RMT, the MRQ would be suitable for use on a large-scale basis. A further advantage of this instrument is that minimal training is required for teachers to administer and score the test. This adds further support to the notion that the MRQ can provide a practical means by which students’ development in moral reasoning can be monitored, hence addressing a major gap identified in this context (Lim & Chapman, 2021b).
Conclusion
This paper detailed how the RMT approach was used to validate the moral reasoning scale based on the MRQ, how the analyses were interpreted and how identified issues were resolved. The RMT approach undertaken in this study served as an elaboration of the CTT-based factor analytic approach used within the preliminary validation of the MRQ.
The Rasch analysis found evidence to support, amongst the reported psychometric properties, the unidimensionality and intended purpose of the MRQ, though issues related to disordered thresholds were identified. This led to a revised scoring matrix upon which further analyses found that invariant comparisons of persons and items could be drawn. Hence, it appears that the MRQ presents a viable scale free of
By its nature, the ‘validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it’ (AERA, APA & NCME, 2014, p. 21). While this study is an extension of the preliminary validation and presents the MRQ as an instrument holding considerable promise for use within the Singapore context, further research might be needed to support adoption on a widespread basis. As a concluding example, other Rasch analysis software could be applied to ascertain the findings of this study.
