Abstract
Introduction
The Character and Citizenship Education (CCE) curriculum developed by the Singapore Ministry of Education (MOE) aims to develop students into confident people who are discerning in judgment and possess a strong sense of right and wrong (MOE, 2012, 2016, 2020). This supports the MOE's desired outcomes of education, which include qualities embodied within a person who is: (1) confident, (2) a self-directed learner, (3) an active contributor and (4) a concerned citizen. Learning outcome eight (LO8: Reflect on and respond to community, national and global issues, as an informed and responsible citizen) of the CCE curriculum carries a particular charge in supporting the desired outcomes of education. LO8 is elaborated through key stage outcomes, which in part address the values of respect and responsibility and the social awareness domain. Inextricably linked to moral reasoning, the intended key stage outcome is for students to: (1) be able to distinguish right from wrong at the primary level, (2) have moral integrity at the secondary level and (3) have the moral courage to stand up for what is right at the pre-tertiary level (MOE, 2012, 2016, 2020).
In part to achieve LO8, the CCE curriculum intends for students to progress through various levels of moral reasoning based on Kohlberg's stage-based theory of moral development (Kohlberg, 1984) (Figure 1). To do this, curriculum documents have suggested several approaches that teachers can apply (e.g., discussing moral dilemmas using a clarify–sensitise–influence approach and modelling how decisions could be made in the context of these dilemmas). While personalised and desirable, the approaches suggested are considerably resource intensive, given that teachers would have to record their discussions with each student as a form of tracking, without which they might not be conscious of the progress made by each student. In light of this, the Moral Reasoning Questionnaire (MRQ) was developed upon an operational definition of moral reasoning proffered by Lim & Chapman (2021a), for use in Singapore schools on a large-scale basis with students aged between 12 and 18 (grades 7 to 12), after an extensive review of established instruments found concerns with content appropriateness and group administrability (Lim & Chapman, 2021b). Based on critical stages recommended by the

Kohlberg's stage-based theory of moral development.
Rasch Measurement Theory
Despite its widespread application in educational measurement since the 1920s (Kohli et al., 2015), the CTT-based factor analytic approach is subject to several limitations, owing in part to a circular dependency in its conceptualisation. In CTT, person statistics (i.e., observed scores) are inherently item-sample dependent and often assumed to be normally distributed, while item statistics (i.e., difficulty and discrimination) are person-sample dependent (Boone, 2016; Ewing et al., 2005; Kohli et al., 2015). This restricts the applicability of CTT in various important measurement situations (e.g., situations in which different tests must be equated). Further, in situations that involve rating scale data (e.g., ordinal scales such as the Likert scale), approaches grounded in CTT do not provide a basis for exploring the additivity of scores, a critical attribute of any valid measure (Wu & Leung, 2017). These limitations call for an additional validation of the MRQ, and Rasch Measurement Theory (RMT), posited as an elaboration of CTT (Andrich & Marais, 2019), fits the purpose of this study. RMT is a suitable method for extending the validation of the MRQ because Rasch analysis is designed for dichotomous or polytomous response data, makes no distributional assumptions and enables rating scales to be modified, through the identification and treatment of misfitting items, so that an instrument fittingly measures a latent trait (Hendriks et al., 2012; Pallant & Tennant, 2007).
Developed by Georg Rasch in 1957, RMT is distinguished by a table of expected response probabilities reflecting Rasch's view that a person with greater proficiency than another should have a higher probability of solving or endorsing an item; conversely, if an item is more difficult to solve or to endorse than another, then for any person the probability of solving or endorsing the other item is higher (Andrich, 1997; Andrich & Marais, 2019; Bond & Fox, 2015; Rasch, 1960). By first locating and ordering person proficiency and item difficulty on a log-linear scale reflecting degrees of the latent trait (e.g., easy to difficult; most to least endorsable), by incorporating the ordering features of Guttman scaling, and by focusing on probabilistic distributions of examinees' performance at the item level rather than on test-level information, RMT offers an alternative basis for constructing measurements within and beyond education (Pallant & Tennant, 2007; Rasch, 1960).
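These two ordering properties can be illustrated with the dichotomous case of the model, in which the probability of success is exp(β − δ)/(1 + exp(β − δ)) for person proficiency β and item difficulty δ, both in logits. The sketch below is a minimal illustration; the function name is ours and is not part of the MRQ analysis.

```python
import math

def rasch_probability(beta, delta):
    """Probability that a person with proficiency beta (logits) solves
    or endorses an item with difficulty delta (logits), under the
    dichotomous Rasch model."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

# Property 1: for a fixed item, the more proficient person has the
# higher probability of success.
assert rasch_probability(1.0, 0.0) > rasch_probability(-1.0, 0.0)

# Property 2: for a fixed person, the easier of two items has the
# higher probability of success.
assert rasch_probability(0.0, -0.5) > rasch_probability(0.0, 1.5)
```

When proficiency equals difficulty (β = δ), the probability is exactly .5, which is how item locations anchor the logit scale.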
Following the ordering of persons and items on the log-linear scale, the psychometric properties of an instrument can be determined; these include the dimensionality of the instrument, fit to the Rasch model (i.e., person and item fit), threshold ordering, differential item functioning (i.e., item bias), local independence and the person-separation index (PSI) (i.e., internal reliability). An important advantage of the Rasch approach derives from the fact that the item parameters obtained do not depend on the characteristics of the persons taking the test, and that the person parameters do not depend on the specific items chosen for a given test (Andrich & Marais, 2019; Bond & Fox, 2015). The parameters produced through Rasch analysis are thus independent of specific sample characteristics; person measures are interpreted with reference to the items defining the purported latent trait, as opposed to CTT, where person measures are interpreted with reference to the sample mean (Ewing et al., 2005). In view of this, the Rasch approach, taken as an elaboration of CTT, addresses the issues with the CTT-based factor analytic approach. Further, Rasch measurement models are tenable as a basis for examining scores obtained through rating scales, because they provide a means by which the hierarchical structure, unidimensionality and additivity of the scores can be evaluated.
It is noteworthy that while various terms have been used for Rasch analysis as applied to dichotomous or polytomous scoring structures, Andrich et al. (2018) suggested that there remains only one model (i.e., the unidimensional Rasch model for ordered categories) involving different types of items; terms such as the Dichotomous Model, the Partial Credit Model and the Rating Scale Model are unnecessary and have misled some to assume that there are different Rasch models.
Moral Reasoning Questionnaire
The 26-item MRQ was developed in accordance with the recommendations made by Lim & Chapman (2021b) in their review of existing moral reasoning measures, whilst considering the intended audience (see Appendix A). Based upon Kohlberg's stage-based theory of moral development (Kohlberg, 1984), the MRQ was initially developed with 30 items, of which four were discarded during the preliminary validation. The MRQ is intended to be delivered online and its items follow a two-tier response format. In tier one, respondents select one of two ‘action’ options after reading a moral dilemma vignette. Tier two, an ordering response format, is then presented based on respondents’ selection in tier one; it requires respondents to rank a set of options in order of importance to themselves, with each option corresponding to a level in Kohlberg's stage-based theory of moral development. Figure 2 presents an example of an item within the MRQ.

Example item with vignette and corresponding options.
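To make the two-tier structure concrete, it can be represented schematically as below. This is a hypothetical sketch: the vignette text, option labels and field names are illustrative and are not taken from the MRQ itself.

```python
# Hypothetical representation of one two-tier item. The tier-two
# options shown to a respondent depend on the tier-one action chosen,
# and each tier-two option maps to a level of Kohlberg's theory.
item = {
    "vignette": "A moral dilemma scenario is presented here.",
    "tier_one_actions": ["Action A", "Action B"],
    "tier_two_options": {
        "Action A": {"Option 1": "pre-conventional",
                     "Option 2": "conventional",
                     "Option 3": "post-conventional"},
        "Action B": {"Option 4": "pre-conventional",
                     "Option 5": "conventional",
                     "Option 6": "post-conventional"},
    },
}

# A response records the tier-one choice and the tier-two ranking
# (most to least important) of the options shown for that choice.
response = {"tier_one": "Action A",
            "tier_two_ranking": ["Option 3", "Option 2", "Option 1"]}
```

A design choice worth noting is that because tier two is conditional on tier one, both tiers must be stored together for a response to be scorable.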
Responses to the MRQ are scored based on the scoring matrix presented in Table 1. This scoring matrix had been applied during the preliminary validation and there was no evidence to suggest, based on the CTT factor analytic approach, that it was inappropriate (Lim & Chapman, 2021c) (see Appendix B for results).
Scoring matrix of two-tier items.
Participants
Data for this study were drawn from the responses of participants who took part in the preliminary validation, which was approved by the authors’ respective institutional review boards (IRBs). As required by the IRBs, each participant received a consent form and a participant information sheet specifying that: (1) involvement in the research was voluntary, (2) participants were free to withdraw at any stage without prejudice, with no reason required for withdrawal, and (3) all data would be anonymised and no participant would be identifiable. As all the participants were minors, parental consent was sought before each participant took part in the study. The participants were from three secondary schools (grades 8 to 12, aged between 12 and 18) in Singapore that agreed to support the study following the access permission granted by the MOE in 2015. Of the 670 participants whose parents/guardians agreed to let them participate in this study, 497 were female. The age range of the participants was 12 to 18 years (
The participants from the three participating schools were considered diverse as they represented different educational levels and streams. At the point of data collection, 17.6% (n = 118) of the participants were from secondary one, 27.8% (n = 186) were from secondary two, 28.4% (n = 190) were from secondary three and 26.3% (n = 176) were from secondary four. As to streams, 79.4% (
Rasch analysis
The Rasch analysis was conducted using the Rasch Unidimensional Measurement Model (RUMM2030) software version 5.4 (RUMM Laboratory Pty Ltd, Perth, Australia). The analysis was performed primarily to assess the fit of the data to an unrestricted Rasch model, without assuming a uniform distance between response thresholds. The MRQ and its items were evaluated, using parametric statistical tests, for: (1) threshold ordering and reliability, (2) overall model fit, (3) individual item and person fit, (4) item characteristic curves, (5) local dependency and differential item functioning and (6) dimensionality.
Threshold ordering and reliability
Initial results obtained from the Rasch analysis indicated the presence of disordered thresholds across a number of items in the MRQ, and a significant chi-square statistic, χ2 (270, N = 669) = 517.27,
The presence of disordered thresholds suggested that respondents might not have been able to distinguish between the six response categories presented within the MRQ based on the Rasch model. Given this result, a remediation was undertaken to collapse the response categories into a smaller number, as suggested by Andrich and Marais (2019); the data were re-scored based on this revised scoring matrix and subjected to a second round of Rasch analysis (results reported from here on). The revised categories, presented in Table 2, were premised on the following assumptions: (1) a respondent would score 2 if she or he identified the pre-conventional level as the lowest level of moral judgment; (2) a respondent would score 0 if she or he identified the levels of moral judgment in the order opposite to that of Kohlberg's stage-based theory of moral development; and (3) a respondent would score 1 for all other rank order permutations.
Scoring matrix of two-tier items with revised categories.
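The three assumptions above amount to a simple decision rule, which can be sketched as below. This is a hypothetical illustration assuming a tier-two response is recorded as a ranking of Kohlberg's three levels from most to least important; the labels and function name are ours, not part of the MRQ materials.

```python
# Kohlberg's levels ordered from highest to lowest stage.
KOHLBERG_ORDER = ("post-conventional", "conventional", "pre-conventional")

def revised_score(ranking):
    """Return the revised category score (0, 1 or 2) for a ranking of
    the three levels from most to least important."""
    if ranking == tuple(reversed(KOHLBERG_ORDER)):
        return 0  # ordering exactly opposite to Kohlberg's theory
    if ranking[-1] == "pre-conventional":
        return 2  # pre-conventional identified as the lowest level
    return 1      # all other rank order permutations

# A ranking consistent with Kohlberg's theory scores 2.
assert revised_score(("post-conventional", "conventional",
                      "pre-conventional")) == 2
```

Note that the rule checks the fully reversed ordering first, so assumption (2) takes precedence; all remaining permutations fall to assumption (1) or, failing that, assumption (3).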
In the Rasch analysis performed on the MRQ using the revised categories, items A5, A11, A13 and A20 were removed as disordered thresholds remained evident (i.e., persons at a higher moral reasoning stage showed a lower probability of endorsing a more endorsable category than persons at a lower moral reasoning stage, and vice versa). No other items had disordered thresholds under the revised categories. This agrees with the preliminary validation, which also identified these four items as likely causes of model misfit based on the factor analytic approach. Figure 3 presents the threshold map for the remaining 26 items and Table 3 shows the uncentralised item thresholds, which indicate that all category responses were used consistently as expected.

Threshold map for remaining 26 items.
Uncentralised item thresholds.
Despite the removal of items A5, A11, A13 and A20, the Rasch analysis continued to find adequate reliability estimates (i.e., a
Overall model fit
In terms of overall model fit, the χ2 test for fit to the Rasch model remained significant at χ2 (234,
Individual item and person fit
Examining the individual item and person fit outputs, fit residuals should lie within the range of ± 2.5 for an item to be considered fitting to the Rasch model (Tennant & Conaghan, 2007; Tennant & Pallant, 2006), though Andrich and Marais (2019) stated that ‘there are no absolute criteria for interpreting fit statistics’ (p. 196). In this case, all items (with Bonferroni adjustment) except for A9 (2.71) and A26 (−2.65) were within the threshold and
Fit statistics for MRQ with revised categories.
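The fit residual criterion builds on a basic quantity: the standardized difference between an observed score and its model expectation. The sketch below is a simplified illustration under the polytomous Rasch model with uncentralised thresholds; RUMM2030's reported fit residuals involve further aggregation and transformation of such quantities, and the function names here are ours.

```python
import math

def category_probs(beta, thresholds):
    """Category probabilities for a person at beta logits under the
    polytomous Rasch model, given uncentralised adjacent-category
    thresholds tau_1..tau_m: P(x) is proportional to
    exp(sum over k<=x of (beta - tau_k))."""
    weights = [1.0]
    for tau in thresholds:
        weights.append(weights[-1] * math.exp(beta - tau))
    total = sum(weights)
    return [w / total for w in weights]

def standardized_residual(observed, beta, thresholds):
    """(observed - expected) / sd of the score: the basic building
    block behind item and person fit residuals."""
    probs = category_probs(beta, thresholds)
    expected = sum(x * p for x, p in enumerate(probs))
    variance = sum((x - expected) ** 2 * p for x, p in enumerate(probs))
    return (observed - expected) / math.sqrt(variance)
```

Large positive or negative standardized residuals accumulated over persons are what push an item's fit residual beyond the ± 2.5 range.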
Item characteristic curves
Item characteristic curves (ICC) are reviewed as part of any Rasch analysis (e.g., when the

ICC of item A9.

ICC of item A26.
Though the fit residuals of items A9 and A26 were slightly beyond the ± 2.5 range, their χ2 probabilities with Bonferroni adjustment were not less than .01. In view of this and the modest item misfit, it was concluded that all 26 items measure a common underlying construct (i.e., moral reasoning). With regard to person fit, only 10 respondents (excluding extreme cases) had fit residuals outside the ± 2.5 range, ranging between −3.69 and 3.11. This could indicate anomalies in the score patterns of these respondents, which may have reflected fatigue. As there were no data entry errors and the overall model fit was good, these respondents were not removed.
Local dependency and differential item functioning
Local dependency was investigated based on the residual correlation matrix (Pallant & Tennant, 2007). From the Rasch analysis, the maximum inter-item residual correlation (
Differential item functioning (
To appreciate the extent of the item bias, the ICC with level plots for item A10 of the MRQ (Figure 6) was generated and reviewed. Visually, there is item bias between the secondary one (S1) and secondary four (S4) levels at 0 to 0.5 logits and above 2.5 logits. The uniform

ICC with level plots for item A10.
Dimensionality
Further to the non-significant item-trait interaction χ2 statistic, evidence from the principal components analysis (

The findings presented so far point to a good overall fit to the Rasch model and support the unidimensionality of the MRQ. The person-item threshold distribution, which places student (person) and item location estimates on the same logit scale (Figure 8), shows that the items and thresholds spanned almost the full range of person scores, except for some respondents who scored very high on the MRQ. This could be explained by the MOE's expectation that most secondary school students fall within the conventional to post-conventional levels of moral development.

Person-item threshold distribution.
Further analyses suggested that inferences drawn from the measure would not be confounded by students’ demographic attributes (i.e., gender, school or educational stream). Females did have slightly higher moral reasoning scores, but did not differ significantly from males (

Person-item measure threshold distribution by gender.

Person-item measure threshold distribution by school.

Person-item measure threshold distribution by educational stream.

Person-item measure threshold distribution by level of study.
Discussion
Based on the analyses presented in six areas (i.e., threshold ordering and reliability, overall model fit, individual item and person fit, item characteristic curves, local dependency and differential item functioning, and dimensionality), there was adequate evidence to affirm the unidimensionality, and hence the intended purpose, of the MRQ. The MRQ items were also shown to function as anticipated, based on the fit of the data to the Rasch model.
The Rasch analysis presented item A7 as the least endorsable item on the log-linear scale (i.e., item location = .998), with adjacent category thresholds of −.4 and 2.4 logits (Figure 13). Based on the polytomous Rasch model as expressed by equation (1) (Andrich & Marais, 2019), the probabilities of a student with a proficiency at zero logits scoring 0, 1 and 2 are .38, .57 and .05, respectively; the probabilities of an average student with a mean of the order of 1.624 (Figure 8) scoring 0, 1 and 2 are .23, .65 and .11, respectively. In the same vein, the most endorsable item on the log-linear scale, item A15 (i.e., item location = −.779), had adjacent category thresholds of −1.2 and −.3 (Figure 14). Based on equation (1), the probabilities of a student with a proficiency at zero logits scoring 0, 1 and 2 are .11, .37 and .52, respectively, and those of a student with a mean proficiency of the order of 1.624 are .00, .00 and .94, respectively.

Threshold probability curve of item A7.

Threshold probability curve of item A15.
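As a check on these figures, the category probabilities implied by the published thresholds can be computed directly. The sketch below assumes the standard polytomous Rasch form, in which the probability of score x is proportional to exp(Σk≤x (β − τk)) with uncentralised thresholds τk; the function name is ours. Under this assumption the zero-logit probabilities for item A7 reproduce the values reported above; small discrepancies elsewhere may reflect rounding of the published thresholds and person means.

```python
import math

def category_probs(beta, thresholds):
    """Category probabilities for a person at beta logits under the
    polytomous Rasch model, given uncentralised adjacent-category
    thresholds."""
    weights = [1.0]
    for tau in thresholds:
        weights.append(weights[-1] * math.exp(beta - tau))
    total = sum(weights)
    return [w / total for w in weights]

# Item A7 (thresholds -.4 and 2.4), person at zero logits:
print([round(p, 2) for p in category_probs(0.0, [-0.4, 2.4])])
# -> [0.38, 0.57, 0.05], matching the reported probabilities

# Item A15 (thresholds -1.2 and -.3), person at zero logits;
# the result is close to the reported .11, .37 and .52.
print([round(p, 2) for p in category_probs(0.0, [-1.2, -0.3])])
```

The same function, evaluated across a grid of β values, generates the threshold probability curves shown in Figures 13 and 14.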
Considering these and that the MOE expects most secondary school students to fall within the conventional to post-conventional levels of Kohlberg's stage-based model, it appears that the targeting of the MRQ (Figure 8) could be further refined and subsequent versions of the MRQ could include more ‘difficult to endorse’ items to assess the conventional to post-conventional levels of moral reasoning.
More could also be done to establish the MRQ as a measure of moral reasoning independent of cognition and intelligence, given that various previous studies have reported correlations in the .20 to .50 range between moral judgments and measures of intelligence, aptitude and achievement (Rest, 1979; Thoma & Dong, 2014). The data used in this study were from students in mainstream secondary schools, and hence the MRQ appears to be fit for purpose for Singapore mainstream secondary school students. To ascertain that there is no
Though the Rasch analysis suggested that inferences drawn from the measure would not be confounded by students’ demographic attributes (i.e., gender, school or educational stream), the disproportionate sample by gender and educational stream, owing to availability sampling, could superficially suggest otherwise. Hence, a more representative sample could be invited to participate in subsequent studies. With more data, measurement invariance by educational stream and gender could be affirmed.
Disordered thresholds that were identified through this study called for the MRQ scoring matrix to be revised. Hence, the revised scoring matrix should be used moving forward so that a log-linear person-measure of the MRQ can be established for the meaningful comparison of respondents’ moral reasoning.
Based on the triangulation of evidence across all of these analyses, the present study together with the preliminary validation present the MRQ as an instrument that can be used in Singapore secondary schools to monitor students’ development in the area of moral reasoning. As an accessible instrument with sound psychometric properties founded upon both CTT and RMT, the MRQ would be suitable for use on a large-scale basis. A further advantage of this instrument is that minimal training is required for teachers to administer and score the test. This adds further support to the notion that the MRQ can provide a practical means by which students’ development in moral reasoning can be monitored, hence addressing a major gap identified in this context (Lim & Chapman, 2021b).
Conclusion
This paper detailed how the RMT approach was used to validate the moral reasoning scale based on the MRQ, how the analyses were interpreted and how identified issues were resolved. The RMT approach undertaken in this study served as an elaboration of the CTT-based factor analytic approach used within the preliminary validation of the MRQ.
The Rasch analysis found evidence to support, amongst the reported psychometric properties, the unidimensionality and intended purpose of the MRQ, though issues related to disordered thresholds were identified. This led to a revised scoring matrix upon which further analyses found that invariant comparisons of persons and items could be drawn. Hence, it appears that the MRQ presents a viable scale free of
By its nature, the ‘validation process never ends, as there is always additional information that can be gathered to more fully understand a test and the inferences that can be drawn from it’ (AERA, APA & NCME, 2014, p. 21). While this study is an extension of the preliminary validation and presents the MRQ as an instrument holding considerable promise for use within the Singapore context, further research might be needed to support adoption on a widespread basis. As a concluding example, other Rasch analysis software could be applied to ascertain the findings of this study.
