Introduction
This study examines measurement invariance (MI) of student perceptions of teaching quality. National studies conducted in different countries support the validity of using student perceptions to describe and study variation in teaching quality (e.g., Downer et al., 2015; Ferguson, 2012; Maulana & Helms-Lorenz, 2016; Sauerwein & Theis, 2021; van der Lans et al., 2019; Wagner et al., 2013). However, these studies do not indicate how such descriptions and results compare between countries. The aim of this study is to explore MI of student perceptions in five different countries, in order to provide an indication of how results obtained with student perception surveys compare internationally.
To date, studies examining measurement invariance of student perceptions of teaching quality are relatively rare. Notable exceptions are the studies by André et al. (2020) and Scherer et al. (2016), which report evidence of partial invariance between countries. More specifically, André et al. (2020) and Scherer et al. (2016) found evidence supporting (partial) metric invariance but no support for scalar invariance. Both studies applied the Multiple Group Confirmatory Factor Analysis (MGCFA) method, which is rooted in the factor analysis framework. A novelty of this study is that it applies the Partial Credit Model (PCM; a polytomous Rasch model) to examine MI of student perceptions of teaching quality.
Masters’s (1982) PCM, and Muraki’s (1992) Generalized (G)PCM, are popular methods for the assessment of MI of cognitive tests in international large-scale assessments (ILSAs), like the Program for International Student Assessment (PISA), the Progress in International Reading Literacy Study (PIRLS), and the Trends in International Mathematics and Science Study (TIMSS). The popularity of the (G)PCM in ILSAs might be explained by the flexibility it offers for international comparisons. Specifically, the (G)PCM allows researchers to relate scores on one instrument to those of another, techniques referred to as "scaling to achieve comparability" or "linking" (Kolen & Brennan, 2014). In ILSAs, between-country comparisons are challenged by variation in curricula, which makes it impossible to administer exactly the same item content in all countries. Therefore, linking is used to enhance international comparisons (e.g., Oliveri & von Davier, 2011). Although linking is not unique to the (G)PCM, this model provides additional flexibility for its application (Kolen & Brennan, 2014). In this traditional use, linking increases the comparability of cognitive tests with different content. Another potential benefit of linking is its application to adjust for non-invariance of the same test or questionnaire administered in different countries (e.g., Oliveri & von Davier, 2011, 2014). Given the high likelihood of finding evidence of partial invariance in international comparisons of student perceived teaching quality, the second aim of this article is to further explore whether and how linking can benefit international comparisons of student perceived teaching quality in situations of non-invariance.
The research questions are as follows:
To what extent are scores of student perceptions of teaching quality invariant across countries?
How does perceived teaching quality in different countries compare?
Background
Conceptualization of Teaching Quality
This study applies a conceptualization of teaching quality that is grounded in the literature on teaching and teacher effectiveness (e.g., Hattie, 2008; Muijs et al., 2014; van de Grift, 2014). Studies on teaching and teacher effectiveness have repeatedly found some behaviors to be effective, meaning that they contribute to student learning and school success. Examples of such effective teaching behaviors include providing students with clear examples, having students think aloud, and requesting students to reflect on their learning approaches. In this study, manifestations of effective teaching behavior are conceptualized as representing indications of teaching quality.
The variety in effective teaching behaviors is typically categorized into and/or summarized by five to seven broader factors or domains (Bell et al., 2019; Muijs et al., 2014). Prior research in Indonesia, South Korea, the Netherlands, South Africa, Spain, and Turkey applied CFA and MGCFA and provides evidence that in all six countries the variety in effective teaching behaviors is well represented by a six-factor structure (André et al., 2020; Inda-Caro et al., 2019; Maulana & Helms-Lorenz, 2016). These studies refer to the factors as domains; the six domains are labeled safe and stimulating learning climate, efficient classroom management, clear and structured instruction, activating teaching, teaching learning strategies, and differentiation. The domains and an example item related to each domain are presented in Table 1.
The Six Domains, Their Conceptualization, and One Example Item of the “My Teacher” Questionnaire.
The present study extends the work of André et al. (2020). More specifically, it introduces and examines the invariance of a complementary conceptualization, in which all effective teaching behaviors are hierarchically ordered along one latent continuum of teaching quality. This conceptualization is grounded in theories on teacher development proposed by Berliner (2004) and Fuller (1969). Theories on the development of teaching quality generally describe its acquisition as unfolding along one single continuum. Furthermore, the theories describe the continuum as a sequence of five phases (Berliner, 2004) or three stages (Fuller, 1969).
Van de Grift et al. (2011) used these theories on teacher development to logically derive a single continuum of effective teaching behaviors. Their proposed model matched the phases and stages described by studies on teacher development with the six domains of effective teaching. Based on this match, they hypothesized a hierarchical ordering of the six domains, starting with the domains comprising the least complex teaching behaviors (the acquisition of which marks the novice teacher who starts learning to teach) and ending with the most complex effective teaching behaviors (the acquisition of which marks the expert teacher). Being well aware of natural deviations from such stage-like hierarchical orderings, Van de Grift et al. (2011) suggested that the ordering should be assessed probabilistically. Figure 1 sketches their proposed representation, in which the six domains are ordered from least to most complex along the continuum.

A non-empirical example of the theorized continuum of teaching quality.
Evidence related to this conceptualization has been gathered in multiple studies, using a mixture of classroom observation and student questionnaire methods. Evidence obtained with both methods confirmed and further specified this hierarchical ordering of effective teaching behaviors (Maulana et al., 2015a; van de Grift et al., 2014; van der Lans et al., 2015, 2017, 2018, 2019). The ordering of domains approximately follows that presented in Table 1, with the exception of the final two domains: the questionnaire method estimates the domain teaching learning strategies as most complex, whereas the observation method follows the ordering as presented in Table 1. The current evidence base is, however, mostly restricted to the Dutch context. Notable exceptions are Indonesia (Maulana et al., 2015b), Cyprus (e.g., Kyriakides et al., 2018), and Turkey (Telli et al., 2020). To date, no studies have addressed the international invariance of the ordering of effective teaching behaviors.
Student Perceptions of Teaching Quality
This study examines teaching quality as perceived by students. The term "perception" highlights that students' item scores reflect their subjective experiences in the corresponding teachers' classes. This means that any two students in the same class can have different experiences and, thus, different perceptions. When this study refers to the probability that teachers display effective teaching behaviors, this probability is estimated based on students' perceptions. When the study refers to estimations of teaching quality, it always refers to student perceived teaching quality.
Empirical evidence indicates that student perceptions vary primarily as a function of teachers' teaching quality (e.g., van der Lans & Maulana, 2018; van der Scheer et al., 2019; Wagner et al., 2013). Concerns with student perceptions mostly involve the potential for bias (e.g., Marsh & Roche, 2000; Spooren et al., 2013). Unlike classroom observers, students are not trained to score teaching quality using predetermined standards, and it is unclear which norms or standards students apply when scoring the behaviors of their teachers. As will be discussed later in the article, the present examination of MI may provide some insight into whether the strength of student perception biases varies between countries.
Partial Credit Model: A Polytomous Rasch Model Approach to Study Cross-Country Comparisons
Our prior research applied the Rasch model to gather evidence supporting an ordering in effective teaching behaviors. Mathematically, the Rasch model can be expressed as (Rasch, 1960):
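In standard notation, with Xpi denoting student p's dichotomous response to item i, the model reads:

$$P\left(X_{pi}=1\right)=\frac{\exp\left(\beta_{p}-\delta_{i}\right)}{1+\exp\left(\beta_{p}-\delta_{i}\right)}$$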
where βp estimates the student perceived position of the teacher on the continuum of teaching quality and δi estimates the location of the effective teaching behavior on the same continuum. Furthermore, δi expresses what increase in teaching quality is predicted if the teacher successfully displays effective teaching behavior i.
The Rasch model is applicable to dichotomous item responses. The PCM extends the Rasch model by introducing an item step parameter δik for each threshold between adjacent response categories, which makes the model applicable to polytomous item responses such as Likert-type scales.
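In Masters’s (1982) formulation, the probability that student p selects response category x (out of categories 0, 1, ..., mi) on item i is:

$$P\left(X_{pi}=x\right)=\frac{\exp\left(\sum_{k=1}^{x}\left(\beta_{p}-\delta_{ik}\right)\right)}{\sum_{h=0}^{m_{i}}\exp\left(\sum_{k=1}^{h}\left(\beta_{p}-\delta_{ik}\right)\right)},\qquad\text{with }\sum_{k=1}^{0}\left(\beta_{p}-\delta_{ik}\right)\equiv 0 .$$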
By using the PCM, the study keeps a connection with multiple prior within-country studies indicating that students' perceptions of effective teaching behavior fit the Rasch model (Bacci & Caviezel, 2011; Bradley et al., 2006; Kyriakides et al., 2009; Maulana et al., 2015a; van der Lans et al., 2015). It can also provide a perspective complementary to prior studies that used (MG)CFA (e.g., André et al., 2020; Scherer et al., 2016).
Rasch-type models and factor analytic models
Several popular software packages, like Mplus (Muthén & Muthén, 2019) and mirt (Chalmers, 2012), enable researchers to rescale parameters estimated using confirmatory factor analysis (CFA) into PCM parameters. These possibilities may give the impression that the two models are identical. However, despite their mathematical equivalence, factor analytic models and Rasch-type models are conceptually different. The difference becomes most tangible in how the two models assess model-data fit. Because factor analytic techniques enjoy considerable popularity, this conceptual difference is briefly explained.
Factor analytic fit tests are conceptually associated with classical test theory (CTT). Central to CTT is the argument that single observations are unreliable and that reliable estimates can be derived by averaging over multiple parallel observations (Graham, 2006). Factor analysis treats items as potentially parallel observations associated with one (or more) common factor(s). Expressed in a variance–covariance matrix, factor analytic fit tests assess the prediction of uniform item covariance. Misfit to a one-factor model indicates that some item(s) are not essentially tau-equivalent (for more details, see Graham, 2006, or Jöreskog, 1971). Because factor analysis considers items to be parallel "replications" of the same latent factor/continuum, fit is typically estimated for the latent factor/continuum, and variation in item parameters is typically interpreted as nuisance.
Contrary to factor analysis, Rasch-type models assume that items vary in complexity (δi; more commonly referred to as difficulty) and therefore are not parallel observations (Brennan, 2010; Guttman, 1954). Expressed in a variance–covariance matrix, Rasch-type model fit tests assess the prediction that item covariance decreases as a function of the distance between item (step) locations on the continuum (Browne, 1992; Guttman, 1954). This decreasing pattern is known as a simplex structure and violates the typical criteria set by factor analysis to assess parallelism (for details, see Jöreskog, 1978). In Rasch-type models, item parameters are not nuisance parameters, which explains why fit is estimated per item and model fit is typically expressed by the joint item fit.
Partial Credit Model and Measurement Invariance: Interpretation and Meaning of Non-Invariance
In the Rasch framework, MI is typically examined in terms of Differential Item Functioning (DIF; French et al., 2019; Mazor et al., 1994). In this study, two types of DIF are distinguished: uniform DIF (U-DIF) and non-uniform DIF (NU-DIF; Walker, 2011). U-DIF estimates between-country differences in the location of the same effective teaching behavior on the continuum. It signals that the teaching behavior (item) is associated with higher teaching quality in one country compared to another and that this difference is uniform across the continuum of teaching quality. NU-DIF, instead, estimates between-country differences in the slope or steepness with which the probability of an item response increases. It signals that the strength of the association of the teaching behavior (item) with the continuum of teaching quality varies between countries (Smith, 2002; Walker, 2011).
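In (generalized) PCM terms, this distinction can be sketched by letting the item parameters depend on country g; the slope parameter α below is introduced only for illustration and extends the β and δ notation used earlier:

$$P\left(X_{pig}=x\right)\;\propto\;\exp\left(\sum_{k=1}^{x}\alpha_{ig}\left(\beta_{p}-\delta_{ikg}\right)\right).$$

U-DIF corresponds to step locations δikg that differ between countries while the slopes αig are equal across countries; NU-DIF corresponds to slopes αig that differ between countries.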
Figure 2 visualizes two possible scenarios of NU-DIF, which have different implications. When NU-DIF is constant (Scenario c in Figure 2), item slopes are parallel within countries, but the strength of the association between student perceived teaching behavior and the continuum of teaching quality varies between countries. When NU-DIF is inconstant (Scenario d in Figure 2), the item slopes within one or more of the countries are not parallel. This implies that in one or more countries no hierarchical ordering of teaching behaviors, as described above, can be derived.

Four possible DIF-scenarios: (a) U-DIF with constant difference in location, (b) U-DIF with inconstant difference in location, (c) NU-DIF with constant difference in slope, and (d) NU-DIF with inconstant difference in slope.
Likewise, two scenarios can be derived for U-DIF. When U-DIF is constant (Scenario a in Figure 2), students in one country perceive all (or most) effective teaching behaviors as more complex than students in another country. Because the direction and size of this shift in complexity is constant, we deem it more likely that the shift is due to differences between students' perceptions (e.g., between-country differences in the subjective standards and norms applied by the students [i.e., strictness]) than that it represents differences in actual manifestations of effective teaching behaviors. Finally, when U-DIF is inconstant (Scenario b in Figure 2), the evidence indicates between-country differences in how students hierarchically order the effective teaching behaviors. This scenario is likely when the actual manifestation of effective teaching behaviors in classrooms varies between countries.
We deem U-DIF most plausible, but also argue that it has less severe consequences for the measurement of teaching quality. Evidence suggesting between-country variation in the actual manifestation of teaching behaviors in classrooms, for example, does not suggest real departures from the hypothesized continuum, and it seems valid to apply linking in an attempt to improve between-country comparisons. The presence of NU-DIF, however, may have more severe consequences. The slope parameter provides information about an item's association with the latent continuum (Embretson & Reise, 2000; Fox, 2010), where lower slope parameters indicate a weaker association of an item with the continuum. Extending this interpretation, NU-DIF indicates that the student perceptions of effective teaching behaviors (items) are not related to the continuum (trait) in the same way across countries (Smith, 2002; Walker, 2011). Such differences in association seem unrelated to differences in the actual teaching behaviors manifested in classrooms and introduce room to speculate about between-country differences in the impact of perception biases. Finally, NU-DIF may also indicate that the continuum derived by van de Grift et al. (2011), which echoes prior theory on teacher development, does not generalize to other countries. The study will not apply linking to adjust (or correct) for NU-DIF.
Linking: Utility of Partial Credit Model Approach for International Empirical Research
The PCM offers approaches to adjust for non-invariance in the form of linking (Ndosi et al., 2011; Oliveri & von Davier, 2011, 2014; Tennant et al., 2004). Applications of linking have been referred to as "quasi-international calibration" (Oliveri & von Davier, 2011, 2014), "top-down purification" (Tennant et al., 2004), and "splitting of non-invariant items" (Ndosi et al., 2011). The different terms reflect that the techniques are used for different purposes and differ in some technical details; nonetheless, they follow the same underlying logic. In this study, quasi-international calibration is applied. Quasi-international calibration fixes invariant items and splits the non-invariant items by country (Oliveri & von Davier, 2011, 2014). The resulting continuum combines etic effective teaching behaviors, which have a culturally general location in the hierarchy, and emic effective teaching behaviors, which have culturally specific locations (Ndosi et al., 2011).
Context of the Current Study
The Netherlands
International comparisons in secondary and primary education show that students attending Dutch schools perform above average, comparable to other high-performing European and Asian educational systems (Mullis et al., 2016; Organisation for Economic Co-operation and Development [OECD], 2018). Teacher education for secondary education is divided into two tracks. Teaching the lower levels of secondary education requires a second-degree teacher qualification, which takes four years of training (bachelor's degree). Teaching the higher levels of secondary education, that is, the upper grades of senior general secondary ("havo") and pre-university ("vwo") education, requires a first-degree teacher qualification: a subject-relevant master's degree plus an additional master's at one of the university-based teacher education institutes. The teaching profession does not have an above-average status, yet the quality of teachers is generally high, with the large majority mastering the basic teaching skills well (OECD, 2016c).
South Korea
The South Korean educational system is among the top-performing systems in PISA and TIMSS (Mullis et al., 2016; OECD, 2018). Secondary school teacher training is offered as a four-year bachelor's program, which confers a second-class certificate (later promoted to first class through on-the-job experience) qualifying holders to teach at both middle (grades 7–9) and high (grades 10–12) schools. To teach at schools, certificate holders must pass a highly competitive recruitment examination, with a recent average competition ratio of 10:1, which secures a tenured position until the age of 62 (Korean Education Statistics Center [KEDI], 2020). South Korea's student performance shows a low percentage of underachieving students and a high percentage of excellent students. The South Korean system emphasizes teaching quality and ongoing development in the teaching profession. The teaching profession is regarded as a highly respected, high-status profession. Teachers are recruited from the top graduates, with strong financial and social incentives, including social recognition as well as opportunities for career advancement and beneficial occupational conditions (Kang & Hong, 2008; OECD, 2016b).
Indonesia
The Indonesian educational system is among the lower-performing systems in PISA (OECD, 2016a). Among the many components of the education system, Indonesian teachers play an important role in ensuring its success (Jalal et al., 2009). Teacher education for secondary education is offered as a four-year program at universities (bachelor's degree). Teacher certification is tied directly to teachers' ability to demonstrate useful competencies, including meeting minimum levels of subject matter proficiency (de Ree, 2016b). Fasih et al. (2018), however, found that teacher certification is uncorrelated with students' learning outcomes. They suggest that this is due to the teacher training program, which does not require the implementation or demonstration of knowledge and skills in practice.
South Africa
The South African educational system is developing, but its current performance is, from an international perspective, weak. Based on TIMSS 2015, the country ranked second to last in mathematics and last in science (Mullis et al., 2016). Moreover, of 139 participating countries, South Africa ranked 137th for overall quality of education (Baller et al., 2016). Teacher training programs consist of a four-year bachelor's degree course offered at higher education institutions. In addition, students qualified with specific content bachelor's degrees, for example, engineers and scientists, can complete a Post Graduate Diploma to become qualified secondary school teachers. This Post Graduate Diploma equips potential teachers with the competencies and pedagogical knowledge to teach diverse groups of students (Machingambi, 2020). Although significant improvements in basic and tertiary education have been made, the quality of education and teacher education is still not on par with other developing countries (van der Berg, 2015). For example, Taylor et al. (2013) showed that in six South African universities, only 6% of the curriculum for teacher training and development covers how a teacher should teach a student to read. The education system still encounters various challenges, which have been argued to relate to the barrier of English as a second language of instruction, insufficient subject knowledge of some teachers, a lack of teacher accountability, frequent absenteeism of teachers from classes, and the socioeconomic status of most students (Howie et al., 2012; Mbiti, 2016).
Spain
Spain performs around the average on PISA and TIMSS, but regional differences are relatively large (Hippe et al., 2018). The Southern region scores just above 470 points on PISA, whereas the capital of Madrid and the North-West score above 500, closer to the Dutch average performance. Teacher training for primary education takes four years and is completed with a university degree.
Method
Sample and Data Management
In total, five participating countries, namely Indonesia, South Korea, the Netherlands, South Africa, and Spain, collected survey data using the My Teacher Questionnaire (MTQ) from the students of 4,918 teachers. Most survey data came from the Netherlands.
Inclusion and exclusion criteria
In all countries, data from one school year were selected. Furthermore, additional inclusion criteria were applied to the Dutch sample of teachers.
Sample Descriptives for Each of the Five Countries.
Analyses were performed on two types of samples: (a) the complete sample and (b) five randomly selected subsets. The complete sample has a nested design in which students grouped in the same class all score the same teacher, and analyses need to correct for this nested data structure (Hox, 2002). When corrections are not applied, the size of standard errors is likely underestimated, which in turn increases the probability of Type I errors. In the context of item fit tests, Type I errors imply that we remove (or flag) items that actually fit. Hence, using the complete sample to assess item fit would imply an unnecessarily strict assessment of item fit. Multilevel statistics can effectively remove bias due to the nested design, but such corrections are not standardly available in PCM estimation software. Therefore, the subsets were constructed by randomly selecting one student per class, resulting in five subsets of equal sample size.
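As an illustration, the following R sketch shows one way to draw such subsets; the data frame and column names (mtq_data, class_id, item1–item41) are hypothetical stand-ins for the actual data, and the subsets are drawn independently here, whereas the exact sampling scheme of the study may differ.

```r
set.seed(2021)

## Toy stand-in for the survey data: one class identifier plus 41 item
## responses scored 0-3 (200 classes of 20 students each).
mtq_data <- data.frame(
  class_id = rep(1:200, each = 20),
  matrix(sample(0:3, 200 * 20 * 41, replace = TRUE), ncol = 41,
         dimnames = list(NULL, paste0("item", 1:41)))
)

## Draw one randomly chosen student per class so that rows are independent.
pick_one_per_class <- function(df, class_col = "class_id") {
  pick <- function(rows) rows[sample(length(rows), 1)]
  idx  <- unlist(lapply(split(seq_len(nrow(df)), df[[class_col]]), pick))
  df[idx, ]
}

## Repeat five times to obtain the five subsets used for the fit analyses.
subsets <- lapply(1:5, function(i) pick_one_per_class(mtq_data))
```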
Missing values
The overall number of missing values was low (0.8% of all item responses), but some of the returned questionnaires showed multiple missing values. We excluded questionnaires with more than five missing values (1.5% of all questionnaires), of which 11 questionnaires were from Indonesia, two from South Korea, 55 from the Netherlands, 341 from South Africa, and 29 from Spain. The reasons why South Africa has the largest number of missing values are unclear. Presumably, they relate to the conditions of the students during the survey, which may include low literacy (difficulty understanding certain questions), disruptions (surveys were administered in classes of between 36 and 47 students), low familiarity with responding to surveys, limited resources (e.g., no pens or pencils), and insufficient support from the teachers or survey administrators.
Measurement Procedures and Model
My Teacher Questionnaire (MTQ)
The MTQ was constructed to measure student perceptions of teaching quality and is based on previously validated versions (e.g., Maulana et al., 2015a; van der Lans et al., 2015). This version of the MTQ comprises 41 items that operationalize six domains: safe learning climate, efficient classroom management, clear and structured instruction, activating teaching, teaching learning strategies, and differentiation (see also Table 1 in the background section). Response categories were provided on a 4-point Likert-type scale ranging from 1 (never) to 4 (often), which was recoded into 0–3 (1 = 0, 2 = 1, 3 = 2, 4 = 3), as required for the intended PCM analysis.
Translation procedure
In the five countries, the questionnaire was translated from English to the target language and back-translated in accordance with the guidelines of the International Test Commission (Hambleton, 2001; van de Vijver & Tanzer, 2004). This procedure was recommended because it takes into account the linguistic as well as the cultural and psychological aspects involved. The target languages were as follows: Dutch for the Netherlands, Korean for South Korea, Bahasa Indonesia for Indonesia, English for South Africa, and Spanish for Spain. In each country, the translation and back-translation process involved two researchers highly knowledgeable about the technical and conceptual details of the MTQ and two university experts proficient in both English and the target language. During the process, issues and discrepancies were discussed thoroughly and subsequently resolved by the core research team. Although the process was long and laborious, the issues discussed were relatively minor and revolved around choosing the most representative word equivalents and the accuracy of word choice. The research team confirmed the relevance of the MTQ items in their own national contexts, providing evidence for face validity.
Measurement model
This study applies the Partial Credit Model (PCM; Masters, 1982). The PCM was chosen because it (a) keeps a connection with multiple prior within-country studies indicating that students' perceptions of effective teaching behavior fit Rasch-type models (Bacci & Caviezel, 2011; Bradley et al., 2006; Kyriakides et al., 2009; Maulana et al., 2015a; van der Lans et al., 2015), (b) can be generalized to include a discrimination parameter (Muraki, 1992), which is important to assess NU-DIF, and (c) can handle items with different numbers of response categories. The latter two advantages anticipate flexibility that may be required in future research.
Analysis Plan
Step 1: Model and item fit
As a first step, the presence of the hierarchical ordering was evaluated in the separate countries. Dimensionality was assessed using (a) principal component analysis (PCA), (b) simplex analysis (Browne, 1992; Guttman, 1954), and (c) Mokken's H-coefficient (results only in Supplementary File Chapter 1; van der Ark, 2007).
PCA is not specifically developed to assess hierarchical ordering, but is a general factor-analytic approach. It was estimated in R, and the number of components to extract was evaluated with Horn's parallel analysis (PA).
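As an illustration, Horn's parallel analysis can be run in R with the psych package; this is one possible implementation rather than necessarily the one used here, and resp_country is a hypothetical matrix of one country's item responses.

```r
library(psych)

# resp_country: hypothetical matrix of item responses (0-3) for one country
fa.parallel(resp_country, fa = "pc", cor = "poly")  # parallel analysis of
                                                    # principal components on
                                                    # polychoric correlations
```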
As a second step, item fit was estimated using the mean square (MS) item infit and outfit coefficients. The traditionally advised cutoff criterion for MS infit and outfit is 1.20 (Bond & Fox, 2007), but more recent simulation studies show the necessity of adapting the criteria to the number of items and the sample size (Seol, 2016). The number of items included is 41, and the cutoff criteria were therefore adapted to the number of items and the sample sizes of the subsets.
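A minimal sketch of such an item fit check, assuming the eRm package as one possible implementation (not necessarily the package used in the study); resp_country again denotes a hypothetical response matrix.

```r
library(eRm)

# resp_country: hypothetical matrix of item responses (0-3) for one subset
pcm_fit <- PCM(resp_country)          # partial credit model (CML estimation)
pp      <- person.parameter(pcm_fit)  # person parameters, needed for residuals
itemfit(pp)                           # reports MS infit and MS outfit per item
```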
Step 2: Evidence of MI
U-DIF and NU-DIF were assessed in R using two criteria: the change in McFadden's pseudo R-square and a Chi-square test (see the Results section and Table 4).
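For illustration, the lordif package implements a DIF screening that reports exactly these two criteria; the package choice and the objects resp_subset and country are assumptions for the sketch, not necessarily the implementation used in the study.

```r
library(lordif)

# Flag items for DIF across countries. lordif compares nested ordinal logistic
# regression models: theta only, + country (uniform DIF), and + country x theta
# interaction (non-uniform DIF).
dif_fit <- lordif(resp_subset, country,
                  criterion = "Chisqr",    # Chi-square flagging criterion
                  pseudo.R2 = "McFadden",  # effect-size (pseudo R-square) measure
                  alpha     = 0.01,
                  model     = "GPCM")      # polytomous IRT model for scoring

summary(dif_fit)  # which items were flagged, and by which model comparison
```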
Validating DIF results
False-positive DIF results can occur in samples that have different distributions of background variables. Imagine that an item has DIF for gender and that gender is unequally distributed among the countries. To validate the results of Step 2, DIF analyses were repeated using a selection of the complete sample that matched the five country samples on student gender and student age. The selection of these two background variables was based on preliminary DIF analyses.
Step 3: Linking through quasi-international calibration
To answer the second research question, differences in country-average student perceived teaching quality were compared between the standard international calibration approach, which assumes that all items are invariant, and a quasi-international calibration approach. Differences between the calibration methods were expected because of prior results indicating partial measurement invariance (e.g., André et al., 2020; Scherer et al., 2016). In case the calibration results differed, model fit estimates were compared to indicate which calibration method is preferable from a purely data-driven perspective.
Quasi-international calibration methods
Two approaches to quasi-international calibration were applied, namely concurrent and separate calibration. In the concurrent approach, all item parameters were estimated in one step by fixing the invariant items to be equal across countries and estimating country-unique item parameters for the non-invariant items (Oliveri & von Davier, 2011, 2014). In the separate approach, item parameters were estimated per country and subsequently placed on a common scale by linking the countries in a chain (cf. Stocking & Lord, 1983). Both analyses were performed in R.
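A minimal sketch of the concurrent approach, assuming the mirt package as one possible implementation; the objects resp and country and the anchor item names are hypothetical and only illustrate the constraint logic.

```r
library(mirt)

# resp:    matrix of recoded item responses (0-3); country: grouping vector
anchor_items <- c("item05", "item17", "item23", "item31")  # hypothetical names

calib <- multipleGroup(
  data       = resp,
  model      = 1,                            # one latent continuum
  group      = country,
  itemtype   = "Rasch",                      # Rasch/PCM parameterization
  invariance = c(anchor_items,               # equal parameters for anchors only
                 "free_means", "free_var")   # let group means/variances differ
)

coef(calib, simplify = TRUE)  # shared (anchor) vs. country-specific parameters
```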
Applications of concurrent and/or separate quasi-international calibration are relatively novel. Also, various psychometric models can be used, though the results might have different interpretations. The available evidence concerning the concurrent calibration method indicates that it is quite robust. Arai and Mayekawa's (2011) simulation study, for example, examined the number of invariant items required to validly perform concurrent calibration; their results indicated that concurrent calibration may be valid with few, perhaps even fewer than five, invariant items. This finding is corroborated by an empirical study by Chen et al. (2009). Another simulation study, by Liu et al. (2011), examined whether the invariant items need to cover the complete continuum; their results signal that this might not be a requirement.
Fit of calibrations
No uniform standard currently exists to estimate the fit of quasi-international concurrent or separate calibration. Prior work applied psychometric models other than the PCM applied here (Ndosi et al., 2011; Oliveri & von Davier, 2011, 2014; Tennant et al., 2004), and each reports a different estimate of model and/or item fit. This study reports country-mean item and person MS-outfit statistics. The outfit statistic equals the Chi-square value divided by its degrees of freedom (df). Outfit values of 1.00 indicate complete model fit, and the further values depart from 1.00, the lower the model fit. The country-mean outfit statistics are supplemented with the minimum and maximum to give an impression of the distribution. Unfortunately, the R package used for the separate calibration does not provide these fit statistics; the fit of the separate calibration is therefore unknown.
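For reference, the item MS-outfit is the average squared standardized residual over the N persons (equivalently, a Chi-square divided by its degrees of freedom):

$$\text{MS-outfit}_i=\frac{1}{N}\sum_{p=1}^{N}z_{pi}^{2},\qquad z_{pi}=\frac{x_{pi}-E\left(x_{pi}\right)}{\sqrt{\operatorname{Var}\left(x_{pi}\right)}}$$

The person MS-outfit is defined analogously, averaging over the items answered by a person.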
Results
Step 1: Screening of Model and Item Fit in the Separate Country Data
Results of the PA method and the simplex analysis are presented in Table 3. Guttman's simplex analysis indicates adequate fit of the data to the predicted simplex correlation structure in each country (RMSEA < 0.08). When applying Horn's PA method, the number of extracted factors varies but is greater than one in all countries. This was expected because the conceptualization predicts (six) local clusterings on the continuum. Furthermore, the PA method is sensitive to large sample sizes, and the sample sizes in this study were large in all five countries.
Summary of Results for the One-Dimensionality Analysis of the MTQ Student Perception Survey for Indonesia, South Korea, Netherlands, South Africa, and Spain.
Four items were found to misfit the continuum in multiple countries using the MS-infit and MS-outfit. These items were not considered in the analysis of MI (for details, see Supplementary File, Chapter 1).
Step 2: Evidence of MI
Table 4 summarizes the results of the NU-DIF and U-DIF analyses. The columns indicate the two criteria, namely McFadden's pseudo R-square and the Chi-square test, and indicate whether the item was flagged for U-DIF and/or NU-DIF. TRUE means that an item was flagged more than once in the five samples and according to both criteria. The results show that none of the items are (repeatedly) flagged for NU-DIF, but also that all but four items are repeatedly flagged for U-DIF; these four items showed neither U-DIF nor NU-DIF.
Overview of Items Flagged for Uniform-DIF (U-DIF) and Non-Uniform DIF (NU-DIF). TRUE Means That Items Are Flagged in More Than One of the Five Subsets.
Table 5 summarizes the pooled item location parameters of the six domains. The domains "efficient classroom management," "clear and structured instruction," "activating teaching," "teaching learning strategies," and "differentiation" are similarly ordered along the continuum in all five countries. The domain "safe learning climate," however, clearly has different locations between countries. In the Netherlands and Spain (Europe), the domain "safe learning climate" is located at the start of the ordering, near "efficient classroom management." In South Africa, Indonesia, and South Korea, the domain is positioned third or fourth and located closer to the domain "activating teaching." Furthermore, in South Korea and Indonesia (Asia), the specific items referring to "respect" are perceived by the students as located at the far end of the continuum of teaching quality. In terms of the conceptualization introduced above, this would imply that Indonesian and South Korean students associate these behaviors with "expert" teaching. This contrasts with the European students, who position items referring to "respect" at the start of the continuum.
Overview of DIF Between the Six Domains.
Step 3: Linking Through Quasi-International Calibration
Table 6 summarizes differences in the country median student perceived teaching quality for the raw total scores, the standard international calibration, the concurrent quasi-international calibration, and the separate quasi-international calibration.
Country Average Teaching Quality Scores and Fit of Teaching Quality Scores When Using (1) the Raw Total Scores, (2) the Standard International Calibration (Assuming Item Invariance), (3) the Concurrent Quasi-International Calibration, and (4) the Separate Quasi-International Calibration.
Sample size for South Africa dropped substantially because of list-wise deletion. Please see the Supplementary File Chapter 2 for comments and thoughts on this.
Pearson correlations indicate that scores from the two quasi-international calibrations are similar to those from the standard international calibration.
The concurrent quasi-international calibration has superior person fit estimates compared to the standard international calibration: the mean person MS-outfit ranges from 0.75 in South Korea to 1.47 in South Africa in the standard international calibration and from 0.93 in South Korea to 1.10 in South Africa in the concurrent calibration. Fit of the separate calibration method is unknown. Results of the separate quasi-international calibration were found to be sensitive to the ordering of the chain: if the chain is ordered differently, the results change. Thus, the separate calibration may yield the highest discrimination between countries, but its results are unreliable and the method needs further development. Wright maps are presented in the Supplementary File at the end of Chapters 3, 4, and 5; they give a quick overview of the match between item locations and person locations on the continuum of teaching quality.
Conclusions and Discussion
The current study investigated the measurement invariance (MI) of student perceptions of teaching quality across five countries: Indonesia, South Korea, the Netherlands, South Africa, and Spain. Furthermore, the study explored potential indications of differences in student perceived teaching quality across the five countries, based on the results generated for the first aim.
Research Question 1
The first research question is, "To what extent are scores of student perceptions of teaching quality invariant across countries?" The results indicate partial invariance: none of the items showed NU-DIF, but most items were flagged for U-DIF.
Although most items are flagged for U-DIF, we found four invariant items showing neither NU-DIF nor U-DIF. This means that these items are interpreted similarly, both statistically and content-wise, in the five countries. These four invariant items served as the fixed (anchor) items in the quasi-international calibrations.
Research Question 2
The second research question is, "How does perceived teaching quality in different countries compare?"
Looking across the results of the standard, concurrent, and separate calibrations, the ordering is relatively stable for the perceived teaching quality of South Korean, Dutch, and South African teachers. In all three calibration methods, South Korean teachers' teaching quality is perceived highest by their students, and Dutch teachers' teaching quality is perceived as fairly high. South African students perceive the teaching quality in their lessons as relatively low. Although the reasons why students perceived their teachers more favorably in South Korea and the Netherlands compared to South Africa were not identified in this study, discussing a conjecture about this may guide future research.
The superior student perceived teaching quality of South Korean teachers seems consistent with the academic performance of their students as documented in ILSAs (OECD, 2018). The South Korean educational system is regarded as among the top-performing systems in PISA and TIMSS (Mullis et al., 2016; OECD, 2018), with a low percentage of underachieving students and a high percentage of excellent students. The South Korean system emphasizes teaching quality and ongoing development in the teaching profession. Teachers are recruited from the top graduates, with strong financial and social incentives, including social recognition as well as opportunities for career advancement and beneficial occupational conditions (Kang & Hong, 2008; OECD, 2016b). These personal and contextual factors pertaining to South Korean schools likely contribute to their academic excellence, which could partly be reflected in this study in students' perception of their teachers' teaching quality.
Similarly, the position of the Dutch teachers is consistent with the academic performance of their students as documented in ILSAs (OECD, 2018). The quality of Dutch teachers is generally high, with the large majority mastering the basic teaching skills well (OECD, 2016c). Teacher qualification in the Netherlands involves a relatively high level of academic loading. Teaching the higher levels of secondary education, that is, senior general secondary ("havo") and pre-university ("vwo") education, requires a first-degree teacher qualification (also known as an academic teacher qualification), which is obtained with a subject-relevant master's degree in addition to a master's at one of the university-based teacher education institutes. The second-degree teacher qualification takes four years but does not require a subject-relevant master's degree. For the Dutch sample, the years of teaching experience are known (this is unknown for all other countries), and the number of beginning teachers included in the Dutch sample is relatively large and thus likely deviates from the other country samples. The relatively young age of the Dutch teachers might be argued to contribute to the relatively high student perceived teaching quality, although most studies indicate that beginning teachers show lower teaching quality (Kini & Podolsky, 2016).
The comparatively poor performance of South African teachers is also consistent with the student academic performance documented in ILSAs (Mullis et al., 2016). The country continues to work toward educational excellence, although basic infrastructure and cultural factors, such as the multiple official languages, remain a big challenge; students are generally instructed in English as a second language (Howie et al., 2012). Teacher training institutes and professional development are still relatively weak. A recent review of the teacher training programs of six South African universities suggested that only 6% of the curriculum for teacher training and development covers how a teacher should teach a student to read (Taylor et al., 2013).
Finally, the results for Spanish and Indonesian teachers varied between the concurrent and separate calibration methods. The Spanish teachers were close to the Dutch teachers according to the concurrent calibration but scored much lower in the separate calibration, and Indonesia ranked lowest in the concurrent calibration but third (and average) in the separate calibration. In ILSAs, Spain performs around the average on PISA and TIMSS, and Indonesia performs poorly compared to other countries in PISA (OECD, 2016a). Hence, the results of the concurrent calibration show more overlap with the outcomes of ILSAs than the results of the separate calibration. Yet, this overlap might also be explained by similarity in the applied calibration methods: the calibration and linking methods applied by ILSAs are conceptually more comparable to concurrent calibration than to separate calibration.
In sum, results based on the concurrent calibration of perceived teaching quality tend to be more consistent with ILSA results on student academic performance. This tendency provides an important insight because teaching quality has been shown to be one of the most significant factors for student learning and outcomes (Hattie, 2008). Although it is tempting to view this tendency as evidence of the validity of concurrent calibration, we suggest that it is currently too early to draw such conclusions and that further research on the stability and consistency of the separate and concurrent calibration methods is required.
Limitations and Directions for Future Research
Although the present study has multiple strengths, it is also subject to some limitations. The study relies on convenience sampling. The Dutch sample disproportionately includes perceptions of younger students (mean age 13 years) and likely includes a relatively large share of younger teachers. The data from South Korea, South Africa, and Spain cover only several regions of each country. Hence, we caution against generalizing the findings until replications with more representative samples are available.
It was not possible to apply the linking while taking into account the nested data structure, due to the currently limited availability of software for estimating such models. Although the random selection of the five sample subsets accounts for the hierarchical structure of the data in its own way, it remains unknown to what extent the results would differ if between-teacher variance were modeled statistically. Future research is advised to further explore possibilities for linking international data on perceived teaching quality while taking the nested data structure into account, once the technical support becomes available.
The concurrent and separate quasi-international calibration methods provided distinct results. This inconsistency complicates practical applications of the quasi-international calibration method, and it remains inconclusive which of the two calibration methods should be preferred to increase the fairness of cross-country comparisons. The concurrent calibration is conceptually less complex, but it applies a strict assumption about the invariant items, which are assumed to have identical item location parameters between countries (Oliveri & von Davier, 2011, 2014). This strict assumption does not apply to the separate calibration method (Stocking & Lord, 1983). From an applied perspective, our findings indicate that the separate quasi-international calibration has the largest impact on the country comparison, but its fit is unknown and its outcomes depend on the applied chain sequence. Hence, the present study suggests that both methods require further development before this approach can be applied to data on perceived teaching quality.
A Final Note
The present study is part of a larger project that attempts to construct an infrastructure that can be used to measure effective teaching globally and to report results concerning country-average differences in teaching quality. The infrastructure includes countries with different cultural values, which creates a need to maximize flexibility while adhering to important principles of measurement. The results show the complexity of building this type of infrastructure and at the same time underline its importance for the field of teaching and educational effectiveness in general. Currently, most empirical evidence is accumulated from research using raw mean and sum scores of teaching quality. Our results suggest that these raw scores might be biased estimators of teaching quality and that this bias might, at least partially, be corrected by using a quasi-international calibration method. Whether the application of these methods will lead to novel or alternative insights about teaching and its effectiveness remains inconclusive. We will continue to build on this infrastructure to better understand teaching effectiveness and how to measure it globally.