Abstract
Keywords
Introduction
Large-scale educational tests, such as college admission tests, are an inherent part of education systems throughout the world. Test scores have been used extensively by educational authorities, internationally and nationally, to make decisions related to gatekeeping, accountability, and policymaking. The increasing power of test scores has spawned an ever-growing demand for test preparation inside and outside schools aimed at helping students prepare for tests (Bray, 1999; Buchmann et al., 2010; Ross, 2008). Some schools provide students with test preparation courses to achieve greater accountability. This approach makes the schools directly responsible for equipping students with the knowledge and skills needed to excel in exams, reflecting their commitment to students’ academic excellence. In addition, a large number of commercial educational institutions have built well-established test preparation training systems that not only provide group courses and private tutorials but also publish practice books or even develop online training systems. The underlying rationale for these activities is that test preparation is conducive to maximizing test results.
Although the test preparation industry is flourishing, there has been insufficient research exploring how, and to what extent, test preparation is effective. Existing literature seems to converge on the notion that test preparation, in most cases, has a positive effect on test performance (e.g., Appelrouth et al., 2017; Briggs, 2001, 2009; Bunting & Mooney, 2001; Hausknecht et al., 2007; Powers & Rock, 1999). However, the effects reported in those studies are inconsistent and inconclusive. Some studies detected a moderate to strong effect size (e.g., Hausknecht et al., 2007,
Evaluating the Evidence of Test Preparation
Working Definition
Various terms have been used in the literature to describe teachers’ and students’ test preparation behaviors, such as coaching, teaching to the test, test-oriented instruction, test familiarization, private tutoring, and test-wiseness. In this study, we use “test preparation” as an inclusive term, referring to “any intervention procedure specifically undertaken to improve test scores, whether by improving the skills measured by the test or by improving the skills for taking the test, or both” (Messick, 1982, p. 70). Under this broad definition, any learning activity, whether driven by teachers or by students, can be considered a test preparation activity as long as its main intention is to improve performance in a specific test. Based on this understanding, the indicator of the effects of test preparation activities could be either the observed gain in a test-taker’s scores after some form of test preparation has taken place, or the gap in test performance between those who engaged in test preparation and their unprepared cohorts (Anastasi, 1981; Briggs, 2001; Messick, 1982).
A large number of empirical studies have reported teachers’ and students’ engagement in test preparation activities (e.g., Anastasi, 1981; Appelrouth et al., 2017; Briggs, 2009; Lai & Waltman, 2008). In this article, we categorize test preparation activities as either student-driven or instructor-driven, based on Briggs’s (2009) classification. Student-driven test preparation takes various forms, including but not limited to self-studying test preparation booklets, using online materials or computer programs, practicing sample test items or taking practice tests, and asking previous test-takers about their experiences. Student-driven test preparation has been consistently reported as the most used test preparation method (Briggs, 2009). In comparison, instructor-driven test preparation mainly comprises coaching or training courses on admission tests or other high-stakes tests in both formal schools and commercial educational institutes (Bangert-Drowns, Kulik, & Kulik, 1983; Briggs, 2009).
Previous Empirical Research on the Test Preparation Effect
Although the effect of test preparation has been investigated by a considerable number of empirical studies, only a few used a (quasi) experimental design. Most studies used self-report surveys to investigate participants’ test preparation histories, then compared the test performance of prepared students to that of unprepared students (e.g., Briggs, 2001; Domingue & Briggs, 2009; Hu & Trenkic, 2021; Powers & Rock, 1999). The effects reported were inconsistent. Briggs (2001) and Domingue and Briggs (2009) derived data from the National Educational Longitudinal Study in the United States to explore the relationship between students’ use of different test preparation methods and score gains on the SAT or ACT. The results showed that the use of a private tutor and enrollment in commercial coaching courses were the only test preparation methods that had small but positive effects on SAT-mathematics scores (about 13–15 points), while no form of test preparation had statistically significant effects on SAT-verbal scores or ACT scores.
Studies conducted in different test contexts reported small or even negative effects of instructor-driven test preparation (i.e., Griffin et al., 2013; Trenkic & Hu, 2021; Xie, 2013). For example, Xie (2013) examined the test preparation effect on scores of the College English Test; the multiple regression and structural equation modelling results revealed that preparing for the test by drilling had a significant effect on test scores (score gains of .27 SDs), while other test preparation approaches, such as developing general language skills, did not affect test scores. Griffin et al. (2013) investigated whether attending a commercial test preparation course affects performance on the Undergraduate Medical and Health Sciences Admission Test (UMAT). They found that enrollment in commercial coaching courses had inconsistent effects on students’ performance across different sections of the test and for different levels of student ability. Their results showed that test preparation had no effect on performance in the problem-solving and “understanding people” sections of the test, while it had positive effects on nonverbal or abstract reasoning performance for high-ability students and negative effects for lower-ability students.
Given the inconsistent findings of the aforementioned studies, it would be problematic to draw a conclusion about the test preparation effect. Moreover, some studies using self-report surveys lacked control of confounding variables and ignored the heterogeneity of test preparation (Briggs, 2009); their findings may therefore be limited in how accurately they capture the effect of test preparation. In addition, specific characteristics of test preparation activities, such as material use, duration, and students’ strategy use, were rarely collected and reported in self-report surveys. As a result, empirical research has so far been unable to provide a comprehensive picture of test preparation activities and their effects; thus, a research synthesis is required.
Previous Reviews
Tracing back to the 1980s and 1990s, a considerable number of review articles synthesizing (quasi) experimental studies on test preparation effects were published, primarily on coaching, but the effects reported in these articles varied. One of the earliest and most influential meta-analytic reviews, conducted by Bangert-Drowns, Kulik, and Kulik (1983), investigated the coaching effect on achievement tests using data from 30 controlled studies. This meta-analysis reported a significant effect of coaching programs on ACT scores (average ES = .25), which was affected by the level of training intervention. Specifically, they found that test preparation programs designed to improve broad cognitive skills yielded the largest effects (ES = .66), extensive programs that included drill practice had medium effects (ES = .43), and short test-oriented sessions had smaller effects (ES = .17). Bangert-Drowns et al. argued that their finding was in line with what exam boards had long claimed—that long-term preparation of broad knowledge and abilities could have a larger effect than simple drill or practice on items. However, this research had some inherent limitations due to the imbalance in the number of studies included, raising concerns about the generalizability of the findings. Most interventions (22 out of 30) included in this meta-analysis were short test-oriented programs, while only one study featured an intervention focused on broad skill training. The effect of coaching on broad skills was therefore highly likely to be overestimated, as it was pooled from a single study.
In a later meta-analysis on the effect of coaching on aptitude tests, Kulik, Bangert-Drowns, and Kulik (1984) found that the SAT was less affected by coaching than other aptitude and intelligence tests. Coaching on the SAT had a small effect (estimated ES = .15), whereas coaching on other aptitude and intelligence tests had a substantial effect (estimated ES = .43). This finding was echoed by Becker’s (1990) comprehensive synthesis of the effect of coaching programs on SAT scores, in which she concluded that although test preparation improved SAT scores in general, its effect was small and varied widely by test domain. According to her study, test preparation interventions raised SAT-verbal scores by .09 SDs and SAT-math scores by .16 SDs, which implied that the math section of aptitude tests is more susceptible to test preparation than the verbal section.
Similar results were reported in a more recent systematic review of the effects of preparatory courses on SAT scores (Montgomery & Lilly, 2012). Ten experimental studies published between 1970 and 2000 on the effect of SAT preparation courses yielded an average score gain of 23.5 points on the verbal subtest and 32.7 points on the math subtest (each scored out of 800 points). Moreover, long coaching programs (more than 8 hours) produced a significantly larger effect on SAT-math scores than short programs (8 hours or less).
In addition to reviews of coaching effects, some reviews focused on the effect of taking tests repeatedly on score increases, known as the practice test effect or the retest effect. The earliest meta-analysis of the practice effect was conducted by Kulik, Kulik, and Bangert-Drowns (1984). Their study, which synthesized data from 40 studies on the effect of taking practice forms of a test prior to the criterion aptitude or achievement test, revealed that practice tests identical in form to the criterion test yielded a significantly larger effect on test scores (ES = .42) than practice tests that used alternative forms (ES = .23). Kulik, Kulik, and Bangert-Drowns’s (1984) finding is broadly in line with two later meta-analyses of the retest effect for cognitive ability tests, conducted by Hausknecht et al. (2007) and Scharfen et al. (2018). Hausknecht et al.’s (2007) study reported an adjusted overall effect size of .26, while Scharfen et al.’s (2018) study revealed that retaking cognitive ability tests multiple times can lead to score gains of .5 SDs; they also found that the retest effect stagnates after taking the test more than three times. However, considering that learning domain-specific knowledge and developing relevant abilities is a gradual process, whether practice tests on their own can increase scores on proficiency tests or knowledge tests is still questionable and needs further exploration.
To sum up, previous reviews identified small coaching and practice effects on test performance. However, several limitations in these reviews make it difficult to generalize the findings. Specifically, previous reviews concentrated on the coaching effect for college admission tests in the US context or the practice effect on cognitive ability tests; the effect of test preparation on other proficiency tests or knowledge tests of specific content domains, across a wider range of educational testing contexts, was not addressed by existing reviews, impeding our understanding of the variance in test preparation effects.
Moreover, most reviews were published decades ago and synthesized test preparation effects from articles published before 2000. Findings drawn from these articles may not be applicable in today’s educational environment. For these reasons, this meta-analysis was conducted to provide more up-to-date conclusions about test preparation effects and to extend the results to multiple educational contexts.
Theoretical Underpinning of Test Preparation Effect
To explain the effect of test preparation, various interpretations have been offered in previous empirical studies and reviews, such as the improvement of memory and knowledge retention (e.g., Hausknecht et al., 2007), the development of proficiency or abilities (e.g., Messick, 1982; Xie, 2013), increased test-wiseness (e.g., Madaus, 1988), increased motivation and confidence (e.g., Hong & Peng, 2008; Xie & Andrew, 2013), or reduced test anxiety (e.g., Messick & Jungeblut, 1981). Arendasy et al. (2016) summarized four possible models from the existing literature to explain the causes of test preparation effects. The first postulates that score gains should be attributed to familiarization with test-taking procedures and test structure, which reduces construct-irrelevant variance: for prepared test-takers, the first few test questions are easier because they do not need to spend time familiarizing themselves with the structure of the test. However, such familiarization gains cannot reveal students’ abilities because they are not relevant to the construct being assessed. The second model, according to Arendasy et al., states that engaging in test preparation improves test-wiseness and test-taking skills, such as guessing and eliminating wrong answers, which can improve the likelihood of solving test items correctly without affecting the underlying construct. The third model ascribes the score gain to teaching construct-relevant materials, which results in the development of test-specific cognitive abilities or an increase in domain-specific knowledge. The fourth model is based upon the premise that the increase in test scores is caused by the improvement of generalizable cognitive abilities.
These four models form a comprehensive theoretical framework of test preparation effect. There are numerous potential factors influencing the effect of test preparation that can be deduced from the four models, which are investigated in the current paper. Moreover, given that Arendasy et al.’s (2016) explanations were mainly drawn from retest effect literature, we will examine the applicability of this framework in the test preparation context and explore other possible underlying mechanisms of test preparation effect.
Potential Moderator Variables
The aforementioned empirical studies and reviews have shown that instructional characteristics of test preparation interventions (e.g., methods, the use of practice tests, instructional emphasis, duration, transferability) contribute to variation in test score improvement. In addition, a body of literature has indicated that test-specific features (e.g., domain, type of tasks, cognitive ability) and individual differences (e.g., age, ability level, prior attainment, previous test experience) explain variation in test preparation effects (e.g., Arendasy et al., 2016; Briggs, 2009; Kulik et al., 1984; Powers, 1986; Powers & Swinton, 1984). In line with previous empirical evidence, we examined whether, and the extent to which, different implementations of test preparation in different testing contexts have varied effects on individuals’ test performance. Moreover, the effectiveness of test preparation is likely to be contingent on study design (Becker, 1990; Briggs, 2009; Hausknecht et al., 2007); thus, design characteristics were also examined in our moderator analysis.
Design characteristics of test preparation intervention
The following moderator variables have been selected as test preparation design characteristics: type of intervention, level of intervention, instruction on strategy use, material use, and duration of the intervention.
We also analyzed the
Moreover, we examined whether the
Furthermore, we investigated if different
Test-specific characteristics
The following two moderating variables associated with test-specific features were seen as relevant: academic domain and nature of the test.
We examined whether the
Moreover, we included
Student characteristics
There is plentiful evidence that individual differences are possible moderators of students’ test preparation effect (Appelrouth et al., 2017; Xie & Andrew, 2013). For example, Kulik et al. (1984) found that the average test preparation effect is highest for high-ability students (ES = .82). Similarly, Griffin et al. (2013) found that coaching had a significant effect on the reasoning section of UMAT for high ability students, while for those of lower ability, the effect was negative.
Besides, it has been indicated that students’ past test experience might moderate test preparation effect. Domingue and Briggs’s (2009) study revealed that only students with previous test-taking experience can significantly benefit from private tutoring on the ACT.
In addition, test preparation processes could also operate differently for students from different education stages and age groups due to differences in self-regulated learning abilities, which tend to increase at higher ages. Kitsantas (2002) argued that students who use more self-regulatory skills (such as goal setting, keeping records and monitoring, and self-evaluation) during preparation have higher test performance.
We therefore investigated whether student characteristics such as
Study design
It is also possible that studies with a more rigorous experimental design yielded greater effects than studies of lower quality. Previous review articles indicated that randomized experiments reported significantly higher effects of test preparation on test success than quasi-experiments (Kulik et al., 1983, 1984); however, only a small proportion of studies included in previous reviews used a randomized design with a control group (Bangert-Drowns, Kulik, & Kulik, 1983; Briggs, 2009). Thus, in this study, we also examined whether
Method
Literature Search
The literature search was initially carried out in December 2019 using five electronic bibliographic databases:
Inclusion and Exclusion Criteria
The inclusion and exclusion criteria of this meta-analysis were:
1) The included articles examined the effect of test preparation activities as listed in the working definition. The treatment condition(s) had to include some form of test preparation intervention (i.e., test preparation courses or computer programs) or instructional intervention toward test-taking for one specific test. Articles focused on other educational activities whose main purpose was not test preparation were excluded, even if they used test scores to indicate students’ attainment (e.g., Hensley et al., 2021).
2) The included articles investigated test performance on large-scale tests designed for educational purposes and conducted in real-life educational contexts. Articles that investigated the preparation effect on course tests or tests developed by researchers (e.g., Bol & Hacker, 2001), such as memorization tests, cognitive tests in operational contexts, and neuropsychological tests administered for clinical diagnoses, were excluded.
3) The included articles examined the test preparation effect on a non-self-reported measure of test performance. Articles that used self-reported test scores as dependent variables were excluded.
4) The included articles used either an experimental or quasi-experimental design with at least one treatment group and one control group, allowing causal inferences regarding the effect of test preparation on test performance. Initial equivalence between the treatment and the control groups had to be assured. Included studies had to provide pretest measures or covariate-adjusted measures from which we could compute pretest- and/or covariate-adjusted effect sizes (Ye et al., 2023; one common formulation is sketched after this list). Studies without a control condition and/or without initial equivalence between the treatment and the control groups were excluded (e.g., Gaye, 2001; Lane, 2009). In addition, articles that reported effects that could not be attributed to a test preparation intervention were excluded; for example, some articles asked students to report whether they participated in test preparation and compared their scores (e.g., Li & Xiong, 2018; Xie, 2013).
5) Sufficient data (i.e., means, standard deviations, sample sizes) had to be reported in each included study to enable the computation of effect sizes.
6) Sufficient information about the test preparation program had to be reported, such as the materials utilized, the teaching methods employed, the learning activities students undertook, and the duration of the program. Articles without a detailed description of the test preparation procedures were excluded.
7) Included studies could be conducted at any education level and were not limited to any specific study site, but only studies published in English were included to ensure the reliability and coherence of coding the key elements of test preparation intervention.
8) Only SAT-related studies conducted after 1978 and ACT-related studies conducted after 1985 were included, as example items for the two tests were not made available to the public by the testing bodies prior to those years (A [Mostly] Brief History Of The SAT And ACT Tests, n.d.), so teachers and students had very limited access to test-related materials. Any SAT-related studies conducted before 1978 and any ACT-related studies conducted before 1985 were excluded.
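As an illustration of criterion 4 (the exact procedure follows Ye et al., 2023, cited above; the formula below is one common formulation, not necessarily the one used in every included study): a pretest-adjusted standardized mean difference divides the difference between the groups’ pre-post gains by the pooled pretest standard deviation,

$$ d_{\text{adj}} = \frac{(\bar{X}_{T,\text{post}} - \bar{X}_{T,\text{pre}}) - (\bar{X}_{C,\text{post}} - \bar{X}_{C,\text{pre}})}{SD_{\text{pre,pooled}}}, $$

where $T$ and $C$ denote the treatment and control groups.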
Literature Filtering
Of the 5629 retrieved records in total, 2228 duplicates were removed. We then screened the titles, abstracts, keywords, and subheadings of each article for possible inclusion. The first author independently screened all articles, and a trained research assistant screened 50% of the articles. For studies whose abstracts did not provide sufficient information, we read sections related to the research design to gather more information. Studies not examining issues related to the test preparation effect or not using quantitative methods were excluded. The interrater reliability for the title-abstract screening was .92, reflecting a good level of agreement. Disagreements were resolved through discussion. After the screening stage, 163 studies were retained. The first author and the trained research assistant then engaged in full-text screening independently, reading the full texts of all 163 studies to determine whether they met all inclusion criteria. The interrater reliability for full-text screening was .87. At this stage, disagreements primarily centered on inclusion criteria 4, 5, and 6. For example, some studies reported themselves as experimental studies but lacked a complete control condition, or their effects could not be attributed to the test preparation intervention because covariates were not controlled during the data analysis. These issues required careful reading and discernment, which posed challenges for the raters’ judgment. The two raters discussed their disagreements until a consensus was reached. As a result, 28 studies met all the inclusion criteria and were included in the subsequent analysis. The literature searching and processing stages are shown in Figure 1.

Figure 1. Flow diagram of literature search and processing of records.
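Two-rater agreement figures like those reported above are often computed as Cohen’s kappa; the specific statistic used here is not named, so the following is only a minimal R sketch, assuming the irr package and hypothetical screening decisions:

```r
library(irr)  # provides kappa2() for two-rater agreement

# Hypothetical include/exclude decisions from two independent raters
ratings <- data.frame(
  rater1 = c("include", "exclude", "include", "exclude"),
  rater2 = c("include", "exclude", "exclude", "exclude")
)

kappa2(ratings)  # unweighted Cohen's kappa
```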
Data Extraction
The included studies were systematically analyzed by making use of two coding schemes: one on the study level and another on the effect size level. In the first scheme, five groups of variables were coded for each study, including general study descriptors (e.g., study type and publication year), test preparation characteristics (e.g., duration of intervention, type of intervention, level of instruction), student characteristics (e.g., gender, ability level, education level), test-related variables (e.g., subject domain, nature of test), and study characteristics (e.g., type of comparison group, allocation). In the second coding scheme, raw data were collected to calculate the particular effect size, and all information about the dependent variable (the test scores, or assessment on long-term retention) was gathered.
We took an iterative approach to data extraction whereby three coders refined the classification of each variable as they progressed through the included studies during coding trials, to ensure that the classifications best characterized the literature. The classification used for each coding variable is listed below.
Nature of publication
Publications were classified as either journal articles, dissertations, or reports (conference reports, industrial reports, or exam reports). The year of each publication was specified.
Country
The country in which the study was conducted was specified.
Nature of test
The nature of the test was classified as either curriculum-based or non-curriculum-based. Taking the two widely used college admission tests in the United States as examples, the ACT is a curriculum-based test, while the SAT is a non-curriculum-based test.
Stakes of the test
The stakes of the test that students prepared for were classified as either high stakes or low stakes. High-stakes tests are tests with important consequences for students (Madaus, 1988), while low-stakes tests do not have important consequences associated with test performance. It is important to note that whether a test is high-stakes or low-stakes is relative to the test-takers. For example, in one of the included studies, Hardison and Sackett (2008) examined the effect of short-term, rule-based coaching on standardized writing tests using writing prompts from the College Level Examination Program (CLEP) to assess students’ writing performance. The CLEP can be high stakes for students who want to use it to receive college credit. However, in Hardison and Sackett’s (2008) study, students were not taking the real CLEP but used CLEP prompts to assess their writing improvement after the intervention; thus, the test in this study was coded as low stakes.
Test type
Based on the evaluation of all included articles, we classified the large-scale educational tests in this meta-analysis as either college admission tests (such as the SAT and ACT) or others.
Subject domain of the test
The subject domain targeted by the test in each individual study was classified as either verbal, quantitative/mathematics, science, foreign language, or mixed. If outcomes from multiple domains were reported in an article, the outcome for each domain was recorded and separate effect sizes were reported. If the study reported an overall combined score across multiple domains, the subject was coded as “mixed.”
Format of the test
The inclusion of multiple-choice questions in the test was classified as yes or no, as was the inclusion of short-response questions. In some included articles, authors did not mention information about the test format; thus, the coders checked the official website of the test to gather the information.
Test administration
The administration of the test was classified as either mock, indicating that mock tests were used for the posttest, or real, indicating that participants in the study sat for the real test after test preparation intervention.
Type of test preparation intervention
The test preparation intervention was classified as commercial courses, in-school courses, after-school courses, or self-study. Note that in addition to specialized courses provided by commercial test preparation centers, we also regarded paid courses offered in public or private schools as commercial courses, because those paid courses were selective and some were taught by teachers from testing centers or test preparation institutes (e.g., Filizola, 2008). Courses provided by schools or universities were coded as either in-school courses or after-school courses (i.e., extracurricular activities).
Breadth of test preparation intervention
Based on previous studies (Bangert-Drowns et al., 1983; Kulik et al., 1984; Hill et al., 2008), we classified the breadth of the test preparation intervention as broad, specific, or narrow. When the intervention focused on a general knowledge/skill area (which may overlap somewhat with the tested content), it was coded as broad. When the intervention targeted the same subject or area as the test but did not drill on the test, it was coded as specific. When the intervention focused only on test-specific content and format, or drilled on the exact items tested, it was coded as narrow.
Strategy use
The inclusion of each of five test preparation strategies was classified as yes or no. The five strategies investigated in the current meta-analysis were adapted from Xie and Andrew’s (2013) study:
Test preparation management (TPM) strategy, which resembles metacognitive strategies and refers to test preparation practices through analyzing test papers to identify frequently assessed areas, learning marking rubrics and sample responses to evaluate one’s own answers, conducting self-assessment of personal strengths and weaknesses, and effectively managing time for one's own test preparation process;
Test-taking strategy (TTS), which refers to practices whereby test-takers explore and rehearse test-taking skills for different sections of the test, including guessing, using context clues, eliminating incorrect answers, taking notes in the margin, and managing time during test-taking;
Drill, which refers to the intensive and repetitive practice of a narrow range of skills and knowledge tested by the test, such as memorizing frequently tested knowledge or facts and repeatedly practicing past test questions and remembering exemplar responses;
Socio-affective (SOAF) strategy, which refers to test-takers’ use of social strategies to seek support from teachers and peers, and their use of affective strategies to motivate themselves and to reduce test anxiety;
Broad knowledge skills development (BKSD) strategy, which refers to the learning strategies that test takers use to develop broad knowledge or skills via extensive exposure to and functional uses of knowledge or skills in authentic contexts.
If relevant features of any of the above strategies were described in a study, the study was coded as “yes” for the use of that particular strategy. If no relevant feature of a strategy was mentioned, the strategy was deemed not to have been used in the study and was coded as “no.”
Material use
The use of each of the following types of materials was classified as yes or no: commercial test preparation workbooks, timed practice tests, sample items, and computer-based programs. If an included study did not mention the use of a specific material, we assumed it did not use that material and coded it as “no.”
Time and duration of the intervention
We recorded the time (indicated by the number of contact hours) and duration (indicated by weeks) of each intervention.
Education level
Participants’ education level was classified as either tertiary, secondary, primary, or mixed. If a study recruited participants from a wide population spanning educational levels, it was coded as mixed (see Farnsworth, 2013, for an example).
School type
For included studies conducted in primary and secondary settings, we further classified the schools that provided test preparation courses into two types: public or private. Some experiments did not provide enough description of the school type and were thus coded as “unknown.”
Previous test experience
We recorded participants’ previous test experience. Four categories were found in the included studies: yes, no, some students had relevant experience, or not mentioned.
Allocation
Participant allocation to condition was classified as random allocation or nonrandom.
Sample size
Studies with small sample sizes might report larger, positive effect sizes than studies with larger sample sizes (Cheung & Slavin, 2016; Slavin & Smith, 2009). Thus, we recorded the sample size of each study and evaluated whether it served as a significant moderator. Additionally, the sample size was controlled as a covariate in the analysis of other moderators.
Comparison group
We identified two types of comparison groups in the included studies: no-test-preparation conditions and different-test-preparation conditions. Participants in a no-test-preparation condition usually did not receive any instruction or received instruction after the intervention (e.g., Moss et al., 2012). Some included studies compared the effects of different types of test preparation, which in many instances could be characterized as a business-as-usual condition; that is, two versions of a course were run—one provided specific test preparation tasks or instructions, and one provided the normal course as usual. For example, Winke and Lim (2017) compared the effects of three different interventions—explicit instruction, implicit instruction, and control—on second language listening test performance. We therefore extracted two comparisons from this study: explicit instruction vs. control and implicit instruction vs. control. The control condition provided students with instruction on American culture, which was regarded as broad instruction on second language learning and was therefore coded as a different-test-preparation condition.
Coding and Interrater Reliability
Three rounds of coding training were performed to allow coders to discuss and refine the classifications used to characterize each individual article. During the coder training, 20% of included studies (
Risk of Bias (RoB) Assessment
To ensure that the accuracy of the meta-analysis was not compromised by including study designs of varying quality, we conducted a comprehensive RoB assessment for all effect sizes individually. Since we amalgamated results from both randomized and nonrandomized studies, we applied the RoB 2 tool for randomized controlled trials (Sterne et al., 2019) and the ROBINS-I tool for nonrandomized studies (Sterne et al., 2016). The RoB assessment for all included studies was conducted by the first author and the trained research assistant. The interrater reliability between the two raters was .94. Disagreements were resolved through discussion.
Figures 2 and 3 depict summary plots of the RoB assessment for nonrandomized and randomized cases, respectively. Fourteen included studies reported 41 nonrandomized cases, which were assessed via the ROBINS-I tool, while 15 included studies reported 46 randomized cases, which were assessed via the RoB 2 tool. As shown in the two figures, the majority of included cases across all research designs were assessed as having a low risk of bias. The characteristics of the research designs were further analyzed in the moderator analysis to determine whether varying design features influence the magnitude of the test preparation effect. In addition, a sensitivity analysis was performed by re-running the meta-regression models while retaining only studies assessed as having a low risk of bias in all domains. The restricted model results were then compared with the full model.

Figure 2. ROBINS-I weighted summary plot.

Figure 3. RoB 2 weighted summary plot.
Statistical Analytical Strategy
A correlated and hierarchical random-effects meta-analysis model was fitted using R version 3.6.3 (R Core Team, 2019). For the analyses, we followed the procedures described by Borenstein et al. (2009) for effect size calculation, integration, and meta-regression for the moderator analysis. We used the standardized mean difference effect size to measure the effects of test preparation interventions between treatment and control groups on student test performance. The data on the effects of the included studies were gathered in an Excel sheet. Cohen’s
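For reference, the standardized mean difference (Cohen’s d) as defined in Borenstein et al. (2009) is

$$ d = \frac{\bar{X}_T - \bar{X}_C}{SD_{\text{pooled}}}, \qquad SD_{\text{pooled}} = \sqrt{\frac{(n_T - 1)\,SD_T^2 + (n_C - 1)\,SD_C^2}{n_T + n_C - 2}}, $$

where $T$ and $C$ index the treatment and control groups; the small-sample (Hedges’ $g$) correction multiplies $d$ by $J = 1 - \frac{3}{4(n_T + n_C - 2) - 1}$.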
Many studies included in our meta-analysis either reported multiple outcomes for the same participant groups or nested several treatment and control conditions in one study. Including those dependent effect sizes without adjustment can bias the results of the analysis, as studies reporting more effect sizes would carry disproportionate weight. Therefore, we employed robust variance estimation (RVE) to handle dependent effect sizes (Tanner-Smith et al., 2016), which accounts for both hierarchical cases (i.e., multiple experiments nested within a study) and correlated cases (i.e., different measures from the same participants), with small-sample corrections.
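As an illustrative sketch only (the exact package calls are not specified here), a correlated-and-hierarchical-effects model with RVE of this kind is commonly fitted in R with metafor and clubSandwich (robumeta is an alternative). The sketch assumes a hypothetical data frame dat with columns yi (effect size), vi (sampling variance), study, and esid, and an assumed within-study correlation of r = .80:

```r
library(metafor)       # multilevel meta-analytic models
library(clubSandwich)  # cluster-robust variance estimation (RVE)

# Impute a block-diagonal covariance matrix for effect sizes that share
# participants, under an assumed within-study correlation of .80
V <- impute_covariance_matrix(dat$vi, cluster = dat$study, r = 0.80)

# Correlated-and-hierarchical-effects model: effect sizes nested in studies
fit <- rma.mv(yi, V, random = ~ 1 | study / esid, data = dat)

# CR2 cluster-robust standard errors with small-sample corrections
coef_test(fit, vcov = "CR2", cluster = dat$study)
```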
Heterogeneity in the effect sizes was estimated using the Cochran Q-test. The I² statistic indicates the proportion of variation in effect sizes that is due to true between-study heterogeneity rather than sampling error; I² > 75% represents considerable heterogeneity (Higgins & Thompson, 2002).
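For reference, with $k$ effect sizes, inverse-variance weights $w_i = 1/v_i$, and weighted mean effect $\bar{d} = \sum_i w_i d_i / \sum_i w_i$, the two statistics are

$$ Q = \sum_{i=1}^{k} w_i \left(d_i - \bar{d}\right)^2, \qquad I^2 = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%. $$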
To investigate the moderating effects of possible factors that varied across studies, we conducted both subgroup analyses and meta-regressions. Although subgroup analysis can identify differences between subgroups by determining whether the confidence intervals around their effect sizes overlap, heteroscedasticity or multicollinearity may lead to biased estimates (Steel & Kammeyer-Mueller, 2002). To examine the overall impact of each moderator, we therefore performed separate meta-regressions for each predictor, controlling for sample size as a covariate (Cheung & Slavin, 2016; Slavin & Smith, 2009).
Assessment of Publication Bias
Publication bias is one of the threats to the validity of meta-analyses (Van Aert et al., 2019). It occurs when some studies remain unpublished because their results lack statistical significance (Banks et al., 2012). Carter et al. (2019) suggested that if the probability of publication bias is high, the analysis should not rely only on the random-effects model. To assess and report publication bias, we conducted a moderator test between published and unpublished studies to explore the possibility that published studies had significantly higher effect sizes than unpublished ones. Funnel plots were also produced in RStudio to inspect asymmetry and heterogeneity. Considering the relatively small sample of empirical studies, funnel plots and Egger’s regression test were used to evaluate whether small-study effects existed.
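A minimal sketch of how these checks are commonly run with metafor (one standard route: funnel() for the plot and regtest() for an Egger-type asymmetry test), reusing the hypothetical data frame dat from above:

```r
library(metafor)

# Univariate random-effects fit used for the bias diagnostics
fit_uni <- rma(yi, vi, data = dat)

funnel(fit_uni)   # funnel plot: observed effect sizes against standard errors
regtest(fit_uni)  # Egger-type regression test for funnel-plot asymmetry
```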
Results
Descriptive Statistics
Overall, findings from 28 primary studies reporting 92 effect sizes were extracted for the meta-analysis. Descriptive information of all included studies is reported in Table 1. The majority of studies investigated test preparation effects on college admission tests such as ACT and SAT (
Table 1. Descriptive characteristics of the included studies
Table 2. Results of the subgroup analysis
Overall Meta-analysis of the Effect of Test Preparation
In the overall model, a significant positive effect of test preparation on large-scale criterion test performance was found:

Figure 4. Forest plot of all recorded effect sizes.
The heterogeneity between the studies’ effect sizes was significant with
This meta-analysis also investigated the transferability of the test preparation effect. A small number of included studies (
Moderator Analysis
We examined whether test-specific characteristics, test preparation–related factors, student-related factors, and study design characteristics explained the variance in effect sizes. Effect sizes for subgroups are presented in Table 2. Results of the meta-regressions are presented in Table 3.
Table 3. Results of meta-regressions
Test-specific characteristics
Nature and type of the test
According to the subgroup analysis, test preparation had a significant effect on high-stakes tests (
Subject domain
The average test preparation effects per subject domain have been listed in Table 2. The results revealed small differences between subject domains. The meta-regression results showed that compared with EFL tests (
Format of the test
The use of multiple-choice questions (MCQ) in the test did not lead to a significant difference regarding the effect of test preparation,
Test administration
The meta-regression indicated that the difference between modes of test administration was not significant (
Test preparation intervention design
Type of test preparation intervention
The subgroup analysis showed that after-school courses (
Breadth of test preparation intervention
Test preparation effects were moderated by the instructional breadth of test preparation intervention. When interventions targeted the same subject or area of the test, but their focus was not narrowed to the exact tested content, the effect was significant,
Strategy use
The results of the subgroup analysis of the effect of test preparation strategies used during the interventions on student test performance can be found in Table 2. The meta-regression results suggested that test preparation had a significantly larger effect when students were taught test-taking skills during the interventions (
Material use
The moderator analysis also examined whether different material usage in test preparation intervention resulted in the variation of effect sizes. Interventions where students used commercial workbooks had a larger effect on the improvement of test performance (
When test preparation interventions provided students with a timed practice test that simulated the real test condition, the effect (
Besides, two included studies reported three effect sizes for test preparation interventions delivered virtually via computer-based programs. The estimated effect was
Time and duration of the intervention
The meta-regression results showed that the number of contact hours was not a significant moderator of the effect,
Student-related characteristics
Education level
The test preparation effect was similar for tertiary students (
School type
The subgroup analysis showed that the effect of test preparation programs provided in private schools was significant (
Previous test experience
Among all included studies, 74% of cases reported whether participants had previous testing experience or test preparation experience on the specific test. The subgroup analysis showed that students with previous exposure to the test or test preparation seemed to benefit more from test preparation (
Study design characteristics
About 53% of included interventions used a randomized design. However, no statistically significant difference in estimated effect sizes between studies with random allocation and those without was identified in the meta-regression (
Sensitivity analysis
To assess the robustness of the findings in this study, we performed a sensitivity analysis. Initially, all studies were included in the meta-regression model. To control for potential bias, we subsequently conducted a follow-up analysis, including only studies with a low risk of bias as assessed by the RoB tools (
When the meta-regression models were re-run using only the low-risk-of-bias studies, the significance and direction of most regression coefficients remained unchanged (see Appendix A for details). However, the use of socio-affective strategies, which was a significant moderator in the full model, became nonsignificant (
Publication bias
According to the analysis, publication type, publication year, and sample size were not significant moderators, suggesting a low risk of publication bias. There were no significant differences between studies published as dissertations or research reports and those published in journals (see Table 3 for the meta-regression results). The risk of publication bias was also evaluated by inspecting the funnel plot (see Figure 5), which gives an impression of the relationship between observed effects and standard errors and reveals asymmetry. For this meta-analysis, no systematic relationship was indicated by the funnel plot. Egger’s regression test (

Figure 5. A funnel plot showing the relationship between standard error and observed effect size for the meta-analysis.
Discussion
The results of this meta-analysis show that test preparation has a statistically significant overall effect on test performance across a broad range of different domains and different tests in authentic educational contexts. Our findings suggest that the test preparation effect is robust across different test conditions, for both large-scale high-stakes and low-stakes tests, curriculum-based and non-curriculum-based tests.
The test preparation effect can be explained by four groups of explanations, as elaborated by Arendasy et al. (2016) and Lievens et al. (2007). First, test preparation may reduce construct-irrelevant factors, such as unfamiliarity and test anxiety, leading to an increase in test scores. Second, the development of test-wiseness and test-taking skills during test preparation increases the likelihood of solving test items correctly. Third, domain-specific knowledge or test-specific cognitive skills could be developed by test preparation. Fourth, generalizable latent cognitive ability and/or broad knowledge could be enhanced by test preparation intervention.
The moderator analysis of the observed test preparation effects provides some support for the first three explanations, as discussed in the following. However, the fourth model was not supported by our meta-analysis. Our finding suggests that the test preparation effect is unlikely to transfer to other tests, which is in line with previous studies (see Arendasy et al., 2016; Freund & Holling, 2011; Koenig et al., 2008) that found that preparing for a test by either coaching or repeated practice on sample items failed to improve general cognitive ability.
The first and second models of the test preparation effect were corroborated by this meta-analysis. The analysis shows that students who had some previous test or test preparation experience possibly benefited more from test preparation than those who had none, which indicates how increasing test familiarity can help students prepare for tests more effectively. We also found that test preparation interventions that included instruction or practice in test-taking skills had a significantly greater effect than those that did not. However, test-taking skills such as guessing, eliminating wrong answers, and time management may not influence the underlying construct the test examines; score increases of this kind might not represent improvement in that construct, thus undermining the validity of interpretations of test scores.
In this meta-analysis, the breadth of the test preparation intervention was categorized as narrow, specific, or broad, differentiating the alignment between the teaching/learning content and the test. Findings indicate that interventions with a specific test-related focus, and even those that solely drilled students on tested content, had a significantly larger effect on test performance than those that developed generalizable skills. Although test-oriented teaching and learning has received much criticism, our findings confirm its effectiveness in increasing test scores and provide some empirical support for the first and third models suggested by Arendasy et al. (2016), according to which increases in test scores can be attributed to test familiarization and to an increase in domain-specific knowledge or specific cognitive skills (Xie, 2013).
In fact, the framework of models of the test preparation effect was mainly built upon evidence of the retest effect on cognitive ability tests (Lievens et al., 2007). To some extent, the retest effect and the test preparation effect share similar mechanisms. For example, both retesting and test preparation allow students to familiarize themselves with the test and with test-taking skills. Additionally, retesting is an important and widely used form of test preparation. It is therefore plausible to explain the effects of test preparation using the framework of causes of retest effects. However, applying this framework directly to test preparation contexts has unavoidable limits due to inherent differences between retesting and test preparation. Compared to retesting, test preparation is a more complicated learning process that includes various cognitive and metacognitive components, in which students interact dynamically not only with test papers and relevant materials but also with their teachers, peers, and even their parents.
This meta-analysis identifies that using workbooks is a significant moderator of the test preparation effect, as is developing test preparation management strategies during test preparation intervention; both significantly benefited test performance. These findings cannot be explained by the current framework drawn from retest effects (Arendasy et al., 2016; Lievens et al., 2007). Instead, learning and cognition theories provide hints as to how to explain the findings.
A test workbook typically contains an introduction to the test structure, test format, sample test questions and answers, and techniques, as well as some practice tests and review of knowledge content (see
The meta-regression analysis of all included studies demonstrated that incorporating socio-affective strategies into the test preparation process was a significant moderator of test preparation effects. A possible explanation is that working collaboratively with teachers and peers allows students to share information about the test and gain feedback about their performance, which further optimizes their learning. The literature on collaborative learning shows that students can benefit from collaborative work, specifically for solving high-complexity tasks with greater cognitive load (see Kirschner et al., 2011; Retnowati et al., 2017). Besides, some studies suggested that the use of socio-affective strategies can ease students’ test anxiety (Saeidi & Khaliliaqdam, 2013), leading to an increase in test scores. However, in the meta-regression restricted to studies with low risk of bias, the effect of socio-affective strategies did not reach statistical significance. This reduction in significance suggests that the influence of socio-affective strategies on test preparation effects might have been exaggerated by methodological flaws or biases in the primary studies. Future research with higher-quality designs may provide more definitive conclusions about the role of socio-affective strategies in improving test preparation outcomes.
An unexpected but important finding of this meta-analysis is that sample items and timed practice tests were not significant moderators of the test preparation effect when analyzed across studies. In previous meta-analyses of the practice effect on cognitive ability tests (Hausknecht et al., 2007; Kulik et al., 1984; Scharfen et al., 2018), the use of sample items and practice tests showed a substantial impact on memory retention and the improvement of cognitive abilities. However, this meta-analysis finds little evidence of a practice effect. One reason might be that, compared to cognitive ability tests, educational tests assess not only cognitive abilities but also the understanding, evaluation, and reflection of knowledge, as well as problem-solving and critical thinking. This might also be why memorization strategy is not a significant moderator in our study: preparing for educational tests should be a comprehensive process that covers different skills beyond knowledge retention. Moreover, retest effects were robust only when identical test forms were used in retesting sessions, while the retest effect for alternative test forms was significantly smaller (Scharfen et al., 2018). In real educational contexts, it is rarely possible to practice test papers or items identical to those in the real test. We therefore infer that the practice effect may be smaller for educational tests. More quantitative and experimental studies are needed to test this inference.
Another unexpected finding of this study is that the effects of test preparation were significant for both high-stakes and low-stakes tests, with no difference between the two types. This result seems counterintuitive, as previous studies have shown that students are more motivated and try harder on high-stakes tests (Knekta & Sundström, 2019). One possible explanation is that the results were derived from (quasi) experimental studies in which participants were given a test preparation intervention and were required to prepare for the test regardless of its stakes. Such a design does not simulate real-world situations, where students might not study for, or might invest less effort in, low-stakes tests. Another reason could be self-selection bias: students who opted to participate in the (quasi) experiments might have been more motivated and better engaged than those who did not. Thus, whether the test was high-stakes or low-stakes did not affect students’ learning and hence did not moderate test preparation effects.
According to this meta-analysis, the effect of commercial test preparation courses did not significantly exceed that of additional test preparation courses provided by schools or universities. Besides, there was no significant difference in effectiveness between courses provided by private schools and public schools. Indeed, there has been a long-running debate on educational equity and commercial educational courses (shadow education) in previous literature (Bray, 1999; Zhang & Bray, 2020). Some researchers were concerned that students from low-SES groups might be disadvantaged because they cannot get equal access to commercial courses (Buchmann et al., 2010; Zhang, 2013). However, it is not clear from the current state of the literature that there is good evidence on how the quality of test preparation instruction varies, or whether test preparation activities are more prevalent in certain kinds of schools. More evidence from future research is needed.
Lastly, compared to test preparation courses (coaching) provided by teachers, students’ self-driven test preparation is under-investigated. Current experimental studies examined whether a specific form of instruction or teaching pattern was effective, treating test preparation as a teacher-driven activity while neglecting students’ potential and initiative. With more studies and better theory, it will be possible to identify the features and dimensions of students’ test preparation practices that are responsible for greater score gains.
Limitations
There were some limitations caused by the characteristics of some of the primary studies included in the meta-analysis. First, many studies provided insufficient information on the test preparation intervention, making it difficult to code some moderators. For example, most studies described the procedures of the test preparation intervention only briefly, without a detailed description of teachers’ instructions; therefore, we were unable to code the quality of instruction and did not include this factor in the moderator analysis. Second, some included studies evaluated the effect of a long-term test preparation intervention (for example, a course that met 2 hours per week for 30 weeks), and although they used random allocation of participants, it can be problematic to attribute score changes solely to the intervention without appropriate techniques for controlling other factors (such as other activities related to test preparation); thus, the effect sizes might be inflated.
Besides, the findings of the meta-analysis are primarily based on studies carried out on college admission tests in the U.S. context. Although the findings indicate that the magnitude of the effects in the United States was similar to that in other contexts, and the differences between different tests were not significant, some caution should be taken when generalizing the results to other testing contexts, especially to tests of other domains (e.g., science, medicine, foreign languages other than English) or to less investigated regions (e.g., the Middle East, Southeast Asia, and Africa).
There were also limitations in terms of the interpretation of the results. No solid theory has been developed in the test preparation area to analyze its process and explain its effect; we therefore interpreted the test preparation effect mainly through the framework of retesting and some theories from cognitive science, which might not fit well and may leave out other possible frameworks, such as self-regulation or instructional support. However, it is difficult to construct a valid framework for the test preparation effect based on studies without rigorous designs or precise descriptions of the teaching and learning strategies used for test preparation. More rigorous and thoroughly described experiments on specific material use and strategy use during test preparation are needed.
Conclusion
This meta-analysis provides evidence of a significant positive effect of test preparation on test performance, particularly for interventions that focused on relevant content and skills or drilled on tested content. It also highlights the moderating effects of workbook use, the encouragement of socio-affective strategies, and the development of test-taking skills. This study serves as a comprehensive quantitative synthesis of test preparation, contributing to the growing literature on the effectiveness of test preparation and advancing our understanding of its strengths and limitations. Further research on test preparation should report greater detail about teachers’ instructions and students’ practices during the test preparation process, thus providing additional insights into using test preparation as an effective learning and teaching activity.
