Digital educational applications (“apps”) are an increasingly appealing tool for promoting young children’s school readiness and basic literacy and math skills. In particular, apps that run on touchscreen tablets and smartphones are now a ubiquitous feature of children’s homes and schools. For example, a recent study on app usage in schools noted that there are over 2,500 education apps available to school leaders (S. Baker & Gowda, 2018), and the market for educational software is estimated in the billions of dollars in the United States (Richards & Stebbins, 2014). Similarly, parents are now confronted with an ever-increasing number of apps to improve children’s academic achievement; the number of educational and reference apps in Apple’s App Store has increased from 80,000 in 2015 to 200,000 in 2018 (Hirsh-Pasek et al., 2015; Pendlebury, 2018). More recently, the spread of the COVID-19 pandemic has ignited efforts by research and policy organizations to offer free and easy-to-use educational apps as a scalable strategy for helping young children acquire and maintain basic literacy and mathematics skills (U.S. Department of Education, 2020).
Despite the proliferation of educational apps designed for young children from preschool to Grade 3, effectiveness research on the causal impact of educational apps is in its infancy. Reviewing research on school-based educational apps, Haßler et al. (2016) concluded that “the fragmented nature of the current knowledge base, and the scarcity of rigorous studies, makes it difficult to draw firm conclusions” (p. 139). More specifically, because children use apps for diverse purposes, from watching YouTube to browsing the Internet to playing video games (Radesky et al., 2020; Xie et al., 2018), rigorous experimental designs are needed to isolate the causal effects of educational apps. Research over the past decade has focused on the potential and pitfalls of the medium (that is, touchscreen technologies) rather than on the content and quality of activities on interactive apps (Madigan et al., 2019; Wexler, 2019; World Health Organization, 2019).
This meta-analytic review focuses on a specific type of intervention—namely, educational apps designed to improve the literacy and mathematics skills of preschool to third-grade children—in order to quantify mean effects and to identify factors that may enhance or diminish their effectiveness (Guernsey et al., 2012; Haßler et al., 2016; Papadakis et al., 2018). Given the proliferation of apps targeting children ages 3 to 9 (American Academy of Pediatrics, 2016) and the importance of building foundational literacy and math skills necessary for future academic success (National Research Council, 2015; Yoshikawa et al., 2016), our review focused on studies of educational apps for preschool to Grade 3.
Defining “Educational Apps”
It is critical to define the term
Within the academic domains of literacy and math, educational apps can also target improvement in constrained or unconstrained skills from preschool to third grade (Lipsey et al., 2018; McCormick et al., 2020; Paris, 2005; Snow & Matthews, 2016). Constrained skills are often more sensitive to direct teaching interventions, have a ceiling, and are mastered by most children. For example, one-to-one tutoring, small-group instruction, and whole-classroom interventions typically have their largest impact on constrained skills such as letter knowledge, print awareness, and phonemic awareness in literacy and counting, sorting shapes, and simple sums in math (Pearson et al., 2020; Wong et al., 2008). In contrast, unconstrained skills include broader domains of knowledge and include outcomes like math problem solving and vocabulary.
What Is Known About the Effectiveness of Educational Apps?
Although children are spending more time on educational apps in both school and home contexts (Rideout & Robb, 2020), there is surprisingly little causal evidence about their effectiveness or the features that enhance or diminish their effectiveness. To date, there is mixed evidence that educational apps improve student outcomes. Although there is some evidence that educational apps can improve early-grade math skills (Schaeffer et al., 2018), a narrative review of apps for preschool-aged children concluded that “more large-scaled randomized trials of apps are needed” (Griffith et al., 2020, p. 11). One way to synthesize the existing research with timely and rigorous evidence is to use meta-analytic methods to combine results from small- to medium-sized experiments and quasi-experiments and to explore potential sources of treatment effect heterogeneity.
During the past 5 years, scholars in diverse fields such as developmental pediatrics, cognitive psychology, educational technology, and early education have published reviews of educational apps. As shown in Table 1, none of these previous review studies have attempted to conduct a meta-analysis that combines effect sizes from intervention studies or to explore how intervention, participant, or methodological factors explain variation in effects. A consistent conclusion in all the reviews is the need for more randomized experimental designs that provide stronger causal evidence regarding the effectiveness of educational apps and examination of the factors that moderate the effectiveness of educational apps on young children’s learning (Griffith et al., 2020; Hainey et al., 2016; McTigue et al., 2020).
Table 1. Findings From Recent Reviews of Educational Apps
Research Questions and Hypotheses
Both theoretical and empirical research drawn from the science of learning suggest interactive educational applications can support active, engaging, targeted, and varied practice (Bjork, 1994; Griffith et al., 2020; Hirsh-Pasek et al., 2015; Pashler et al., 2007). There are several potential mechanisms through which educational apps may improve student learning, including the medium, the context, and the affordances of gamified learning. First, touchscreen technologies do not require young children to have the fine-motor skills needed to use computer keyboards and the mouse (Flewitt et al., 2015; Kucirkova, 2014), making them an engaging medium and easy-to-use technology for young children. Second, educational apps are typically employed in one-to-one or small-group contexts that provide additional practice for students to master basic skills. Similar to tutoring interventions, apps may provide young children with more time on task and supplemental supports to master basic literacy and math skills (Nickow et al., 2020). Third, app designers are increasingly incorporating principles of gamified learning (Chou, 2016) such as learning goals, interactive activities, scaffolding, and rewards. Recent meta-analyses of digital games and gamified learning have shown medium-sized impacts on student learning and motivation outcomes (Clark et al., 2016; Sailer & Homner, 2020; Wouters et al., 2013). Importantly, educational apps may afford opportunities for developers to personalize learning by helping children and adults select appropriately leveled activities that support co-engagement with math content (Berkowitz et al., 2015). 
Although touchscreens, mobile devices, and computers that run educational apps are a ubiquitous feature of children’s homes and classrooms (Clarke, 2014; Rideout & Robb, 2020), no meta-analysis to date has examined the potential effects, noneffects, or adverse effects of educational apps on children’s academic skills or explored the sources of treatment heterogeneity.
What Are the Main Effects of Educational Apps on Literacy and Math Skills?
This meta-analytic review was motivated by two aims. Our first aim was to examine whether and to what extent educational apps produced positive and consistent main effects on preschool to Grade 3 students’ literacy and math outcomes. We hypothesized that educational apps would improve both literacy and math outcomes by providing targeted opportunities for children to practice and develop academic skills that supplement traditional instruction, particularly in school and classroom contexts. This hypothesis was based on meta-analytic reviews of one-to-one tutoring and small-group instruction provided by teachers, parents, or volunteers (Lipsey et al., 2012; Nickow et al., 2020) that demonstrate small to medium-sized effect sizes in literacy (ES = 0.35) and math (ES = 0.38).
What Study Characteristics Moderate the Effectiveness of Educational Apps?
Our second aim was to examine whether the effects of educational apps were moderated by methodological, participant, and intervention characteristics. Like other one-to-one tutoring and small-group interventions in the preschool and early elementary grades (Dietrichson et al., 2017), educational apps also vary along numerous methodological, participant, and intervention characteristics. Importantly, the average effect from a meta-analysis may conceal variability in treatment effects across studies. In particular, we explored the role of moderators that are well known to explain variation in effect sizes in educational and behavioral intervention research, including the type of outcome, type of control condition, participants’ grade level, and intervention dosage (Lipsey et al., 2012; Lipsey & Wilson, 1993). In addition, we examined the moderating role of intervention characteristics, particularly the quality of app activities and the type of skills they target (Hirsh-Pasek et al., 2015; McCormick et al., 2020).
Type of Assessment Outcome Measure and Control Group Activities
Prior research suggests that the type of outcome measure and control group activities moderate intervention impacts. In intervention studies involving preschool to Grade 3 children, average treatment effects are usually larger on researcher-developed measures that are closely tied to practice activities than on standardized achievement tests (Lipsey et al., 2012; Paris, 2005). In many ways, improvement on a standardized outcome measure provides an index of far transfer (Barnett & Ceci, 2002; National Academies of Sciences, Engineering, and Medicine, 2018), highlighting whether students have mastered a broad domain of transferable knowledge that is not overly aligned with intervention activities (Lipsey et al., 2012; R. Wolf et al., 2020).
In addition to the type of assessment outcome, primary studies often find that the nature of the counterfactual may influence the magnitude of mean effects. That is, when studies compare educational apps to an active placebo group rather than a passive group that is untreated, the magnitude of the treatment contrast in student outcomes may be attenuated (Griffith et al., 2020; Xie et al., 2018). For example, intervention studies of educational apps in math can include active placebo group activities where children in the control condition receive a literacy app (e.g., Berkowitz et al., 2015), or vice versa (e.g., Neuman, 2015). In an active placebo condition, there is a more rigorous test of the content of the app activities since both treatment and control students are completing educational activities utilizing the same medium.
Participants’ Grade
Next, we examined whether the effectiveness of educational apps depends on the grade level of participating students in light of correlational research that paints a mixed portrait of whether educational apps, in particular, and screen time, in general, can help or hurt young children’s academic achievement. Past research has focused on highlighting the effects, non-effects, and potential adverse effects of screen time and app usage with young children and has typically focused on either preschool (e.g., Griffith et al., 2020) or K–12 students (e.g., Cheung & Slavin, 2012). To our knowledge, no studies have attempted to compare mean effects for preschool and school-aged children. For example, some large-scale correlational studies have suggested that excessive screen time may have unintended negative consequences on young children’s language and literacy development, communication skills, and socioemotional and health outcomes (Hutton et al., 2020; Madigan et al., 2019). In other words, the quality of the activities that children participate in may matter as much as the amount of time using mobile or interactive technologies (American Academy of Pediatrics, 2016). Accordingly, some policymakers (World Health Organization, 2019) have recommended that caregivers of preschool-aged children (3–4 years old) provide no more than 1 hour of sedentary screen time and that the use of high-quality apps ideally promote shared use and high-quality language interactions.
On the other hand, some scholars have argued that young children can thrive in a digital world where screen time and apps are a normal feature of daily life in school and home (Shapiro, 2018). A synthesis that focused on the effects of touchscreen devices found more promising evidence that young children could benefit from touchscreen devices but did not attempt to isolate the particular effects of educational apps on student achievement outcomes (Xie et al., 2018). A question that has yet to be explored is whether the effectiveness of educational apps depends on the participants’ grade level. Therefore, we examined whether educational apps would be more or less effective for children in preschool versus kindergarten to Grade 3.
Intervention Dosage
An important malleable factor under the control of app designers and researchers is the amount of time that children are expected to work on an educational app. Existing research provides mixed findings on the relationship between intervention dosage and student outcomes. For example, meta-analytic evidence from tutoring studies involving one-to-one and small-group instruction has revealed limited differences in mean effects based on varying measures of intervention dosage such as the number of days per week or the total number of weeks that programs are offered to students (Nickow et al., 2020). The relationship between app usage on mobile and interactive technologies and student outcomes remains suggestive because findings are largely informed by nonexperimental research. For example, some correlational evidence indicated that more screen time may predict lower student achievement scores for both younger and older students (Hutton et al., 2020; World Health Organization, 2019), but correlational and survey research does not provide direct evidence on the causal effects of time spent using educational apps on student learning (Kris, 2015; Livingstone, 2016; Rideout, 2017).
Quality of App Activities and the Skills They Target
Importantly, there is growing evidence that educational apps must include high-quality activities that rest on research-based principles for improving learning more generally. In particular, educational apps should foster (a) active, engaged, and meaningful learning, supported by high-quality social interactions and clear learning goals (Hirsh-Pasek et al., 2015), and (b) deliberate practice that is focused, is active, includes regular feedback, and interleaves varied activities across different contexts (Bjork, 1994; Pashler et al., 2007).
Notably, researchers and developers have begun to develop apps that incorporate principles on how people learn and tested their efficacy in real-world settings. Berkowitz et al. (2015) conducted a randomized controlled trial (RCT) of the
Method
Selection Criteria and Literature Search Procedures
The studies included in our review met the following five selection criteria. Each included study had to (a) evaluate the effects of an interactive educational app, (b) include an outcome measure of math or English language literacy skills, (c) provide sufficient empirical information to calculate an effect size, (d) include students from preschool to Grade 3 (approximately ages 3–9), and (e) use an experimental or quasi-experimental design to compare the postprogram performance of treatment students to control students who participated in either an active placebo or passive control group activity. We excluded studies using single-group pretest-posttest designs because they fail to protect against most threats to internal validity (Shadish et al., 2002).
To identify primary studies, we searched (a) electronic databases and targeted internet sites, (b) reference lists of previous research syntheses, and (c) ancestral searches based on reference lists of included articles. Because the original iPhone was released in 2007, followed by Apple’s App Store and Google Play in 2008, we limited our search to studies published in English from January 2008 to June 2020.
Electronic Databases
Figure 1 displays a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) screening flowchart describing our literature searching procedures. To identify published and unpublished studies, we searched electronic databases (Academic Search Premier, PsycINFO, Education Source, ProQuest Dissertations and Theses, Web of Science) and identified an initial sample of 306 studies. We also conducted searches of the gray literature by hand-searching abstracts from annual meetings of the Society for Research on Educational Effectiveness and the What Works Clearinghouse’s reviews of early literacy and math intervention studies. A full list of keywords for our searches is available in the online supplemental materials (Appendix 1). During the screening phase, we removed 78 duplicates, 149 studies that failed to meet inclusion criteria based on our review of the titles and abstracts, and 48 studies after we reviewed the full-text articles. An initial sample of 31 included studies published from January 2008 to June 2019 was identified. We replicated this search process to update the review through June 2020 and found five additional studies that contributed to the final sample of 36 studies.

Figure 1. Visual representation of the literature search and inclusion results.
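As a quick consistency check, the screening counts reported for the literature search can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, and the numbers are the ones stated in the text.

```python
# Counts as reported in the search description; names are illustrative only.
identified = 306
removed = {
    "duplicates": 78,
    "title_abstract_screen": 149,
    "full_text_review": 48,
}

# Studies remaining after the three screening steps (search through June 2019).
initial_sample = identified - sum(removed.values())

# Studies added when the search was updated through June 2020.
added_in_update = 5
final_sample = initial_sample + added_in_update

print(initial_sample, final_sample)  # 31 36
```

The tally reproduces the reported initial sample of 31 studies and the final sample of 36.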
Procedures for Coding Studies
We developed a codebook to extract information from each of the 36 studies. The codebook was based on previous meta-analytic reviews of literacy intervention studies and prior research indicating the factors that would influence student outcomes (Durlak et al., 2011; Guo et al., 2020; J. S. Kim & Quinn, 2013; Lipsey et al., 2012; Marulis & Neuman, 2013). In particular, we coded for the content of educational app (math or literacy) and key methodological, participant, and intervention characteristics.
Table 2 indicates that over 90% of the studies were published in the past 5 years (2015–2020), suggesting this review is providing the most updated information on the effectiveness of educational apps. Most studies were published in peer-reviewed journal articles (67%), employed RCT designs (92%), and were closely split between preschool (42%) and K–3 samples (58%). In terms of discrete grade level, 42% were preschool, 19% in kindergarten, 17% in Grade 1, 8% in Grade 2, and 6% each in Grade 3 and mixed grades. There was also substantial variability in the mean quality of educational apps (
Table 2. Descriptive Characteristics of Included Studies (N = 36)
Table 3. Descriptive Characteristics of Each of the 36 Educational Apps
Methodological Moderator Variables
Two critical methodological features that influence student outcomes are the type of outcome measure and counterfactual activities. First, because prior syntheses provide substantial evidence that mean effects would be larger for measures developed by study authors than for standardized outcome measures (R. Wolf et al., 2020), we dichotomously coded for the type of outcome used in the study. Researcher-developed outcomes were aligned with the intervention activities and measured narrower domains of knowledge on specialized topics. In contrast, standardized outcomes were less aligned with the intervention and assessed broader domains of transferable knowledge (Kraft, 2019; Lipsey et al., 2012). Second, as described earlier, we coded for whether studies used an active placebo group (where control students completed activities on an app targeting a different domain) or a passive, “business-as-usual” control group. Approximately one half of the studies used standardized outcomes and one quarter used active placebo control groups.
Participant Moderator Variables
To compare mean effects by grade level, we coded for the grade level/age of participating students and created a dichotomous code indicating whether the sample was preschool- or school-aged (there were no studies that included both prekindergarten and older children).
Intervention Moderator Variables
To determine the moderating role of intervention features, we coded for three features. First, we coded for intervention dosage, or the amount of app usage. Based on prior research (Marulis & Neuman, 2013), we coded for (a) frequency, that is, the total number of sessions during the intervention; (b) intensity, that is, the length of each session in minutes; and (c) duration, that is, the length of the intervention from beginning to end. On average, studies included 32 sessions at about 21 minutes per session over the course of 87 days.
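To make the three dosage codes concrete, the sketch below combines frequency, intensity, and duration into rough summary quantities. It is a back-of-the-envelope illustration only: the function name and output keys are hypothetical, and plugging in the sample averages gives an approximate picture of a typical study, not a statistic reported in this review (the product of averages is not the average of per-study products).

```python
def dosage_summary(sessions, minutes_per_session, duration_days):
    """Combine the three dosage codes into rough summary quantities.

    sessions            -> frequency (total number of sessions)
    minutes_per_session -> intensity (length of each session)
    duration_days       -> duration (days from start to end)
    """
    total_minutes = sessions * minutes_per_session
    return {
        "total_minutes": total_minutes,
        "total_hours": total_minutes / 60,
        "sessions_per_week": sessions / (duration_days / 7),
    }


# Sample averages from the text: 32 sessions, about 21 minutes each, 87 days.
typical = dosage_summary(sessions=32, minutes_per_session=21, duration_days=87)
print(typical["total_minutes"])  # 672
```

Under these averages, a typical intervention amounts to roughly 11 hours of app use spread over about two to three sessions per week.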
Second, we created an overall app quality score to assess the extent to which an educational app fostered learning that was active, engaging, meaningful, socially interactive, and had clear learning goals (Hirsh-Pasek et al., 2015). We downloaded each educational app in our review where possible and rated the following five criteria: (a) Do the activities promote active learning? (active), (b) Do the activities promote engaging learning? (engaging), (c) Do the activities promote meaningful learning? (meaningful), (d) Do the activities promote social interactions between children and adult caregivers? (social interaction), and (e) Do the activities have clear learning goals that foster educational aims? (learning goals). Each dimension was scored as low (e.g., app has no well-defined learning objective and is purely for entertainment), moderate (e.g., app has a vague literacy or math learning objective), or high (e.g., app has a clearly defined learning objective). When apps were not available for download, we used YouTube videos of app demonstrations or narrative descriptions of app functionality included in the research articles or supplementary online materials provided by app developers to assess the skills measured by the app. The mean quality score in our sample was 2.4. Details on the scoring rubric are included in the online supplemental materials (Appendix 2).
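One way such an overall quality score could be computed is sketched below. The numeric mapping (low = 1, moderate = 2, high = 3) and the simple average across the five dimensions are assumptions made for illustration; the actual scoring rubric is the one in Appendix 2, which is not reproduced here. The example ratings describe a hypothetical app, not one from the review.

```python
from statistics import mean

# ASSUMED numeric mapping; the real rubric is in the supplemental Appendix 2.
LEVELS = {"low": 1, "moderate": 2, "high": 3}

# The five criteria named in the text.
DIMENSIONS = (
    "active",
    "engaging",
    "meaningful",
    "social_interaction",
    "learning_goals",
)


def quality_score(ratings):
    """Average the five dimension ratings for one app (assumed scoring rule)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return mean(LEVELS[ratings[d]] for d in DIMENSIONS)


# Hypothetical app used only to show the calculation.
example_app = {
    "active": "high",
    "engaging": "high",
    "meaningful": "moderate",
    "social_interaction": "moderate",
    "learning_goals": "high",
}
score = quality_score(example_app)  # (3 + 3 + 2 + 2 + 3) / 5 = 2.6
```

Averaging keeps the score on the same 1 to 3 scale as the individual dimensions, which is consistent with a sample mean of 2.4, although other aggregation rules are possible.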
Third, we coded for the skills targeted and measured by primary researchers. Using previous coding systems (McCormick et al., 2020; Snow & Matthews, 2016), we coded both the skills targeted by the app and the skills that were measured by the outcomes used in the study. Constrained skills in both literacy and math can be improved by direct teaching, have a ceiling, and are mastered by most typically developing children. In contrast, unconstrained skills develop over time, require more varied experience, and are critical to higher order and more complex tasks like math problem solving and reading comprehension.
Publication Bias
One challenge of the vastly expanding market of apps and relevant research is that some high-quality studies that met our criteria and addressed our research questions may not be published through peer-reviewed academic channels. These alternative sources are known as “gray literature” (Marsolek et al., 2018) and can present issues for creating a truly systematic review of the literature. We therefore tested for publication bias and file drawer effects by (a) testing a moderator effect of a dichotomous published peer-reviewed article indicator, (b) using a trim and fill analysis (Duval & Tweedie, 2000), and (c) plotting a cumulative meta-analysis forest plot (Borenstein et al., 2009).
Coder Reliability
We created a codebook to collect information from each study and developed a procedure for estimating the reliability of the study codes. Two raters coded all moderator variables in our sample of 36 studies. Kappa coefficients adjust for chance agreement between raters and the mean kappa was
Analytic Strategy
Calculation of Effect Sizes
To conduct a meta-analysis of continuous outcomes such as math and literacy achievement scores, we computed a standardized mean difference, or effect size. For each study, we computed Hedges’s
Robust Variance Estimation
First, we used robust variance estimation (RVE) to adjust standard errors to account for the correlation among effect sizes within studies. RVE allows syntheses to avoid a loss of information resulting from computing an aggregated, within-study average effect size. Following Tanner-Smith and Tipton (2014, p. 17), we applied RVE to our data set by computing weights for effect size
Using Aggregated Effects to Supplement RVE Analyses
Because studies of educational apps vary along a number of dimensions and because we were interested in making inferences back to the population of studies from which our studies were sampled, we used a random effects model to pool the study-specific effect sizes and to generate an aggregated effect size (DerSimonian & Laird, 1986). The random effects model includes both a within-study weight (inverse of the study variance) and a between-study variance component. We made an a priori decision to employ a random effects model, because we expected that the dispersion of effect sizes would reflect true variance in mean effects.
In our data set, the most common dependency among effect sizes within studies involved correlated effects, which arises when multiple effect size estimates measure a single construct. Therefore, in addition to RVE analyses, we created an aggregated mean effect (i.e., a single average effect size for each study) to synthesize mean effects and to assess the robustness of results across two analytic methods. Using an aggregated mean effect size allowed us to maintain the assumption of statistical independence and to report heterogeneity statistics that are not available with RVE.
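The aggregated analysis described above can be sketched in two steps: collapse each study's correlated effect sizes into one mean effect, then pool the study-level effects with a DerSimonian and Laird random effects model. The code below is a minimal illustration under stated assumptions, not the analysis code used in this review. In particular, the function names are hypothetical, and the common within-study correlation `rho` is an assumed input (the variance formula for the mean of correlated effects follows Borenstein et al., 2009).

```python
import math


def aggregate_study(effects, variances, rho=0.8):
    """Average correlated effect sizes from one study.

    rho is an ASSUMED common correlation among the study's effect sizes.
    Returns (mean effect, variance of the mean effect).
    """
    m = len(effects)
    mean_g = sum(effects) / m
    sds = [math.sqrt(v) for v in variances]
    total = sum(variances)
    for i in range(m):
        for j in range(m):
            if i != j:
                total += rho * sds[i] * sds[j]
    return mean_g, total / m ** 2


def dersimonian_laird(effects, variances):
    """Pool one effect per study with a random effects model."""
    k = len(effects)
    w = [1.0 / v for v in variances]            # within-study weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)          # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "q": q, "i2": i2}


# Hypothetical data: each study contributes several (effect, variance) pairs.
studies = {
    "study_a": ([0.20, 0.40], [0.04, 0.04]),
    "study_b": ([0.10], [0.02]),
    "study_c": ([0.55, 0.35, 0.45], [0.05, 0.05, 0.05]),
}
per_study = [aggregate_study(g, v) for g, v in studies.values()]
result = dersimonian_laird([g for g, _ in per_study], [v for _, v in per_study])
```

Because each study contributes a single aggregated effect, the pooled estimate rests on statistically independent inputs, and the same calculation yields the Q statistic, the between-study variance, and I-squared as heterogeneity summaries.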
Measures of Heterogeneity
The meta-analysis of aggregated mean effects allows us to report the
Sensitivity Analyses
To assess the sensitivity of our findings, we begin by reporting meta-analytic results from unconditional models that report combined impacts on overall achievement and separately for literacy and math. Next, we examine whether our results are replicated after controlling for whether the study (a) was published in a peer-reviewed journal, (b) used an RCT design, (c) used an active control group, and (d) used a standardized outcome assessment. We used the trim and fill method to assess the potential impact of missing, unpublished studies on mean effects (Duval & Tweedie, 2000).
Results
Main Effects of Educational Apps on Literacy and Math Skills
To address our first research aim, we used RVE and aggregated random effects models to synthesize findings from 36 studies and 285 effect sizes. As shown in Table 4, the RVE yielded a positive Hedges’s
Table 4. Results of Estimating Unconditional Meta-Regression Model with RVE and Aggregated Mean Effects
These mean effect sizes, however, mask substantial treatment effect heterogeneity. As shown by the overall results of the aggregated effects model in Table 4, the
Moderators of Educational App Effectiveness
Main Effects of Educational Apps Controlling for Between-Study Methodological Factors
The meta-regressions reported in Table 5 highlight potential methodological factors that moderated the impact of educational apps on student outcomes. Controlling for whether studies were published in peer-reviewed journals, used an RCT design, had active control groups, and targeted a literacy or math domain, the RVE meta-regression results indicate that studies using standardized outcomes produced impacts that were, on average, about 0.42 standard deviations (
Table 5. Results of Meta-Regression Model with RVE Controlling for Publication Status, Experimental Design, Control Group Activities, and Type of Assessment Outcome
Within- and Between-Study Moderators of Educational App Effectiveness
Table 6 displays a series of RVE meta-regression models that isolate the relationship between each respective participant and intervention moderator variable controlling for the type of outcome assessment. We controlled for whether outcomes were assessed with standardized measures in all models because it was a strong moderator of mean effects. Because standardized outcomes varied both within and between studies, we modeled separate within and between effects by including the study-mean-centered covariate along with the study mean value. Thus, all subsequent models include the controlled effects of participant and intervention characteristics. In Model 1, there was a statistically significant association between the mean effects and participants’ grade, indicating that effects were 0.18 standard deviations higher, on average, in studies involving preschool-aged children than in studies involving kindergarten to Grade 3 children. Moving to Model 2, there was inconsistent evidence that intervention dosage measures were related to effect sizes. In particular, the meta-regressions indicated that log2 duration and log2 intensity measures did not predict outcomes once the assessment type was included in the model. Model 3 indicated that app quality ratings were not significantly associated with mean effect sizes, controlling for the type of assessment outcome. Furthermore, Model 4 indicated that the type of skill assessed by the apps was significantly associated with mean effects controlling for the type of assessment outcome.
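The within and between decomposition of the standardized-outcome covariate can be illustrated with a short sketch. For each effect size, the between-study term is the study's mean on the indicator, and the within-study term is the effect size's deviation from that mean. The record layout, field names, and data below are hypothetical and serve only to show the centering step.

```python
from collections import defaultdict


def split_within_between(rows, key="standardized"):
    """Add study-mean (between) and study-mean-centered (within) covariates."""
    by_study = defaultdict(list)
    for row in rows:
        by_study[row["study"]].append(row[key])
    study_means = {s: sum(vals) / len(vals) for s, vals in by_study.items()}
    return [
        {
            **row,
            f"{key}_between": study_means[row["study"]],
            f"{key}_within": row[key] - study_means[row["study"]],
        }
        for row in rows
    ]


# Hypothetical effect-size records: 1 = standardized, 0 = researcher-developed.
records = [
    {"study": "A", "standardized": 1},
    {"study": "A", "standardized": 0},
    {"study": "B", "standardized": 1},
    {"study": "B", "standardized": 1},
]
decomposed = split_within_between(records)
```

In this toy example, Study A mixes both outcome types, so its within-study term varies (+0.5 and -0.5), whereas Study B uses only standardized outcomes and contributes solely to the between-study contrast. Entering both terms in the meta-regression lets the within-study coefficient compare outcome types inside the same study, holding study-level differences constant.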
Table 6. RVE Results With Intervention Characteristics as Moderators, Controlling for Type of Assessment Outcome
After removing nonsignificant moderators of mean effects, we fit a final Model 5 that included participant grade, the type of outcome assessment (within and between studies), and the skills measured by educational apps. Importantly, the results of Model 5 indicated that educational apps produced mean effects that were approximately 0.17

Figure 2. Predicted differences in effect size.
Sensitivity Analyses for Publication Bias
We examined the effects of study design to explore the potential role of publication bias in two sensitivity analyses using the aggregated mean effect size per study as the unit of analysis. We assessed publication bias using the trim and fill analysis as shown in Figure 3 (Duval & Tweedie, 2000). There were six imputed study results that were missing on the left of the funnel plot that represented potentially unpublished studies with smaller mean effects. Imputing these six mean effects yielded a mean effect size of

Figure 3. Trim and fill plot.

Figure 4. Cumulative forest plot of aggregated mean effects from 36 studies.
Discussion
Educational apps are an increasingly ubiquitous feature of young children’s lives at home and school, yet little is known about their effectiveness or the factors that diminish or enhance their impact on student achievement outcomes. Moreover, there is an urgent need to understand what works, for whom, and under what conditions as educators increasingly turn to easy-to-use technology interventions to support children’s early literacy and math learning during the school closings triggered by the COVID-19 pandemic. To improve the rigor and relevance of the research base on educational apps, we undertook this meta-analysis of preschool to Grade 3 educational apps in math and literacy to advance two research goals. First, we examined the mean effects of 36 educational apps on preschool to Grade 3 children’s math and literacy outcomes to quantify the extent to which apps improve student outcomes. Second, we examined the degree to which the effectiveness of educational apps was moderated by several methodological, participant, and intervention characteristics.
What Are the Main Effects of Educational Apps on Literacy and Math Skills?
In the domains of math and literacy, there was convergent evidence that educational apps improved student achievement outcomes relative to a counterfactual condition in which children participated in typical school instruction or received a placebo control activity. Meta-analytic results based on RVE yielded medium-sized impacts in literacy (ES = 0.35) and math (ES = 0.29). The magnitude of these impacts is similar to the effects of tutoring interventions (ES = 0.37) and early elementary literacy interventions (ES = 0.39) based on recent meta-analyses of experimental and quasi-experimental intervention studies (Gersten et al., 2020; Nickow et al., 2020). Furthermore, the mean effects from our meta-analysis are consistent with recent meta-analyses of digital games and gamified learning interventions (Clark et al., 2016; Sailer & Homner, 2020; Wouters et al., 2013). These findings support theories that emphasize the affordances of educational apps in promoting active learning, deliberate practice, and gamified learning in one-to-one or small-group contexts to improve basic literacy and math skills (Griffith et al., 2020; Hirsh-Pasek et al., 2015).
Despite these promising findings, the 36 educational apps were unique in several ways. In particular, our meta-analysis included mostly high-quality apps that incorporated principles on how people learn, and the activities in turn were designed to promote learning that was active, engaging, meaningful, interactive, and focused on a clear learning goal (Hirsh-Pasek et al., 2015). For example, the educational apps in our review scored highly on all these dimensions (
What Study Characteristics Moderate the Effectiveness of Educational Apps?
The potential pitfall of the current research base, however, is that the mean effects paint an overly simplistic and optimistic assessment of the value of educational apps. In many ways, the mean effect size overall, and separately for math and reading, masks true variance in mean effects. Given the dispersion in mean effects across studies, what were the key sources of treatment effect heterogeneity?
First, there was clear evidence that outcome measures matter. Whether a primary study used a researcher-designed or standardized outcome emerged as the most powerful moderator variable. For example, mean effects were nearly 0.28
Second, meta-regressions that controlled for the type of assessment outcome revealed two additional moderators of app effectiveness. Controlling for assessment type and participant grade, measures of constrained skills produced mean effects that were 0.17
Third, meta-regression results indicated that mean effects were larger in studies involving preschool-aged children than kindergarten to third-grade children. The magnitude of the preschool advantage over the K–3 grades is also consistent with recent meta-analytic results comparing effect sizes from RCTs of educational interventions with preschool- versus school-based children (Kraft, 2019; Lipsey et al., 2012). Although it is beyond the scope of this review to fully explain these findings, several hypotheses merit further scrutiny. One important hypothesis is that preschool-aged children may benefit from educational apps that foster joint attention and co-engagement in school contexts. Importantly, the majority of the studies involving preschool-aged children (87%, 13 of 15) were implemented in school contexts. This finding raises the question of whether and how co-engagement can enhance the effectiveness of educational apps in preschool center–based contexts (McTigue et al., 2020). During the preschool years, the essential condition for language development among young children is co-engagement and joint attention (Taylor, 2016), yet there is growing concern that mobile devices foster solo rather than co-use (through adult support) of educational activities that support literacy and math skills on interactive mobile technology (Bus et al., 2015; Radesky et al., 2014; M. Wolf, 2018). However, primary studies that directly compare the effectiveness of apps used in school and home contexts are needed given the increasing use of apps in children’s nonschool contexts (Rideout & Robb, 2020).
Finally, our meta-analytic results revealed no consistent association between intervention dosage and mean effect size across studies. These results are broadly consistent with reviews of recent intervention research finding no consistent association between measures of treatment dosage and student outcomes (Kraft et al., 2018; Lynch et al., 2019). How do we explain these findings? Most likely, our measure of quantity reflects the lower limit of the time students spend on educational apps intended to run on mobile technology. That is, surveys of children’s device usage report substantially higher levels of screen time than intervention studies, in which the time on an app is tightly controlled for the duration of the study. In other words, intervention research may not capture associations between the quantity of app usage and outcomes because the maximum is constrained by the end point of the study. Direct measures of children’s app usage that leverage data from the mobile technology are needed to provide more accurate estimates of the time children spend using apps (Radesky et al., 2020; Roberts et al., 2016).
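One simple way to probe a dosage and effect size association is an inverse-variance weighted regression of study-level effect sizes on dosage. The sketch below is a simplified stand-in, not the RVE meta-regression used in this review; the function name, the dosage unit (hours), and the toy values are assumptions introduced only for illustration.

```python
def weighted_dosage_slope(dosages, effects, variances):
    """Inverse-variance weighted least squares of effect size on dosage.

    Returns (intercept, slope). A slope near zero indicates no linear
    association between dosage and effect size in the supplied studies.
    """
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)

    # Weighted means of the predictor and the outcome.
    x_bar = sum(w * x for w, x in zip(weights, dosages)) / total_w
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total_w

    # Weighted cross-products and sums of squares.
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, dosages, effects))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, dosages))
    if sxx == 0:
        raise ValueError("Dosage does not vary across studies.")

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy, made-up values (hours of app use, effect size, sampling variance).
toy_intercept, toy_slope = weighted_dosage_slope(
    [10, 20, 30], [0.2, 0.3, 0.4], [0.05, 0.05, 0.05]
)
```

Because dosage in intervention studies is capped by the study's duration, a flat slope in such a regression would not rule out an association at the higher usage levels reported in surveys.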
In addition to measuring relations between dosage and app effectiveness, we sought to measure the quality of the activities using principles from the learning sciences (Hirsh-Pasek et al., 2015; Pashler et al., 2007). These null findings, however, are inconclusive given the small number of studies in our review and the fact that most apps in our meta-analysis attempted to include research on how people learn. Accordingly, researchers should aim to pinpoint which aspects of activity, engagement, social interactions, and meaningfulness are most critical for supporting student learning outcomes. A strength of our review was the attempt to code five dimensions of app quality, which clearly indicated that few educational apps foster social interactions between children and their adult caregivers. Thus, one implication of our review is that quality depends on both the specific activities in an educational app and the rigor of the experimental design. For example, the
Limitations
Findings from this study highlight limitations that should inform future research. First, intervention studies have generated intent-to-treat estimates of the causal impact of offering children and parents an opportunity to use an educational app on a narrow set of outcomes. To our knowledge, no study has attempted to connect measures of children’s actual engagement in app activities to a wider set of outcomes. To improve the research base, more fine-grained measurement could inform practical recommendations about app use in particular and help shed light on potential adverse effects on children’s social, emotional, and attentional outcomes. For example, descriptive research indicates that apps are simply one type of activity available to young children as they use smartphones and tablets to watch YouTube and play video games (Radesky et al., 2020). Direct measures of children’s engagement (D’Mello et al., 2017), that is, their actual time on task, their accuracy in completing digital activities, and their task orientations, are needed to connect the active ingredients in an educational app to a broader set of cognitive, behavioral, and motivational outcomes.
Second, there is a dearth of effectiveness research done at scale, which limits the external validity of our findings. Current research on educational apps is lagging behind the push to scale up easy-to-use remote learning interventions, which often include educational apps. For example, the U.S. Department of Education’s Institute of Education Sciences has curated a website of over 100 apps that can be accessed for free (U.S. Department of Education, 2020), yet no evidence of effectiveness is provided for educators and parents. Notably, only six of the 36 educational apps in our study would meet ESSA (Every Student Succeeds Act) Tier I standards, which require evidence from at least one well-executed RCT with more than 350 students.
Finally, we used stringent inclusion criteria to include only experimental and quasi-experimental evaluations of educational apps. Although our sensitivity analyses suggest that findings are robust to potential publication bias, there is a clear need to build on our review by casting a broader net that captures a wider set of unpublished studies. For example, we did not include evidence from the growing number of correlational studies that examine relationships between app use and student achievement. Leveraging school district administrative data involving 258 apps used by over 390,000 students, S. Baker and Gowda (2018) found that the average correlation between the amount of time students spent on educational apps and their achievement was .01 on math assessments and .00 on English language arts assessments. Although these results do not address selection bias, they underscore the need to combine descriptive, correlational, and experimental evidence in answering a broader set of questions and concerns facing decision-makers. In many ways, our study represents a first attempt to build a stronger foundation of evidence for understanding whether and how educational apps deployed in real-world contexts can support student learning.
Conclusion
The purpose of this review was to synthesize results from experimental and quasi-experimental studies evaluating the impact of educational apps on children’s math and literacy skills and to identify study characteristics that moderated mean effects. Although educational apps have positive effects on children’s math and literacy skills, the effects are larger in studies involving preschool-aged children rather than kindergarten to Grade 3 children, studies using researcher-developed outcomes rather than standardized outcomes, and studies measuring constrained skills rather than unconstrained skills. Our findings suggest that the next generation of research on educational apps needs to improve both the internal and external validity of findings, evaluate effectiveness at much larger scale, use multiple outcome measures of student learning, and determine whether apps confer lasting benefits on a wider range of skills. Some efforts are currently under way to improve the quality of research (Molnar, 2020), but the marketplace for education apps remains “chaotic and unregulated” (Papadakis et al., 2018, p. 156). Although apps are increasingly advertised to educators and parents as a low-cost and scalable strategy for improving learning, apps are not uniformly high quality and are rarely evaluated rigorously by independent researchers.
More collaborative research across disciplinary silos is clearly needed to address these research gaps. In particular, we encourage learning scientists to incorporate principles on how people learn into the design of high-quality activities in literacy and math apps, developmental pediatricians and psychologists to examine how the quantity and quality of adult mediation in school and home contexts support learning, and intervention researchers to explore whether direct measures of engagement mediate the effects of educational apps on student outcomes (Cherner et al., 2014; Griffith et al., 2020; McTigue et al., 2020; Papadakis et al., 2018; Radesky et al., 2020). In short, we encourage the field to move beyond the broad question—do apps work—to the more targeted question: How and under what conditions do high-quality educational apps support children’s early literacy and math skills? The limitations of this meta-analytic review highlight the collaborative research needed to shed light on this question.
Supplemental Material
sj-docx-1-ero-10.1177_23328584211004183 – Supplemental material for Measures Matter: A Meta-Analysis of the Effects of Educational Apps on Preschool to Grade 3 Children’s Literacy and Math Skills
Supplemental material, sj-docx-1-ero-10.1177_23328584211004183 for Measures Matter: A Meta-Analysis of the Effects of Educational Apps on Preschool to Grade 3 Children’s Literacy and Math Skills by James Kim, Joshua Gilbert, Qun Yu and Charles Gale in AERA Open