Abstract
Introduction
In 2012, the National Collegiate Athletic Association (NCAA) approved using a four-team postseason playoff tournament to determine a national champion in college football at the FBS level (the highest level of intercollegiate competition), starting in 2014. The decision effectively doubled the number of teams involved in the postseason tournament, and there was immediate discussion, which continues today, about whether an 8-team tournament (or larger) would be even better. In this paper, we address the question of the optimal size of this postseason tournament.
Each year since at least 1936, a national champion has been chosen among FBS (formerly Division I-A) college football teams. Initially, the champion was chosen by polls of experts. These expert polls included the Associated Press (AP) poll of sportswriters (the first major national poll) from 1936 to 1997, and various polls of coaches from 1950 to 1997 (e.g., United Press International from 1950 to 1990, USA Today/CNN from 1991 to 1996, and USA Today/ESPN in 1997). The highest-ranked team in the polls was declared national champion; in the few years that the AP and coaches’ polls disagreed, the national championship was shared between the two polls’ winners (National Collegiate Athletic Association). In 1998, a new system, the Bowl Championship Series (BCS), was created. In the BCS, expert poll rankings and analytical rankings (called “computer rankings”) were combined to determine a ranking of the top teams. From 1998 to 2005, the BCS-designated national champion was unofficial (and for the 2003 season, the AP poll chose a different champion). From 2005 to 2013, the top two teams in the BCS played in a designated postseason game, with the winner being named the official national champion (National Collegiate Athletic Association). Beginning in 2014, a new system has been in place: A panel of experts selects four teams to play a three-game, two-round single-elimination tournament, the winner of which is named national champion.
The progression from polls (effectively a one-team tournament) to the BCS championship game (a two-team tournament) to a four-team tournament has been motivated partly by economics (the paid attendance and television revenue of playoff games are substantial), but even more so by a grassroots feeling that it is necessary to determine a champion “on the field” (i.e., by playing games) rather than by poll or computer, because neither voters nor algorithms are guaranteed to identify the absolute best team(s). Before the BCS system, the top teams in the polls rarely played each other at the end of the season, and it could be difficult to differentiate between the best teams. The novelty of the BCS system was that the two highest-ranked teams were guaranteed to play each other at the end of the season, and the argument in favor of having even more teams in the championship tournament is that even the top two can be difficult to differentiate from a larger set of very good teams, so letting the teams play each other is the fairest way to sort out which one is really the best. (Even in the BCS system, there could be significant disagreement as to the selection of the two playoff teams, for example in 2004 when three major-conference teams (USC, Oklahoma, and Auburn) each won all of their regular-season games.)
On the other hand, the outcomes of sporting events contain enough randomness that the winner of a game is not necessarily the better team, and it is possible that a committee, although it might make mistakes, could have a higher probability of correctly identifying the best team than playoff games that are subject to football’s inherent randomness. Today’s tournament selection committee has two advantages over polls of the past: better information (many more games are televised to a national audience, video recording/playback capability allows them to see multiple games that are played at the same time, and more and deeper statistical information is available about teams and players) and better analytics (many of the top quantitative rating and evaluation systems were not developed in the polling era). As a result, the playoff selection committee is likely to have a smaller error in its evaluation of teams than polls had in the past. The progression of tournament size has actually been opposite what intuition might suggest is optimal: As human experts have been given the tools to make better judgments and decrease their likelihood of error, the tournament size has expanded, increasing the chance that a team correctly identified as the best by the human experts will fail to win the tournament.
In this paper, we investigate the optimal size of the college football national championship tournament by taking into account the relative magnitudes of the randomness inherent in college football and the errors in team evaluation by humans and algorithms.
Literature review
Optimal tournament design has been studied before, but none of the existing literature is sufficient to answer our research question. One main stream of optimal tournament size research (e.g., Dizdar 2013; Fullerton and McAfee 1999) has focused on the issue of effort, especially in research tournaments. Given assumptions on the technology and knowledge available to each firm that might enter a research competition, and on their probabilities of winning, these papers use game-theoretic models to estimate how much effort each competitor would spend, and use that analysis to determine the optimal number of participants, how to select participants, etc. Others (e.g., Chen, Ham, and Lim 2011; Hochtl et al. 2010; Sheremeta and Wu 2012) try to empirically test such predictions. These papers all start with the basic assumption that firms have more than one way to spend effort, so they might choose to put forth less effort in competitions where they are less likely to win (and thus a firm that is likely to succeed might also not need to put in maximum effort). In our work, we sidestep this issue, presuming that every team in a national championship tournament has just one football goal (to win the tournament) and will put forth maximum effort.
A second stream of research in designing optimal tournaments is not the composition, but rather the structure. Glenn (1960), Marchand (2002), Scarf and Bilbao (2006), and Seals (1963), among others, compare different tournament setups such as round-robin, pure knockout, and hybrids. In our work, we assume that the NCAA will retain its single-elimination (knockout) round-based format. Glickman (2008), Hwang (1982), and Schwenk (2000) look at adaptive approaches where tournaments may be re-seeded between rounds; in our work, we assume that the NCAA will not re-seed, so fans can make travel plans in advance (as is the case currently for the existing 68-team NCAA basketball tournament).
Other research, assuming a non-reseeded knockout tournament, looks at the optimality of the standard seeding of teams into tournament slots. Appleton (1995), Groh et al. (2012), Hennessy and Glickman (2016), Horen and Riezman (1985), Ryvkin (2005), and Vu (2010) investigate how to seed teams so that various objectives are optimized. For example, Horen and Riezman (1985) show that under some assumptions about team strength and head-to-head win probability, for a 3-round, 8-team tournament the standard seeding method does not maximize the probability that the best team will win. Hennessy and Glickman (2016) show the same empirically for 16-team tournaments, using a Bayesian approach that considers uncertainty in team strength. However, we assume that the NCAA will retain standard seeding, in order to preserve the fairness properties that Vu (2010) describes.
The objective function when referring to an “optimal” tournament can be defined in different ways. Most research assumes a goal of maximizing the probability that the tournament winner will be the best team (Appleton 1995; David 1988; Glenn 1960; Glickman 2008; Hennessy and Glickman 2016; Hwang 1982; Marchand 2002; Schwenk 2000; Seals 1963; Vu 2010); others maximize the probability that the winner will be a certain team (Vu 2010), maximize the quality of the winner’s result in research tournaments (Dizdar 2013; Fullerton and McAfee 1999), minimize the fraction of unimportant games (Scarf and Bilbao 2006; Scarf and Shi 2008), maximize the average rank of the winner (Scarf and Bilbao 2006), maximize the average revenue of the tournament (Vu 2010), maximize the probability of the top two teams meeting in the final (Hennessy and Glickman 2016), maximize the consistency between expected number of wins and team strength (Hennessy and Glickman 2016), etc. Sokol (2010) also considered the number of significant upsets in a tournament as a driver of fan interest. In this paper, we consider two objectives. The primary objective is the probability of correctly identifying the best team, i.e., the probability that the best team wins the tournament; we refer to this as the effectiveness of the tournament. The secondary objective is the probability that the true best team is included in the tournament; we refer to this as the validity of the tournament.
Finally, and critically for our work, in the previous literature the information about each team, including its strength relative to competing teams, is almost always assumed to be known deterministically. Of all the work cited above, only Glickman (2008) and Hennessy and Glickman (2016) consider uncertainty in the strength of each team; their Bayesian models address how to seed (Hennessy and Glickman 2016) or re-seed (Glickman 2008) a tournament, not how many teams should be included.
So, none of the existing literature exactly addresses the question of how many teams should be in a tournament like the college football championship given both uncertainty in team strength estimation and randomness of game results.
The remainder of the paper is organized as follows: In Section 3, we describe our underlying models of the uncertainty and randomness in the system. In Section 4, we discuss how we populate our model with empirical data and simulate tournament results. Section 5 discusses parameterizing by the relative magnitudes of randomness and uncertainty, and we use the method of Curry and Sokol (2016) to estimate those current relative magnitudes and their effects on tournament outcomes. Finally, in Section 6 we show the simulation results, and in Section 7 we discuss the implications of our work for the optimal size of the national college football championship tournament and conclude with some final remarks.
Models
In this section, we describe the core model. Here and in the remainder of the paper, we refer to a random variable by an uppercase letter and a specific realization of it by the corresponding lowercase letter (e.g., X and x).
We let
In other words, in the absence of randomness, team
Of course, true team strengths
We assume that for each observer
When teams
The outcome of a game is assumed to be based on a random process; even a perfect observer
(for whom
In reality, there is no data to tell us true team strengths
Thus, for game
There are a variety of observers who publish their estimates of team strengths
As we note in Section 4, both the Sagarin ratings’ predictions
so
The fraction of the variance in
Our tournament simulation has four basic steps:
1. Draw observed team strengths.
2. Generate true team strengths.
3. Seed the tournament based on observed team strengths.
4. Simulate the winner and loser of each tournament game based on true team strengths.
Types of simulated tournaments
There are four types of tournament setups that we simulate. In some, like the current football playoff system, the top (observed) teams are the tournament participants regardless of whether or not they are champions of their conferences. We refer to this type of tournament as a fully-open tournament. In others, the champions of the five power conferences (the “Power Five”) are guaranteed places in the tournament regardless of their ratings; we refer to this type as a partially-open tournament.
Some proposed tournament setups have included guaranteeing that the highest-ranked non-Power-Five team would be included in the tournament. This guarantee could be included in both fully-open and partially-open tournaments, yielding the full set of four tournament types that we test (see Table 1). The non-Power-Five teams are from the American Athletic Conference (AAC), Conference USA (C-USA), Mid-American Conference (MAC), Mountain West Conference (MWC), and Sun Belt Conference (Sun Belt), which collectively are called the “Group of Five” conferences, plus any teams that are independent (not playing in a conference, but part of the FBS; in our simulations we do not include Notre Dame in this category because they are viewed like a Power-Five team, and in fact they play in a Power-Five conference for non-football sports).
Table 1. Tournament types tested.
Because we are going to compare different types of tournaments as well as tournament sizes, we split the process into two parts. In each run of the simulation, we first generate a set of teams with observed and real strengths, using Steps 1 and 2. Then, we simulate each type and size of tournament using Steps 3 and 4. We next describe in more detail each of the steps.
Drawing observed team strengths
We use Sagarin rating data for the past eleven years, 2009-2019, as the set of empirically observed team ratings (we use only ratings for teams in the NCAA’s FBS). There were 120 FBS teams in 2009-2011, 124 in 2012, 125 in 2013, 128 in 2014-2016, and 130 in 2017-2019. The ratings varied from a high of 105.35 (Clemson in 2016) to a low of 30.72 (Massachusetts in 2019). The overall distribution of ratings passes the Anderson-Darling and Kolmogorov-Smirnov tests for normality, but we observed in the normal probability plot that the tails are not a good fit. Because the behavior of the upper tail (i.e., the best teams) is a primary focus of this paper, we therefore chose not to model the observed ratings with a normal distribution; instead, we used the eleven years (1383 data points) of Sagarin data as an empirical distribution. Tables 10, 11, and 12 in Appendix 1 show the full set of Sagarin ratings from 2009 to 2019.
For partially-open tournaments, it is important to know which teams are the Power-Five conference champions and which is the top-rated non-Power-Five team. Therefore, we keep that data separate, and draw from those empirical distributions separately. Tables 13 and 14 show the Power Five conference champions and top-rated non-Power-Five teams from 2009-2019.
For each of the simulated data sets (one for each run of the simulation), we draw a full set of observed team strengths from these empirical distributions, drawing the Power-Five conference champions and the top non-Power-Five team separately as described above.
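As a concrete illustration of this step, a minimal sketch (assuming straightforward resampling from the pooled historical ratings; the function and variable names here are ours, not the paper's) might look like the following:

```python
import numpy as np

def draw_observed_strengths(pooled_ratings, n_teams, rng):
    """Draw one simulated season of observed team strengths by resampling
    (with replacement) from the empirical distribution of historical
    Sagarin ratings, then sort from strongest to weakest observed team."""
    draws = rng.choice(pooled_ratings, size=n_teams, replace=True)
    return np.sort(draws)[::-1]

rng = np.random.default_rng(2014)
# Stand-in for the 1383 historical ratings in Appendix 1 (illustrative values only).
pooled_ratings = rng.uniform(30.0, 106.0, size=1383)
observed = draw_observed_strengths(pooled_ratings, n_teams=128, rng=rng)
```

In the paper's partially-open variants, the Power-Five conference champions and the top non-Power-Five team would be drawn from their own empirical distributions (Tables 13 and 14) rather than from the pooled ratings.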
Generating true team strengths
To simulate games in a tournament, we need each team’s true strength, not just its observed strength.
Given the set of observed team strengths, we use a conditional probability approach to randomly generate true team strengths. The normality of the overall distribution of Sagarin ratings allows us to model the distribution of team observed strength
In Appendix 3, we show that
In Equation (9), μSag and
For a single value of
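The paper's exact conditional distribution (Equation (9) and Appendix 3) is not reproduced in the text above, but under the common assumption that an observed rating equals true strength plus independent normal estimation error, and that ratings overall are approximately normal with mean μ_Sag and standard deviation σ_Sag, the conditional draw of a true strength given an observed rating reduces to a normal "shrinkage" update. A sketch under those assumptions (names are ours):

```python
import numpy as np

def draw_true_strength(observed, mu_sag, sigma_sag, sigma_e, rng):
    """Draw a true team strength conditional on its observed rating.

    Assumes: observed = true + Normal(0, sigma_e^2), with true strengths
    normally distributed.  Then the marginal variance of observed ratings
    (sigma_sag^2) decomposes into true-strength variance plus sigma_e^2,
    and the conditional distribution of true given observed is normal.
    """
    var_true = max(sigma_sag**2 - sigma_e**2, 1e-9)  # true-strength variance
    w = var_true / (var_true + sigma_e**2)           # shrinkage weight on the observation
    cond_mean = mu_sag + w * (observed - mu_sag)
    cond_sd = np.sqrt(w * sigma_e**2)
    return rng.normal(cond_mean, cond_sd)
```

For example, with σ_E = 0 the draw returns the observed rating exactly, and as σ_E grows the draw is pulled toward the overall mean and becomes noisier.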
Seeding the tournament
Because there are approximately 128 teams playing FBS-level football each year, and 128 is a convenient power of 2 for a single-elimination tournament, we use 128-team tournaments in our simulations. In a full 128-team tournament, the teams are seeded into a 7-round single-elimination structure in order of observed rating. In the first round, the highest-rated teams are paired with the lowest-rated teams under standard seeding (the top seed plays the 128th seed, the second seed plays the 127th, and so on), as illustrated for a size-8 tournament in Figure 1.

Figure 1. Structure of a 3-round size-8 tournament.
In a size-128 tournament with fewer than 128 teams, a team automatically advances to the next round if it has no opponent in the current round. For example, in Figure 1, if there were only three teams, Teams 1, 2, and 3 would have no opponents in the first round, so they would automatically advance to the second round. In the second round, Team 1 would again have no opponent (since it would normally play the winner of the game between Teams 4 and 5), so it would automatically advance to the third round, where it would play against the winner of the second-round game between Teams 2 and 3. Automatic advancement in the absence of an opponent is called a bye.
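A compact way to express the standard (non-reseeded) seeding and the bye rule described above is to generate the bracket's seed order recursively and then treat any slot whose seed number exceeds the number of participating teams as empty. This is an illustrative sketch, not the paper's code:

```python
def standard_seed_order(num_slots):
    """Seed order of a standard single-elimination bracket with num_slots slots
    (a power of two): seeds 1 and 2 can meet only in the final, seeds 1-4 only
    in the semifinals, and so on.  For 8 slots this yields
    [1, 8, 4, 5, 2, 7, 3, 6], matching the structure in Figure 1."""
    order = [1]
    while len(order) < num_slots:
        size = 2 * len(order)
        order = [s for seed in order for s in (seed, size + 1 - seed)]
    return order

def first_round(n_teams, num_slots=128):
    """First-round games and byes when only seeds 1..n_teams are present."""
    slots = standard_seed_order(num_slots)
    games, byes = [], []
    for a, b in zip(slots[0::2], slots[1::2]):
        present = [s for s in (a, b) if s <= n_teams]
        if len(present) == 2:
            games.append((a, b))      # both slots filled: a real game
        elif len(present) == 1:
            byes.append(present[0])   # missing opponent: automatic advance
    return games, byes

# With three teams, seeds 1, 2, and 3 all receive first-round byes, as in the text.
assert first_round(3, num_slots=8) == ([], [1, 2, 3])
```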
The teams that play in the tournament are selected as in Table 1, according to their observed team strengths. For example, in an 8-team partially-open tournament where the Power-Five conference champions and the highest-rated non-Power-Five team are all guaranteed places in the tournament, the eight teams would be the five Power-Five conference champions, the highest-rated non-Power-Five team, and the two highest-rated other teams. In partially-open tournaments, we assume that teams are seeded based on their ratings without regard to conference championship or Power-Five status, similar to the NCAA basketball tournament. For example, if a conference champion is the 8th-highest-rated team among the tournament participants, then that team is seeded 8th despite being one of the five teams automatically selected for the tournament.
Simulating the tournament
For each simulated tournament, we calculate the probability of each team winning based on the teams’ true (simulated) team strengths and the variance of in-game randomness.
Let
Let
The simulation procedure described above depends on two different parameters: the variance of in-game randomness and the variance of the observer’s error in estimating team strengths.
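Although the paper's exact win-probability expression is not reproduced in the text above, a game simulation consistent with a point-margin model (true-strength difference plus normal in-game noise, whose variance Section 5 estimates at roughly 167 or 194) can be sketched as follows; the function names and the closed-form check are ours:

```python
import math
import numpy as np

def simulate_game(strength_a, strength_b, sigma_r, rng):
    """Simulate one game: the margin is the true-strength difference plus
    Normal(0, sigma_r^2) in-game randomness.  Returns True if team A wins."""
    margin = (strength_a - strength_b) + rng.normal(0.0, sigma_r)
    return margin > 0

def win_probability(strength_a, strength_b, sigma_r):
    """Closed-form win probability for team A under the same margin model:
    Phi((s_a - s_b) / sigma_r), where Phi is the standard normal CDF."""
    z = (strength_a - strength_b) / sigma_r
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: a 7-point favorite with randomness variance 167 (sigma_r of about 12.9)
# wins about 71% of the time.
print(win_probability(80.0, 73.0, math.sqrt(167.0)))
```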
Bounding and reducing the set of parameter values
The ability to parameterize can be valuable for extending this work to other tournaments; however, for the college football national championship tournament using Sagarin ratings as the observed team strengths, we can significantly reduce the relevant set of parameter values.
First, we deduce an upper bound on σ
A second upper bound on σ
Equation (8) also provides a value for σ
Finally, we can also derive an approximate lower bound on reasonable values of σ
The distribution of the average difference between observed margin
Taken together, the bounds yield 4.5 ≤ σ
Estimating the actual value of σ_E,Sag
We know from Equation (8) that our model’s variance in prediction error is equal to the sum of the variance due to in-game randomness and the variance due to estimation error. The rematch data are shown in Appendix 4, in Table 15. We first adjust each line and each outcome by 3 points in favor of the road team to account for the value of playing at home, which models generally estimate at approximately three points (see, for example, Sagarin). We then use the models of Curry and Sokol (2016) to estimate the fraction of the variance in the line estimation error that is due to randomness. Their models’ maximum likelihood estimates are that approximately 167 or 194 (depending on the model) of the variance in the line estimation error is due to in-game randomness.
Even with an estimate for the randomness component of total variance and an estimate for the Sagarin ratings’ estimation error, we still do not know the variance contributed by the error in the tournament selection committee’s evaluation of teams. Of course, zero is an (unattainable) lower bound, but we also test a range of larger values of σ_E, up to 7.
Tables 6, 7, 8, and 9 show the results of our simulations. The results show the expected tradeoffs: The larger the tournament, the higher the validity, while effectiveness varies depending on how much of the observed prediction error is due to randomness and how much is due to incorrect team strength estimates.
Tables 2-5 report the validity and effectiveness of each tournament type and size when σ_E ∈ {0, 1, 2, 3, 4, 5, 6, 7}, for each of the two estimated in-game randomness variances; Tables 6-9 report the corresponding full simulation results.
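For clarity about how the two measures reported in these tables are computed from the simulation runs, a minimal summary helper (our naming, not the paper's) is:

```python
import numpy as np

def summarize_runs(best_team_in_field, best_team_won):
    """Monte Carlo estimates of the paper's two measures over simulation runs.

    best_team_in_field[k]: True if the true best team was selected into the
        tournament in run k (validity is the fraction of such runs).
    best_team_won[k]: True if the true best team won the championship in run k
        (effectiveness is the fraction of such runs).
    """
    validity = np.mean(np.asarray(best_team_in_field, dtype=float))
    effectiveness = np.mean(np.asarray(best_team_won, dtype=float))
    return validity, effectiveness
```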
Even in the case where the standard error of the committee’s team strength estimates is as high as 7 points, the validity results for fully-open tournaments show that the true best team is about 35% likely to be ranked highest; of course, as the standard error decreases, the validity increases to the expected maximum of 100% when the committee makes no errors.
As a result, except where the only team to automatically qualify for the tournament is the top non-Power-Five team (which is unlikely to be the true best), a 1-team tournament clearly has the highest effectiveness as long as the committee’s standard error of team strength estimate is 4 points or lower. When that standard error is 5, the 1-team tournament generally has the highest effectiveness by 1-2%, and for standard errors of 6 or 7 larger tournaments are better. For all tournament types and sizes, the effectiveness of the most-effective tournament decreases as the standard error of the committee’s estimates increases.
Another consequence of the committee’s estimation error being relatively low is that the simulation results show effectiveness tiers not by number of rounds of a tournament (e.g., tournaments of size 5-8 require three rounds to determine a champion; 9-16 teams require 4 rounds, etc.), but by the number of games that the top-ranked team needs to play. For example, the effectiveness of a 3-round tournament with 8 teams is more similar to the effectiveness of a 4-round tournament with 9 teams than to a 3-round tournament with 7 teams. The reason is that in a 7-team tournament, the top-ranked team (which is reasonably likely to be the true best team) gets a bye in the first round; without an 8th team, the top-ranked team automatically advances to the second round. So, the top-ranked team has only two chances for in-game randomness to cause it to be upset, whereas in an 8- or 9-team tournament, the top-ranked team has three such chances.
Our results show, therefore, tiers of similar effectiveness: Tournaments of size 2 or 3 have similar effectiveness, as do tournaments of size 4 through 7, tournaments of size 8 through 15, and tournaments of size 16 and greater. (Tournaments of size 32 to 63, size 64 to 127, and size 128 each require the top-ranked team to play an additional game; however, the probability of the top-ranked team defeating the 32nd, 64th, or 128th-ranked team is sufficiently high even with in-game randomness that there is not much impact on effectiveness.)
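The tier boundaries follow directly from the bye structure: in a non-reseeded 128-slot bracket, the top seed plays a game only in rounds where its opposing sub-bracket contains at least one entered team, which works out to floor(log2(n)) games for an n-team tournament. A quick check (illustrative code, not from the paper):

```python
from math import floor, log2

def games_for_top_seed(n_teams):
    """Games the top seed must win in a standard-seeded, non-reseeded
    single-elimination bracket with byes for missing opponents."""
    return floor(log2(n_teams))

# Matches the tiers in the text: sizes 2-3 -> 1 game, 4-7 -> 2, 8-15 -> 3,
# 16-31 -> 4, ..., 128 -> 7.
assert [games_for_top_seed(n) for n in (2, 3, 7, 8, 9, 15, 16, 128)] == [1, 1, 2, 3, 3, 3, 4, 7]
```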
All of these observations hold whether the in-game randomness has a variance of 167 or 194, just with slightly different magnitudes.
The current playoff system is a fully-open 4-team tournament. Most of the main criticisms and defenses of this system can be phrased in the language of effectiveness and validity. Validity-based arguments, which hold that the best team might be left out of the tournament without expansion and/or guarantees of inclusion, include the claims that the current tournament might exclude the true best team in some years, that every Power-Five conference winner should have a chance to play for the championship, and that top non-Power-Five teams deserve a chance. Effectiveness-based arguments, which hold that the best team might be less likely to win an expanded tournament, include the concern that the field of an expanded tournament would be diluted. Aside from validity and effectiveness, there are also fan-based and economic arguments: Fans want to see the championship decided by teams playing each other, and a larger tournament with more playoff games might increase both revenue and the number of fans who can attend a playoff game. In this section, we address the first two sets of questions by examining our simulated tournaments’ validity and effectiveness, and then discuss the implications for the fan-based and economic arguments.
In Section 6, we show simulation results for committees that range from perfect (σ_E = 0) to having a standard error as high as 7 points in their team strength estimates (σ_E = 7).
Tables 2 and 3 show the validity and effectiveness of the current 4-team fully-open tournament, as well as the validity and effectiveness of all four types of 8-team tournaments we tested. Increasing the tournament from 4 to 8 teams while retaining its fully-open character decreases effectiveness by 4-6%, while increasing validity by 2-9%. However, the tradeoff is smaller when all conference champions and the top non-Power-Five team are guaranteed spots in an 8-team tournament; in that case, effectiveness decreases by just 2-3% while validity increases by 1-4%. In essence, the simulation results suggest that replacing the current 4-team tournament with an 8-team tournament that guarantees spots for conference champions and the top non-Power-Five team would not significantly change either validity or effectiveness: effectiveness is not significantly decreased by the larger field, and while validity does not increase significantly, giving opportunities to all conference champions and the top non-Power-Five team does not hurt effectiveness.
Tables 2 and 3 also show that 7-team tournaments have the potential for good effectiveness/validity tradeoffs. A 7-team fully-open tournament provides validity increases of 2-7% while decreasing effectiveness by just 0.3-1.1% compared with the current 4-team fully-open tournament, and if σ
On the other hand, newer proposals for a 12-team tournament (e.g., CollegeFootballPlayoff.com (2021)) would have a greater impact on validity and effectiveness. Tables 4 and 5 show the validity and effectiveness of the current 4-team fully-open tournament as well as the validity and effectiveness of all four types of 12-team tournaments we tested. The simulation results show that expanding the tournament to 12 teams would decrease effectiveness by 5-6% regardless of tournament type, while increasing validity by 2-12%. Unlike expansion to 7 or 8 teams, where there might be little difference, changing from a 4-team tournament to a 12-team tournament involves a definite effectiveness/validity tradeoff in addition to the economic and fan effects.
Overall, when considering only 4-team and 8-team tournaments, our simulations suggest that the annual debate over tournament size may be much ado about nothing. Replacing the current 4-team fully-open tournament with an 8-team tournament with guarantees for conference champions and the top non-Power-Five team is likely to lead only to small changes in both validity and effectiveness. As a result, decision-makers can give full consideration to fan and economic issues. It is also possible to obtain increased validity with only a very small effectiveness change by switching to a 7-team partially-open tournament, albeit with one fewer playoff game (and the resulting fan and economic effects) than an 8-team tournament would require. On the other hand, if 12-team tournaments are under consideration, there is a distinct validity/effectiveness tradeoff involved: the tournament would be 2-12% more likely to include the true best team, but that true best team would be 5-6% less likely to be correctly identified by winning the championship.
