Abstract
Introduction
In 2012, the National Collegiate Athletic Association (NCAA) approved using a four-team postseason playoff tournament to determine a national champion in college football at the FBS level (the highest level of intercollegiate competition), starting in 2014. The decision effectively doubled the number of teams involved in the postseason tournament, and there was immediate discussion, which continues today, about whether an 8-team tournament (or larger) would be even better. In this paper, we address the question of the optimal size of this postseason tournament.
Each year since at least 1936, a national champion has been chosen among FBS (formerly Division I-A) college football teams. Initially, the champion was chosen by polls of experts. These expert polls included the Associated Press (AP) poll of sportswriters (the first major national poll) from 1936 to 1997, and various polls of coaches from 1950 to 1997 (e.g., United Press International from 1950 to 1990, USA Today/CNN from 1991 to 1996, and USA Today/ESPN in 1997). The highest-ranked team in the polls was declared national champion; in the few years that the AP and coaches’ polls disagreed, the national championship was shared between the two polls’ winners (National Collegiate Athletic Association). In 1998, a new system, the Bowl Championship Series (BCS), was created. In the BCS, expert poll rankings and analytical rankings (called “computer rankings”) were combined to determine a ranking of the top teams. From 1998 to 2005, the BCS-designated national champion was unofficial (and for the 2003 season, the AP poll chose a different champion). From 2005 to 2013, the top two teams in the BCS played in a designated postseason game, with the winner being named the official national champion (National Collegiate Athletic Association). Beginning in 2014, a new system has been in place: A panel of experts selects four teams to play a three-game, two-round single-elimination tournament, the winner of which is named national champion.
The progression from polls (effectively a one-team tournament) to the BCS championship game (a two-team tournament) to a four-team tournament has been motivated partly by economics (the paid attendance and television revenue of playoff games are substantial), but even more so by a grassroots feeling that it is necessary to determine a champion “on the field” (i.e., by playing games) rather than by poll or computer, because neither voters nor algorithms are guaranteed to identify the absolute best team(s). Before the BCS system, the top teams in the polls rarely played each other at the end of the season, and it could be difficult to differentiate between the best teams. The novelty of the BCS system was that the two highest-ranked teams were guaranteed to play each other at the end of the season, and the argument in favor of having even more teams in the championship tournament is that even the top two can be difficult to differentiate from a larger set of very good teams, so letting the teams play each other is the fairest way to sort out which one is really the best. (Even in the BCS system, there could be significant disagreement as to the selection of the two playoff teams, for example in 2004 when three major-conference teams (USC, Oklahoma, and Auburn) each won all of their regular-season games.)
On the other hand, the outcomes of sporting events contain enough randomness that the winner of a game is not necessarily the better team, and it is possible that a committee, although it might make mistakes, could have a higher probability of correctly identifying the best team than playoff games that are subject to football’s inherent randomness. Today’s tournament selection committee has two advantages over polls of the past: better information (many more games are televised to a national audience, video recording/playback capability allows them to see multiple games that are played at the same time, and more and deeper statistical information is available about teams and players) and better analytics (many of the top quantitative rating and evaluation systems were not developed in the polling era). As a result, the playoff selection committee is likely to have a smaller error in its evaluation of teams than polls had in the past. The progression of tournament size has actually been opposite what intuition might suggest is optimal: As human experts have been given the tools to make better judgments and decrease their likelihood of error, the tournament size has expanded, increasing the chance that a team correctly identified as the best by the human experts will fail to win the tournament.
In this paper, we investigate the optimal size of the college football national championship tournament by taking into account the relative magnitudes of the randomness inherent in college football and the errors in team evaluation by humans and algorithms.
Literature review
Optimal tournament design has been studied before, but none of the existing literature is sufficient to answer our research question. One main stream of optimal tournament size research (e.g., Dizdar 2013; Fullerton and McAfee 1999) has focused on the issue of effort, especially in research tournaments. Given assumptions on the technology and knowledge available to each firm that might enter a research competition, and on their probabilities of winning, these papers use game-theoretic models to estimate how much effort each competitor would spend, and use that analysis to determine the optimal number of participants, how to select participants, etc. Others (e.g., Chen, Ham, and Lim 2011; Hochtl et al. 2010; Sheremeta and Wu 2012) try to empirically test such predictions. These papers all start with the basic assumption that firms have more than one way to spend effort, so they might choose to put forth less effort in competitions where they are less likely to win (and thus a firm that is likely to succeed might also not need to put in maximum effort). In our work, we sidestep this issue, presuming that every team in a national championship tournament has just one football goal (to win the tournament) and will put forth maximum effort.
A second stream of research in designing optimal tournaments is not the composition, but rather the structure. Glenn (1960), Marchand (2002), Scarf and Bilbao (2006), and Seals (1963), among others, compare different tournament setups such as round-robin, pure knockout, and hybrids. In our work, we assume that the NCAA will retain its single-elimination (knockout) round-based format. Glickman (2008), Hwang (1982), and Schwenk (2000) look at adaptive approaches where tournaments may be re-seeded between rounds; in our work, we assume that the NCAA will not re-seed, so fans can make travel plans in advance (as is the case currently for the existing 68-team NCAA basketball tournament).
Other research, assuming a non-reseeded knockout tournament, looks at the optimality of the standard seeding of teams into tournament slots. Appleton (1995), Groh et al. (2012), Hennessy and Glickman (2016), Horen and Riezman (1985), Ryvkin (2005), and Vu (2010) investigate how to seed teams so that various objectives are optimized. For example, Horen and Riezman (1985) show that under some assumptions about team strength and head-to-head win probability, for a 3-round, 8-team tournament the standard seeding method does not maximize the probability that the best team will win. Hennessy and Glickman (2016) show the same empirically for 16-team tournaments, using a Bayesian approach that considers uncertainty in team strength. However, we assume that the NCAA will retain standard seeding, in order to preserve the fairness properties that Vu (2010) describes.
The objective function when referring to an “optimal” tournament can be defined in different ways. Most research assumes a goal of maximizing the probability that the tournament winner will be the best team (Appleton 1995; David 1988; Glenn 1960; Glickman 2008; Hennessy and Glickman 2016; Hwang 1982; Marchand 2002; Schwenk 2000; Seals 1963; Vu 2010); others maximize the probability that the winner will be a certain team (Vu 2010), maximize the quality of the winner’s result in research tournaments (Dizdar 2013; Fullerton and McAfee 1999), minimize the fraction of unimportant games (Scarf and Bilbao 2006; Scarf and Shi 2008), maximize the average rank of the winner (Scarf and Bilbao 2006), maximize the average revenue of the tournament (Vu 2010), maximize the probability of the top two teams meeting in the final (Hennessy and Glickman 2016), maximize the consistency between expected number of wins and team strength (Hennessy and Glickman 2016), etc. Sokol (2010) also considered the number of significant upsets in a tournament as a driver of fan interest. In this paper, we consider two objectives. The primary objective is the probability of correctly identifying the best team, i.e., the probability that the best team wins the tournament; we refer to this as the effectiveness of the tournament. The secondary objective is the probability that the true best team is included in the tournament; we refer to this as the validity of the tournament.
Finally, and critically for our work, in the previous literature the information about each team, including its strength relative to competing teams, is almost always assumed to be known deterministically. Of all the work cited above, only Glickman (2008) and Hennessy and Glickman (2016) consider uncertainty in the strength of each team; their Bayesian models address how to seed (Hennessy and Glickman 2016) or re-seed (Glickman 2008) a tournament, not how many teams should be included.
So, none of the existing literature exactly addresses the question of how many teams should be in a tournament like the college football championship given both uncertainty in team strength estimation and randomness of game results.
The remainder of the paper is organized as follows: In Section 3, we describe our underlying models of the uncertainty and randomness in the system. In Section 4, we discuss how we populate our model with empirical data and simulate tournament results. Section 5 discusses parameterizing by the relative magnitudes of randomness and uncertainty, and we use the method of Curry and Sokol (2016) to estimate those current relative magnitudes and their effects on tournament outcomes. Finally, in Section 6 we show the simulation results, and in Section 7 we discuss the implications of our work for the optimal size of the national college football championship tournament and conclude with some final remarks.
Models
In this section, we describe the core model. Here and in the remainder of the paper, we refer to a random variable by an uppercase letter and a specific realization of it by the corresponding lowercase letter (e.g., X and x).
We let
In other words, in the absence of randomness, team
Of course, true team strengths
We assume that for each observer
When teams
The outcome of a game is assumed to be based on a random process; even a perfect observer
(for whom
In reality, there is no data to tell us true team strengths
Thus, for game
There are a variety of observers who publish their estimates of team strengths
As we note in Section 4, both the Sagarin ratings’ predictions
so
The fraction of the variance in
Our tournament simulation has four basic steps:
1. Draw observed team strengths.
2. Generate true team strengths.
3. Seed the tournament based on observed team strengths.
4. Simulate the winner and loser of each tournament game based on true team strengths.
Types of simulated tournaments
There are four types of tournament setups that we simulate. In some, like the current football playoff system, the top (observed) teams are the tournament participants regardless of whether or not they are champions of their conferences. We refer to this type of tournament as a fully-open tournament. In others, the champions of the five power conferences (the “Power Five”) are guaranteed places in the tournament regardless of their ratings; we refer to this type as a partially-open tournament.
Some proposed tournament setups have included guaranteeing that the highest-ranked non-Power-Five team would be included in the tournament. This guarantee could be included in both fully-open and partially-open tournaments, yielding the full set of four tournament types that we test (see Table 1). The non-Power-Five teams are from the American Athletic Conference (AAC), Conference USA (C-USA), Mid-American Conference (MAC), Mountain West Conference (MWC), and Sun Belt Conference (Sun Belt), which collectively are called the “Group of Five” conferences, plus any teams that are independent (not playing in a conference, but part of the FBS; in our simulations we do not include Notre Dame in this category because they are viewed like a Power-Five team, and in fact they play in a Power-Five conference for non-football sports).
Table 1. Tournament types tested.
Because we are going to compare different types of tournaments as well as tournament sizes, we split the process into two parts. In each run of the simulation, we first generate a set of teams with observed and real strengths, using Steps 1 and 2. Then, we simulate each type and size of tournament using Steps 3 and 4. We next describe in more detail each of the steps.
Drawing observed team strengths
We use Sagarin rating data for the past eleven years, 2009-2019, as the set of empirically observed team ratings (we use only ratings for teams in the NCAA’s FBS). There were 120 FBS teams in 2009-2011, 124 in 2012, 125 in 2013, 128 in 2014-2016, and 130 in 2017-2019. The ratings varied from a high of 105.35 (Clemson in 2016) to a low of 30.72 (Massachusetts in 2019). The overall distribution of ratings passes the Anderson-Darling and Kolmogorov-Smirnov tests for normality, but we observed in the normal probability plot that the tails are not a good fit. Because the behavior of the upper tail (i.e., the best teams) is a primary focus of this paper, we therefore chose not to model the observed ratings with a normal distribution; instead, we used the eleven years (1383 data points) of Sagarin data as an empirical distribution. Tables 10, 11, and 12 in Appendix 1 show the full set of Sagarin ratings from 2009 to 2019.
For partially-open tournaments, it is important to know which teams are the Power-Five conference champions and which is the top-rated non-Power-Five team. Therefore, we keep that data separate, and draw from those empirical distributions separately. Tables 13 and 14 show the Power Five conference champions and top-rated non-Power-Five teams from 2009-2019.
For each of the simulated data sets (one for each run of the simulation), we draw a full set of observed team strengths from these empirical distributions, drawing the Power-Five conference champions and the top non-Power-Five team separately as described above.
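As a concrete illustration of this step, a minimal sketch (assuming straightforward resampling from the pooled historical ratings; the function and variable names here are ours, not the paper's) might look like the following:

```python
import numpy as np

def draw_observed_strengths(pooled_ratings, n_teams, rng):
    """Draw one simulated season of observed team strengths by resampling
    (with replacement) from the empirical distribution of historical
    Sagarin ratings, then sort from strongest to weakest observed team."""
    draws = rng.choice(pooled_ratings, size=n_teams, replace=True)
    return np.sort(draws)[::-1]

rng = np.random.default_rng(2014)
# Stand-in for the 1383 historical ratings in Appendix 1 (illustrative values only).
pooled_ratings = rng.uniform(30.0, 106.0, size=1383)
observed = draw_observed_strengths(pooled_ratings, n_teams=128, rng=rng)
```

In the paper's partially-open variants, the Power-Five conference champions and the top non-Power-Five team would be drawn from their own empirical distributions (Tables 13 and 14) rather than from the pooled ratings.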
Generating true team strengths
To simulate games in a tournament, we need each team’s true strength, not just its observed strength.
Given the set of observed team strengths, we use a conditional probability approach to randomly generate true team strengths. The normality of the overall distribution of Sagarin ratings allows us to model the distribution of team observed strength
In Appendix 3, we show that
In Equation (9), μSag and
For a single value of
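The paper's exact conditional distribution (Equation (9) and Appendix 3) is not reproduced in the text above, but under the common assumption that an observed rating equals true strength plus independent normal estimation error, and that ratings overall are approximately normal with mean μ_Sag and standard deviation σ_Sag, the conditional draw of a true strength given an observed rating reduces to a normal "shrinkage" update. A sketch under those assumptions (names are ours):

```python
import numpy as np

def draw_true_strength(observed, mu_sag, sigma_sag, sigma_e, rng):
    """Draw a true team strength conditional on its observed rating.

    Assumes: observed = true + Normal(0, sigma_e^2), with true strengths
    normally distributed.  Then the marginal variance of observed ratings
    (sigma_sag^2) decomposes into true-strength variance plus sigma_e^2,
    and the conditional distribution of true given observed is normal.
    """
    var_true = max(sigma_sag**2 - sigma_e**2, 1e-9)  # true-strength variance
    w = var_true / (var_true + sigma_e**2)           # shrinkage weight on the observation
    cond_mean = mu_sag + w * (observed - mu_sag)
    cond_sd = np.sqrt(w * sigma_e**2)
    return rng.normal(cond_mean, cond_sd)
```

For example, with σ_E = 0 the draw returns the observed rating exactly, and as σ_E grows the draw is pulled toward the overall mean and becomes noisier.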
Seeding the tournament
Because there are approximately 128 teams playing FBS-level football each year, and 128 is a convenient power of 2 for a single-elimination tournament, we use 128-team tournaments in our simulations. In a full 128-team tournament, the teams are seeded into a 7-round single-elimination structure in order of observed rating. In the first round, the highest-rated teams are paired with the lowest-rated teams under standard seeding (the top seed plays the 128th seed, the second seed plays the 127th, and so on), as illustrated for a size-8 tournament in Figure 1.

Figure 1. Structure of a 3-round size-8 tournament.
In a size-128 tournament with fewer than 128 teams, a team automatically advances to the next round if it has no opponent in the current round. For example, in Figure 1, if there were only three teams, Teams 1, 2, and 3 would have no opponents in the first round, so they would automatically advance to the second round. In the second round, Team 1 would again have no opponent (since it would normally play the winner of the game between Teams 4 and 5), so it would automatically advance to the third round, where it would play against the winner of the second-round game between Teams 2 and 3. Automatic advancement in the absence of an opponent is called a bye.
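A compact way to express the standard (non-reseeded) seeding and the bye rule described above is to generate the bracket's seed order recursively and then treat any slot whose seed number exceeds the number of participating teams as empty. This is an illustrative sketch, not the paper's code:

```python
def standard_seed_order(num_slots):
    """Seed order of a standard single-elimination bracket with num_slots slots
    (a power of two): seeds 1 and 2 can meet only in the final, seeds 1-4 only
    in the semifinals, and so on.  For 8 slots this yields
    [1, 8, 4, 5, 2, 7, 3, 6], matching the structure in Figure 1."""
    order = [1]
    while len(order) < num_slots:
        size = 2 * len(order)
        order = [s for seed in order for s in (seed, size + 1 - seed)]
    return order

def first_round(n_teams, num_slots=128):
    """First-round games and byes when only seeds 1..n_teams are present."""
    slots = standard_seed_order(num_slots)
    games, byes = [], []
    for a, b in zip(slots[0::2], slots[1::2]):
        present = [s for s in (a, b) if s <= n_teams]
        if len(present) == 2:
            games.append((a, b))      # both slots filled: a real game
        elif len(present) == 1:
            byes.append(present[0])   # missing opponent: automatic advance
    return games, byes

# With three teams, seeds 1, 2, and 3 all receive first-round byes, as in the text.
assert first_round(3, num_slots=8) == ([], [1, 2, 3])
```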
The teams that play in the tournament are selected as in Table 1, according to their observed team strengths. For example, in an 8-team partially-open tournament where the Power-Five conference champions and the highest-rated non-Power-Five team are all guaranteed places in the tournament, the eight teams would be the five Power-Five conference champions, the highest-rated non-Power-Five team, and the two highest-rated other teams. In partially-open tournaments, we assume that teams are seeded based on their ratings without regard to conference championship or Power-Five status, similar to the NCAA basketball tournament. For example, if a conference champion is the 8th-highest-rated team among the tournament participants, then that team is seeded 8th despite being one of the five teams automatically selected for the tournament.
Simulating the tournament
For each simulated tournament, we calculate the probability of each team winning based on the teams’ true (simulated) team strengths and the variance of in-game randomness.
Let
Let
The simulation procedure described above depends on two different parameters: the variance of in-game randomness and the variance of the observer’s error in estimating team strengths.
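Although the paper's exact win-probability expression is not reproduced in the text above, a game simulation consistent with a point-margin model (true-strength difference plus normal in-game noise, whose variance Section 5 estimates at roughly 167 or 194) can be sketched as follows; the function names and the closed-form check are ours:

```python
import math
import numpy as np

def simulate_game(strength_a, strength_b, sigma_r, rng):
    """Simulate one game: the margin is the true-strength difference plus
    Normal(0, sigma_r^2) in-game randomness.  Returns True if team A wins."""
    margin = (strength_a - strength_b) + rng.normal(0.0, sigma_r)
    return margin > 0

def win_probability(strength_a, strength_b, sigma_r):
    """Closed-form win probability for team A under the same margin model:
    Phi((s_a - s_b) / sigma_r), where Phi is the standard normal CDF."""
    z = (strength_a - strength_b) / sigma_r
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example: a 7-point favorite with randomness variance 167 (sigma_r of about 12.9)
# wins about 71% of the time.
print(win_probability(80.0, 73.0, math.sqrt(167.0)))
```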
Bounding and reducing the set of parameter values
The ability to parameterize can be valuable for extending this work to other tournaments; however, for the college football national championship tournament using Sagarin ratings as the observed team strengths, we can significantly reduce the relevant set of parameter values.
First, we deduce an upper bound on σ
A second upper bound on σ
Equation (8) also provides a value for σ
Finally, we can also derive an approximate lower bound on reasonable values of σ
The distribution of the average difference between observed margin
Taken together, the bounds yield 4.5 ≤ σ
Estimating the actual value of σ_E,Sag
We know from Equation (8) that our model’s variance in prediction error is equal to the sum of the variance due to in-game randomness and the variance due to estimation error. The rematch data are shown in Appendix 4, in Table 15. We first adjust each line and each outcome by 3 points in favor of the road team to account for the value of playing at home, which models generally estimate at approximately three points (see, for example, Sagarin). We then use the models of Curry and Sokol (2016) to estimate the fraction of the variance in the line estimation error that is due to randomness. Their models’ maximum likelihood estimates are that approximately 167 or 194 (depending on the model) of the variance in the line estimation error is due to in-game randomness.
Even with an estimate for the randomness component of total variance and an estimate for the Sagarin ratings’ estimation error, we still do not know the variance contributed by the error in the tournament selection committee’s evaluation of teams. Of course, zero is an (unattainable) lower bound, but we also test a range of larger values of σ_E, up to 7.
Tables 6, 7, 8, and 9 show the results of our simulations. The results show the expected tradeoffs: The larger the tournament, the higher the validity, while effectiveness varies depending on how much of the observed prediction error is due to randomness and how much is due to incorrect team strength estimates.
Tables 2-5 report the validity and effectiveness of each tournament type and size when σ_E ∈ {0, 1, 2, 3, 4, 5, 6, 7}, for each of the two estimated in-game randomness variances; Tables 6-9 report the corresponding full simulation results.
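For clarity about how the two measures reported in these tables are computed from the simulation runs, a minimal summary helper (our naming, not the paper's) is:

```python
import numpy as np

def summarize_runs(best_team_in_field, best_team_won):
    """Monte Carlo estimates of the paper's two measures over simulation runs.

    best_team_in_field[k]: True if the true best team was selected into the
        tournament in run k (validity is the fraction of such runs).
    best_team_won[k]: True if the true best team won the championship in run k
        (effectiveness is the fraction of such runs).
    """
    validity = np.mean(np.asarray(best_team_in_field, dtype=float))
    effectiveness = np.mean(np.asarray(best_team_won, dtype=float))
    return validity, effectiveness
```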
Even in the case where the standard error of the committee’s team strength estimates is as high as 7 points, the validity results for fully-open tournaments show that the true best team is about 35% likely to be ranked highest; of course, as the standard error decreases, the validity increases to the expected maximum of 100% when the committee makes no errors.
As a result, except where the only team to automatically qualify for the tournament is the top non-Power-Five team (which is unlikely to be the true best), a 1-team tournament clearly has the highest effectiveness as long as the committee’s standard error of team strength estimate is 4 points or lower. When that standard error is 5, the 1-team tournament generally has the highest effectiveness by 1-2%, and for standard errors of 6 or 7 larger tournaments are better. For all tournament types and sizes, the effectiveness of the most-effective tournament decreases as the standard error of the committee’s estimates increases.
Another consequence of the committee’s estimation error being relatively low is that the simulation results show effectiveness tiers not by number of rounds of a tournament (e.g., tournaments of size 5-8 require three rounds to determine a champion; 9-16 teams require 4 rounds, etc.), but by the number of games that the top-ranked team needs to play. For example, the effectiveness of a 3-round tournament with 8 teams is more similar to the effectiveness of a 4-round tournament with 9 teams than to a 3-round tournament with 7 teams. The reason is that in a 7-team tournament, the top-ranked team (which is reasonably likely to be the true best team) gets a bye in the first round; without an 8th team, the top-ranked team automatically advances to the second round. So, the top-ranked team has only two chances for in-game randomness to cause it to be upset, whereas in an 8- or 9-team tournament, the top-ranked team has three such chances.
Our results show, therefore, tiers of similar effectiveness: Tournaments of size 2 or 3 have similar effectiveness, as do tournaments of size 4 through 7, tournaments of size 8 through 15, and tournaments of size 16 and greater. (Tournaments of size 32 to 63, size 64 to 127, and size 128 each require the top-ranked team to play an additional game; however, the probability of the top-ranked team defeating the 32nd, 64th, or 128th-ranked team is sufficiently high even with in-game randomness that there is not much impact on effectiveness.)
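The tier boundaries follow directly from the bye structure: in a non-reseeded 128-slot bracket, the top seed plays a game only in rounds where its opposing sub-bracket contains at least one entered team, which works out to floor(log2(n)) games for an n-team tournament. A quick check (illustrative code, not from the paper):

```python
from math import floor, log2

def games_for_top_seed(n_teams):
    """Games the top seed must win in a standard-seeded, non-reseeded
    single-elimination bracket with byes for missing opponents."""
    return floor(log2(n_teams))

# Matches the tiers in the text: sizes 2-3 -> 1 game, 4-7 -> 2, 8-15 -> 3,
# 16-31 -> 4, ..., 128 -> 7.
assert [games_for_top_seed(n) for n in (2, 3, 7, 8, 9, 15, 16, 128)] == [1, 1, 2, 3, 3, 3, 4, 7]
```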
All of these observations hold whether the in-game randomness has a variance of 167 or 194, just with slightly different magnitudes.
The current playoff system is a fully-open 4-team tournament. Most of the main criticisms and defenses of this system can be phrased in the language of effectiveness and validity. Validity-based arguments, which hold that the best team might be left out of the tournament without expansion and/or guarantees of inclusion, include the claims that the current tournament might exclude the true best team in some years, that every Power-Five conference winner should have a chance to play for the championship, and that top non-Power-Five teams deserve a chance. Effectiveness-based arguments, which hold that the best team might be less likely to win an expanded tournament, include the concern that the field of an expanded tournament would be diluted. Aside from validity and effectiveness, there are also fan-based and economic arguments: Fans want to see the championship decided by teams playing each other, and a larger tournament with more playoff games might increase both revenue and the number of fans who can attend a playoff game. In this section, we address the first two sets of questions by examining our simulated tournaments’ validity and effectiveness, and then discuss the implications for the fan-based and economic arguments.
In Section 6, we show simulation results for committees that range from perfect (σ_E = 0) to having a standard error as high as 7 points in their team strength estimates (σ_E = 7).
Tables 2 and 3 show the validity and effectiveness of the current 4-team fully-open tournament, as well as the validity and effectiveness of all four types of 8-team tournaments we tested. Increasing the tournament from 4 to 8 teams while retaining its fully-open character decreases effectiveness by 4-6%, while increasing validity by 2-9%. However, the tradeoff is smaller when all conference champions and the top non-Power-Five team are guaranteed spots in an 8-team tournament; in that case, effectiveness decreases by just 2-3% while validity increases by 1-4%. In essence, the simulation results suggest that replacing the current 4-team tournament with an 8-team tournament that guarantees spots for conference champions and the top non-Power-Five team would not significantly change either validity or effectiveness: effectiveness is not significantly decreased by the larger field, and while validity does not increase significantly, giving opportunities to all conference champions and the top non-Power-Five team does not hurt effectiveness.
Tables 2 and 3 also show that 7-team tournaments have the potential for good effectiveness/validity tradeoffs. A 7-team fully-open tournament provides validity increases of 2-7% while decreasing effectiveness by just 0.3-1.1% compared with the current 4-team fully-open tournament, and if σ
On the other hand, newer proposals for a 12-team tournament (e.g., CollegeFootballPlayoff.com (2021)) would have a greater impact on validity and effectiveness. Tables 4 and 5 show the validity and effectiveness of the current 4-team fully-open tournament as well as the validity and effectiveness of all four types of 12-team tournaments we tested. The simulation results show that expanding the tournament to 12 teams would decrease effectiveness by 5-6% regardless of tournament type, while increasing validity by 2-12%. Unlike expansion to 7 or 8 teams, where there might be little difference, changing from a 4-team tournament to a 12-team tournament involves a definite effectiveness/validity tradeoff in addition to the economic and fan effects.
Overall, when considering only 4-team and 8-team tournaments, our simulations suggest that the annual debate over tournament size may be much ado about nothing. Replacing the current 4-team fully-open tournament with an 8-team tournament with guarantees for conference champions and the top non-Power-Five team is likely to lead only to small changes in both validity and effectiveness. As a result, decision-makers can give full consideration to fan and economic issues. It is also possible to obtain increased validity with only a very small effectiveness change by switching to a 7-team partially-open tournament, albeit with one fewer playoff game (and the resulting fan and economic effects) than an 8-team tournament would require. On the other hand, if 12-team tournaments are under consideration, there is a distinct validity/effectiveness tradeoff involved: the tournament would be 2-12% more likely to include the true best team, but that true best team would be 5-6% less likely to be correctly identified by winning the championship.
