Abstract
Introduction
A considerable amount of empirical evidence indicates that restrictive zoning and land use policies, oftentimes referred to as exclusionary zoning, inflate housing prices and increase residential segregation (Glaeser and Gyourko, 2018; Rothwell and Massey, 2009; Sahn, 2022; Trounstine, 2018, 2020). Despite the clear importance of these policies, zoning and land use data, particularly of the longitudinal variety, are notoriously difficult to come by. The main sources of national zoning and land use data largely come from cross-sectional surveys. These surveys have been essential to documenting America’s zoning regime and to analysing the role land use policies play in shaping communities, influencing housing costs and shoring up residential segregation (Hsieh and Moretti, 2019; Rothwell and Massey, 2009; Trounstine, 2020). However, researchers have noted that these surveys have their limitations (Bronin, 2022; Manville et al., 2022; Rodríguez-Pose and Storper, 2020). In particular, we argue that they are costly, time intensive, subject to low response rates and measurement error and become dated soon after data collection.
As an efficient and accurate alternative to these survey methods, we access publicly available zoning and land use information and utilise natural language processing (NLP) to construct the National Zoning and Land Use Database (NZLUD). We begin by replicating selected measures using data collected from 2019 to 2022 for our target sample: the municipalities present in the 2006 Wharton Residential Land Use Regulatory Index (WRLURI) survey (Gyourko et al., 2008). We discuss the process of downloading and parsing the zoning and land use text data and address how the NZLUD measures compare to the same measures among municipalities and metropolitan statistical areas (MSAs) present in the WRLURI 2018 database, a follow-up to the WRLURI 2006 survey. We also replicate measures of maximum permitted densities and inclusionary zoning programmes from the 2003 National Longitudinal Land Use Survey (NLLUS), another land use database (Lo et al., 2019), comparing the NZLUD with estimates from the NLLUS 2019 database. Last, we combine all these measures to create the Zoning Restrictiveness Index (ZRI): a similar but distinct index compared to the WRLURI.
We show that municipalities and MSAs with the lowest and highest ZRI and WRLURI scores match reasonably well across the datasets. Moreover, sociodemographic characteristics of municipalities across the distribution of ZRI scores follow the same pattern established in previous studies (Gyourko et al., 2008, 2021). We also produce measures beyond previously existing zoning data, publishing what we believe to be the first national measures of accessory dwelling unit authorisation, building height maximums, parking requirements and the proportion of residential districts permitting multi-family housing by right. These results suggest that utilising municipal codes to construct a national zoning and land use database is a viable alternative to surveys. We make all code and data publicly available (see https://github.com/mtmleczko/nzlud), allowing the NZLUD to be replicated in future years to ensure accurate, up-to-date and longitudinal nationwide zoning and land use data.
Zoning and land use data
Zoning and land use policies have critical implications for urban inequality and arguably represent the most significant power that local governments wield (Trounstine, 2018). Minimum lot sizes, use restrictions and height restrictions comprise three universal elements of zoning and land use policy, but many others exist (Fischel, 2015). The sheer number of ways municipalities can affect development presents profound challenges to measuring exclusionary zoning (Gyourko and Molloy, 2015).
The earliest versions of zoning laws sought to promote public health and sensible land use, but these policies were soon co-opted for exclusionary ends (Pendall et al., 2006). Spearheaded by cities like San Francisco in the late-19th century and Baltimore in the early-20th century, municipalities began explicitly segregating by race in their zoning policies (Fischel, 2015; Nightingale, 2012). The Supreme Court case
A large body of empirical evidence has established a link between exclusionary zoning and higher housing prices, less housing construction and lower overall welfare (Glaeser and Gyourko, 2018; Glaeser et al., 2005; Hsieh and Moretti, 2019). Gyourko and Molloy (2015) provide a thorough review of this literature. An important and growing body of work also emphasises the role of restrictive land use policies on increasing racial segregation (Pendall, 2000; Resseger, 2013; Rothwell, 2011; Rothwell and Massey, 2009; Sahn, 2022; Trounstine, 2020) and income segregation (Lens and Monkkonen, 2016; Rothwell and Massey, 2010).
As Gyourko and Molloy (2015) note, although researchers have established the centrality of zoning policies to a wide array of social problems and conditions, they lack access to national longitudinal zoning and land use data, which would help researchers identify the causal effects of exclusionary zoning (Quigley and Rosenthal, 2005). Some longitudinal zoning data do exist, but those typically only include a single state. 1 The National Longitudinal Land Use Survey (NLLUS) is the most comprehensive national longitudinal data to date, but even these data cover only 88 MSAs between 2003 and 2019 and have even less coverage extending back to 1994 (Lo et al., 2019). Other longitudinal efforts have relied on state and federal land use and fair housing court cases to measure land use restrictiveness and fair housing injunctions (Ganong and Shoag, 2017; Trounstine, 2020). Finally, Gyourko et al. (2021) carried out the second wave of the WRLURI survey, which, when merged with the 2006 wave of the WRLURI survey produces data for 874 municipalities across two waves separated by 13 years.
Despite these important efforts, data on zoning and land use still face two major hurdles: (1) the data by and large are not longitudinal and (2) they are not regularly updated. Moreover, most of these data are based on surveys, which are time-intensive, expensive and subject to low responses rates and measurement error. We overcome these limitations by constructing the National Zoning and Land Use Database (NZLUD) using NLP of municipal codes as well as zoning and land use ordinances.
Constructing the database
To form our database, we replicated selected measures in the WRLURI 2006 and NLLUS 2003 data from municipal codes and zoning and land use ordinances collected between 2019 and 2022. 2 We begin with a description of the WRLURI and NLLUS data as background into our data collection and processing procedure.
WRLURI data
The WRLURI 2006 data represent the most widely cited and used land use data to date (Gyourko et al., 2008). They cover a wide range of measures, from density restrictions and the timeliness of development approvals to the amount of local political pressure. The authors of these data condensed 11 subindex measures into one overall index of the land use regulatory environment: the WRLURI.
To build the WRLURI dataset, researchers mailed a survey questionnaire to 6896 municipal officials between 2004 and 2006. The data contain 2729 responses, yielding a response rate of roughly 40%, representing about 60% of the surveyed population. Larger municipalities were more likely to respond to the survey, but municipalities of varying sizes were well-represented in the data with the exception of towns with populations under 2500.
In 2018, a team of researchers commissioned a new survey of land use regulations (Gyourko et al., 2021). This survey mirrored the 2006 effort, asking many of the same questions during the 2018 calendar year. The 2018 update contains 2844 responses, yielding a response rate of roughly 26%; 874 unique municipalities are represented in both datasets.
NLLUS data
Compared to the WRLURI 2006 data, the NLLUS 2003 data cover less of the country but contain a more comprehensive set of zoning and land use measures (Gallagher et al., 2019). The NLLUS 2003 data cover 36 states and Washington, DC, encompassing 91 MSAs, while the WRLURI 2006 data cover 50 states and Washington, DC, encompassing 323 MSAs. However, the NLLUS 2003 data tend to have better MSA-wide coverage, with a median of approximately 25% of municipalities represented, versus roughly 12% in the WRLURI 2006 data.
The NLLUS 2003 data offer a complementary picture of the zoning and land use regulatory environment compared with the WRLURI 2006 data. They cover many of the same measures in addition to other important measures, such as the maximum permitted density in a municipality. The NLLUS 2003 survey generally targeted municipalities with populations over 10,000 in the 50 largest consolidated MSAs (CMSAs) and MSAs and produced a dataset of 1845 jurisdictions, with a response rate of 78%. The NLLUS 2019 survey carried out a similar sampling procedure but produced data for only 1703 jurisdictions and yielded a response rate of 58% (Lo et al., 2019). 3
Survey limitations
Fielding surveys, particularly national surveys, is expensive and time-consuming. This poses a challenge to any large-scale survey, but those of zoning and land use policies are especially susceptible to the inefficiency of survey data collection because zoning amendments take place regularly (Fischel, 2015). Consequently, some zoning data collected via survey could be inaccurate (or incomplete) by the time it is published.
Moreover, while survey efforts have yielded critical information, these surveys tend to have response rates below, and sometimes well below, 60%, lowering confidence in the data’s representativeness. Finally, land use surveys contain measurement error. Researchers commissioning zoning and land use surveys may receive duplicate responses with inconsistencies or different responses when multiple municipal officials submit survey responses. This is apparent in the WRLURI 2006 data, even for objective items such as minimum lot sizes. Accessing and parsing administrative source data offers more assurances against such inaccuracies.
Source data measures
The WRLURI 2006 data provide a starting set of target municipalities and land use regulations. While our measures do not need to be consistent with the WRLURI measures, as the WRLURI data do not necessarily reflect the true underlying land use regulations in all cases, the WRLURI data do offer a useful benchmark to evaluate our measures. Consequently, we tailored our outputs and analyses to be consistent with the WRLURI 2006 and WRLURI 2019 studies, as well as the NLLUS 2003 and 2019 data.
We focused on a range of measures that were likely to be present in municipal zoning and land use regulations. We categorise these measures into the following five subindices:
1. Explicit growth controls index (EGCI): this subindex represents measures that directly restrict the supply of housing units and correlate with less housing density and displacement of low-income households and households of colour (Levine, 1999). To keep consistent with WRLURI measures, we constructed this index as the sum of the six following growth controls, in the form of annual limits or caps on:
○ Single-family permits
○ Multi-family permits
○ Single-family units
○ Multi-family units
○ Multi-family dwellings
○ Multi-family dwelling units
2. Open space requirements index (OSRI): this subindex is a binary indicator of whether a municipality requires developers to dedicate or preserve some percentage of the lot(s) for open space or submit an in-lieu payment or fee, which can reduce the amount of land or funds available for development, particularly high-density development and affordable housing (Schmidt and Paulsen, 2009).
3. Minimum lot size index (MLSI): this subindex contains information regarding minimum lot sizes within municipal boundaries for all districts that allow residential uses. Larger minimum lot sizes restrict housing supply by implicitly restricting density (Fischel, 2015; Song, 2021). In accordance with the WRLURI measures, we capture the maximum-minimum lot size across all residential districts, but we construct the MLSI as a categorical variable (rather than a binary indicator of the top two minimum lot size categories), which could fall into one of the following mutually exclusive categories:
○ Less than one-half acre
○ Greater than or equal to one-half acre and less than one acre
○ Greater than or equal to one acre and less than two acres
○ Greater than or equal to two acres
4. Reviewing and/or approving authorities for projects not requiring rezoning (NZI): this subindex represents the sum of how many of the following governing bodies need to approve a project that does not require rezoning:
○ Planning commission
○ Local municipal council
○ County board
○ Environmental review board
○ Public health board
○ Design review board
5. Reviewing and/or approving authorities for rezoning (RZI): this subindex represents the sum of how many of the following governing bodies need to approve rezonings (including variances) or zoning amendments:
○ Planning commission
○ Local municipal council
○ Zoning board
○ County board
○ County zoning authority
○ Environmental review board
○ Town meeting
We also replicated two pre-existing measures contained in the NLLUS 2003 data:
6. Inclusionary zoning programmes index (IZPI): this subindex is a binary indicator for municipalities that operate inclusionary zoning programmes, including the use of in-lieu payments or fees often directed towards affordable housing trust funds (Pendall et al., 2006). 4
7. Maximum permitted density index (MPDI): this subindex indicates the maximum permitted number of dwelling units per acre (or square feet of lot area per acre) in any district that allows residential uses. It features the following categories, expressed in units per acre:
○ 0–4
○ 5–7
○ 8–14
○ 15–30
○ More than 30
Our data contain a subset of measures included in the overall WRLURI. While our data capture a narrower scope of land use regulations, they also include more information regarding inclusionary zoning. In particular, we observe maximum permitted density, identified as a variable of key importance in prior work on zoning and residential segregation (Rothwell and Massey, 2009, 2010).
Moreover, in addition to the regulations captured in the WRLURI and NLLUS data, we demonstrate the scalability of our approach by calculating four additional and important measures of zoning and land use policy:
8. Accessory dwelling unit index (ADUI): this subindex is a binary measure that captures whether or not a municipality permits accessory dwelling unit (ADU) construction in any residential district at all. 6 ADUs are increasingly seen as a way to promote moderate housing density and lower housing construction costs (Brinig and Garnett, 2013).
9. Maximum height index (MHI): recognising building height restrictions as a key method for restricting density (Brueckner and Singh, 2020), we collect the median and mode building height limits (in feet and storeys) across all districts in which residential uses are permitted, taking the median of these four numbers. 5
10. Minimum required parking index (MRPI): to capture another common mechanism for limiting density (Bronin, 2022), we compute the median and mode of parking requirements per residential unit in districts permitting residential uses and take the average of these two numbers.
11. Permitted multi-family housing index (MFPI): we constructed this subindex, the proportion of residential districts that allow multi-family housing development by right, as a way to capture a municipality’s preference for single-family zoning, one of the most recognised and explicit forms of exclusionary zoning (Sahn, 2022). 7
NLP
Zoning and land use regulations are ubiquitous and often available online. In fact, over 95% of the municipalities represented in the WRLURI 2006 sample have made their zoning and land use information publicly available online.
We began with these 2660 municipalities (out of the 2720 unique municipalities in the 2006 WRLURI sample). 8 When possible, we downloaded the entirety of the municipal codes to ensure that we captured all information pertinent to zoning and land use. 9 Despite many idiosyncrasies, most zoning and land use bylaws contain sections pertaining to district use restrictions, dimensional requirements and administrative processes, which is where the bulk of our extracted information was located. Ultimately, we downloaded and verified 2639 sets of municipal codes, representing about 97% of the original target sample. 10 The other 3% of municipalities do not make their codes publicly available online.
After verifying that the raw data contained pertinent zoning and land use information, we then utilised NLP to transform the text data into a usable file for analysis. Our pre-processing steps included lemmatisation of keywords, the filtering of unnecessary words, punctuation marks, or symbols and standardising numeric information (either stored as words or fractions) into digits. To circumvent the lack of standardisation in the text, we built a series of regular expression searches that matched on a set of original keywords for each of our target measures, capturing a specified number of preceding and following characters depending on the measure. 11 Through a process of trial-and-error, we also defined a range of words or phrases that, when present in a given text string, indicated a false positive match and prevented any further matching for the given text string.
In the text that follows, we describe our general text processing procedure. We provide specific details about the construction of each measure in the online Supplemental Material.
For our non-dimensional measures (EGCI, OSRI, NZI, RZI, IZPI and ADUI), we searched for additional keywords within a captured text string. Each of these keywords had an associated weight. We summed these weights of the matching keywords within a captured text string. When the sum of the weights was greater than the established threshold, we set the indicator column for a particular measure to 1. Otherwise, we set the indicator column to 0. See online Supplemental Material Table S2 for a list of keywords used and their associated weights.
The challenge with this approach consisted of limiting false-positive matches (by filtering inappropriate matches) and minimising false-negative non-matches (by capturing as many potential matches as possible) for the above measures. For instance, inclusionary zoning programmes have many names (e.g. inclusionary housing, incentive housing, density bonuses) and oftentimes entail different incentives. Validating this step required extensive quality checks to ensure that our set of matching words and stop-word and stop-phrase filters appropriately captured the breadth of key terms while also screening out inappropriate matches.
Our dimensional measures (MLSI, MPDI, MHI, MRPI, and MFPI indices) required a different process. Underlying information for these measures largely came from in-text district regulations, tables of dimensional requirements, or both (see Figures S1 and S2 in the online Supplemental Material for illustrations). In the case of the in-text references, our code searched within the original captured string, advancing to smaller strings of text surrounding the original keyword by passing stop-phrase flags and matching key contextual phrases to extract the relevant information. In the case of information stored in dimensional requirement tables, we built our code to detect the presence of these tables and extract the relevant dimensional requirements from them.
Verifying this process required overcoming a series of complications. For one, we accounted for information stored in various formats, such as minimum lot sizes stored in square feet or acres, maximum permitted densities stored as units per acre or square feet per acre and building height maximums stored in feet or storeys. Additionally, to compute a per-unit minimum lot size metric, this approach required us to restrict our matches to the relevant portions of residential district requirements. This entailed not only filtering out the same measures for non-residential districts (e.g. commercial, industrial), but also some dimensional requirements relevant to multi-family housing developments, planned unit developments (PUDs), cluster developments, or subdivisions, which often express requirements in terms of the entire development rather than each individual lot. Similarly, we filter out dimensional requirements such as floor space requirements and minimum unit sizes, which capture regulations that are similar but distinct from minimum lot sizes. We accomplished this through a series of checks for stop-phrases and filtering out of implausible dimensional requirement matches. 12
This process outputs all captured minimum lot sizes, maximum permitted densities, maximum building heights and per-unit parking requirements across all residential districts or instances in which residential units are permitted, whether by right or through conditional or special use. This latter condition captures all residential development that could be undertaken, including rural, overlay, or agricultural districts where large-lot single-family dwellings are permitted, capturing an important dimension through which municipalities could limit density (Fischel, 2015). 13
We calculate the proportion of districts permitting multi-family housing by right among residential and mixed-use districts by assigning each identified and eligible district a specific permitted use category for each mention of its by right permitted uses in the text. The permitted use categories include single-family, multi-family (i.e. four-family units or more), two-family (including attached single-family units, townhouses and triplexes) and mixed-use. 14 We assign each eligible district to a permitted use category based on its modal permitted use assignment. The sum of districts assigned to multi-family or mixed uses by right form the numerator of our measure, while the total number of assigned districts forms the denominator. See the online Supplemental Material for more details.
Post-processing
The resulting output of all text processing contained 32 binary variables and 7 continuous variables, corresponding to the previously discussed measures. 15 We iterated this process, checking the aggregate means across our measures for face validity in addition to systematically checking the validity of values in the highest categories (e.g. maximum minimum lot sizes of two or more acres). In the case of minimum lot size and maximum permitted density measures, we checked output values against the municipal codes among roughly a quarter of our sample to verify their accuracy. 16
We used the minimum lot size information collected to impute the maximum permitted density for two municipalities. Our code failed to capture minimum lot size information for about 4% of the sample. In these cases, we manually inputted the minimum lot size information. Additionally, four municipalities required manual corrections to their explicit growth control measures and roughly a quarter of the sample required manual formatting edits to the municipal codes to produce accurate minimum lot size and maximum permitted density measures. 17
To condense the information conveyed in these indicators, we created an index similar to the WRLURI but with slightly different measures. As in Gyourko et al. (2008), we employ principal components analysis of our 11 subindices to create our overall standardised index, which we label the Zoning Restrictiveness Index (ZRI). We extract the first principal component and create the ZRI (prior to standardising) as a weighted sum of the subindex values, with the resulting loadings serving as the weights. Additionally, to compare our output against the WRLURI output, we re-create the ZRI using only measures that exist in both our data and the WRLURI and NLLUS data. 18 See online Supplemental Material Table S3 for more details about the construction of the ZRI and its subindices.
To benchmark our results, we also create parallel ZRI indices, this time using values from the WRLURI and NLLUS data sources. We were able to merge 874 municipalities represented in both waves of the WRLURI survey, 591 municipalities in the 2019 NLLUS survey and 242 municipalities in both the NLLUS 2019 and the two WRLURI survey waves. 19 This resulted in three subsamples of municipalities with which to compare the results of our data creation. Additionally, as with both WRLURI studies, we specify survey weights to account for potential bias from missing data. Details on the construction of these weights along with ZRI summary statistics when applying them are contained in online Supplemental Tables S5 and S6, respectively.
We also aggregated municipalities to the MSA level and computed several unweighted MSA-level ZRI indices. These included the median ZRI across municipalities present in the data for a given MSA, the difference between the highest suburban (i.e. non-central city) ZRI and central city ZRI (similar to Trounstine, 2018), and a standardised sum of the two. We use this combined index as our primary MSA-level ZRI since it captures the average restrictiveness across an MSA, but also the extent to which there exist highly exclusive suburban municipalities relative to central city municipalities. We compare this measure to the standardised median of 2018 WRLURI scores for each MSA.
These comparisons allow us to gauge the consistency of our data with the two other established sources of national zoning and land use information. While a high degree of consistency is desirable, capturing the zoning and land use regulations accurately is more important. Here, our method improves upon previous methods as we compare our results against the actual regulations, particularly when values differ across datasets. A correlation coefficient of roughly 0.3 between the ZRI and WRLURI indicates that while both indices generally measure a similar concept, they are nonetheless different measures of zoning restrictiveness.
Results
Despite the complexity of building a zoning and land use database from administrative text data, we constructed a database with values comparable to the WRLURI 2018 and NLLUS 2019 data. Tables 1 and 2 display the results. We achieve similarity in aggregate means across both full samples for each indicator. Our data record higher fractions of municipalities with minimum lot sizes of one or two acres or more as well as higher permitted densities. 20 Nevertheless, the aggregate means for each minimum lot size and maximum permitted density indicator point to generally similar levels of permitted density across both datasets.
Comparison of National Zoning and Land Use Database output to Wharton Residential Land Use Regulatory Index 2018 output.
Comparison of National Zoning and Land Use Database output to National Longitudinal Land Use Survey 2019 output.
To assess municipality-specific congruence across datasets, we turn to municipal-level comparisons for the 874 matching WRLURI municipalities and 591 matching NLLUS municipalities, captured by the final three columns of Tables 1 and 2, respectively. Again, we note the similarities in mean values across the matching datasets and consistency with the trends in mean values across the full datasets. Moreover, the fraction of matching values across the datasets is large for most of the measures.
The measures with the lowest fraction of matching values are those most difficult to measure. For instance, most municipalities have open space regulations of some kind. Distinguishing between these and whether a municipality requires ‘mandatory dedication of space or open space (or fee in lieu of dedication)’ likely has led to measurement error in survey data (Gyourko et al., 2021: Online Appendices p. 5). Additionally, while we achieve notable matching rates for several individual reviewing or approving authority measures (see Table S7 in the online Supplemental Material), distinguishing between approval authority and responsibility to review and perhaps provide recommendations also has likely led to measurement error in previous data.
Tables 3 and 4 respectively display the 10 most and least restrictive MSAs, with at least 10 municipalities present in the data according to the ZRI and WRLURI. Table 3 illustrates that the MSAs our method identifies are consistent with prior research and reflect many previously identified in the literature as highly restrictive and suffering from acute housing shortages (Gyourko et al., 2021). Notably, our method identifies the Washington, DC MSA as the most restrictive MSA, followed by the New York City MSA, the most restrictive MSA according to the WRLURI. While these results suggest considerable overlap between the ZRI and WRLURI, there are still some notable differences. In particular, we identify the Detroit MSA, one of the least restrictive MSAs according to the WRLURI results, as the eighth most restrictive in our data. This result serves as a reminder that our multi-dimensional MSA-level ZRI is a measure of exclusionary zoning that is distinct from the WRLURI. In addition to capturing average MSA-wide zoning and land use restrictiveness (similar to the WRLURI), our MSA-level ZRI reflects divergent zoning and land use restrictiveness between a central city and suburban municipalities as reflected in the Detroit MSA.
Comparison of most restrictive MSAs between National Zoning and Land Use Database and Wharton Residential Land Use Regulatory Index 2018 sample.
Comparison of least restrictive MSAs between National Zoning and Land Use Database and Wharton Residential Land Use Regulatory Index 2018 sample.
The regions present in Table 4 differ from those in Table 3, with a more pronounced presence of MSAs from the South and Midwest regions. Nevertheless, there is regional overlap across the samples in Table 4, further supporting the conclusion that both methods are broadly similar in their characterisation of land use restrictiveness. Moreover, consistent with Gyourko et al. (2008, 2021), we assess the sociodemographic profile of municipalities along the distribution of ZRI scores, finding that zoning restrictiveness, on average, tends to be higher in more affluent, diverse and dense municipalities. The same is true for MSAs, though our results demonstrate that more restrictive MSAs also tend to be more segregated. Table S8 in the online Supplemental Material contains these results.
Overcoming the limitations of prior zoning data
Having validated our measures with previously existing zoning and land use data, we turn now to demonstrating the extent to which our method can be scaled to include additional measures and additional municipalities. First, we consider the additional measures that we created beyond those found in the WRLURI or NLLUS data (see Table S9 in the online Supplemental Material). Roughly 46% of municipalities in the sample permit some form of accessory dwelling unit (ADU). The average maximum building height across the sample registers at 34 feet and the average required number of parking spaces per residential unit is 1.8. Finally, on average, nearly 39% of residential districts in a municipality permit multi-family housing development by right, though this varies considerably across the sample, as 253 municipalities (approximately 10% of the sample) do not permit any multi-family housing by right in any residential district.
A major disadvantage of previously available national zoning data is under-coverage within MSAs. For instance, the median proportion of municipalities in a given MSA represented in the WRLURI data is 0.12. Since our approach can create zoning data for any municipality for which there is publicly available text data, we can scale our approach by expanding our sample. As an illustration, we downloaded all publicly available municipal codes for two often-referenced MSAs in the zoning literature: San Francisco and Houston (Gyourko et al., 2021). This allowed us to assess how sensitive our measures are to within-MSA under-coverage and to generate a more complete picture of within-MSA zoning regimes across MSAs.
We attained 100% coverage in the San Francisco MSA and 70% coverage in the Houston MSA (as opposed to 38% and 16%, respectively, in the WRLURI sample). While the coverage in the Houston MSA is notably lower, the municipalities with missing zoning text data tended to be very small and sparsely populated. The municipalities for which we do have coverage represent 99% of the Houston MSA population and 95% of its land area. 21
Figures 1 and 2 display maps with graduated colour scales indicating quintiles of the ZRI with all 11 subindices for municipalities within the San Francisco and Houston MSAs, respectively. One can immediately see that average zoning restrictiveness is higher in the San Francisco MSA. Nearly every municipality in the San Francisco MSA registers ZRI scores in the fourth or fifth quintile. Alternatively, many municipalities in the Houston MSA are characterised by relatively permissive zoning and land use regulations. A number of municipalities, including the central city Houston, have no zoning codes at all, but still implement policies governing land use to some degree. Additionally, while median zoning restrictiveness in the Houston MSA is lower than in the San Francisco MSA, there are still examples of exclusionary suburbs within the Houston MSA (e.g. Fulshear), demonstrating that some aggregate MSA measures of zoning restrictiveness can mask within-MSA exclusionary dynamics, which further motivates the use of our multi-dimensional MSA-level ZRI index.

ZRI (all measures) values for the complete San–Francisco–Oakland–Fremont MSA.

ZRI (all measures) values for the complete Houston–Sugar Land–Baytown MSA.
Finally, by collecting the total or near-universe of municipalities within an MSA, we can assess the robustness of our ZRI measures to excluding potentially significant portions of an MSA or elements of land use policy (see Table S10 in the online Supplemental Material). We find that adding measures of land use policy generally attenuates the ZRI results but including full sample coverage accentuates them. Neither addition substantially altered the direction or general magnitude of the results when compared to ZRI constructions with fewer measures or incomplete sample coverage. These results indicate that incorporating additional data into our text-processing can produce more precise and complete zoning and land use data, while still affirming that previous MSA-level results using a subset of these measures or samples can still be broadly consistent with these new data.
Discussion
Applying NLP to administrative data from across the country, we created the National Zoning and Land Use Database (NZLUD), finding that it compares well to survey data that have been the most frequent form of zoning and land use measurement up until this point. We have shown our process to be viable and accurate for the most readily identifiable forms of land use regulation in municipal codes. Our efforts, when combined with pre-existing and publicly available zoning data, can create a longitudinal dataset of zoning and land use policies using a process that is cost-efficient, replicable, accurate and less prone to measurement error, overcoming many of the challenges endemic to survey studies of this topic. Because we have made our data and code public, future iterations of these data can expand to other measures and municipalities not captured in the WRLURI or NLLUS samples. This process can be refreshed at regular intervals to provide consistent and accurate panel data of zoning and land use policies from across the United States.
Our contribution is notable for its use of NLP techniques to uncover zoning and land use policies across the country. It also complements similar efforts, including both state-wide and national studies that have manually processed zoning ordinances and maps (Bronin, 2022; Sahn, 2022). Similarly, Shanks (2021) uses NLP techniques to construct an index of zoning restrictiveness – the Natural Language Processing Zoning Stringency Index (NALPZ) – from municipal zoning bylaws in Massachusetts. The Urban Institute has utilised machine learning techniques to predict permitted densities from geospatial zoning data and zoning ordinances (Tyagi and MacDonald, 2019). Finally, Song (2021) estimates minimum lot sizes for a near-universe of single-family zoning districts across the US via a structural-break algorithm using parcel and Home Mortgage Disclosure Act (HMDA) data. Notwithstanding some differences, our approach generally leads to similar conclusions from these efforts regarding the pervasiveness of restrictive zoning and land use. 22
However, our approach stands out from these efforts for a number of reasons. First, our method produces a dataset with dozens of measures, as opposed to a single measure (Shanks, 2021; Song, 2021), which is important given how no single dimension of zoning and land use regulations can capture the full extent of zoning restrictiveness. Second, because our data can be checked against publicly available information, our method is falsifiable, which distinguishes it from approaches like Shanks (2021). Third, our data provide greater coverage, both geographically (by including more than just one state) and topically (our data apply to all residential development, not just single-family zoned properties). Finally, we provide the source code that can be adapted and supplemented, representing the most open-source and automated process of collecting nationwide zoning and land use data to date.
Our approach, despite its promise, is not without its limitations. First, while our approach is less time-intensive and more scalable compared to survey methods, collecting the input zoning and land use text information still requires a considerable amount of time and attention to ensure that all relevant zoning and land use information i present in the text corpus. Second, a non-trivial portion of the sample required manual formatting overrides to produce accurate information for certain measures, though future refinements to the code should be able to limit the need for this moving forward. Finally, our approach misses other important elements of zoning and land use information, namely, data stored in zoning maps that reveal how much land a municipality zones for multi-family housing (Ellickson, 2021). Incorporating zoning map data into our procedure could be a worthwhile avenue for future research.
Researchers and advocates should continue to investigate zoning and land use policies, not only owing to their role in inflating housing prices and preserving residential segregation but also as an avenue for possible reform. To assess and identify the disparate impact of zoning and land use policies and to actively monitor policy outcomes, researchers need more complete and timely access to data. Our contribution represents an important step forward in this regard.
Supplemental Material
sj-docx-1-usj-10.1177_00420980231156352 – Supplemental material for Using natural language processing to construct a National Zoning and Land Use Database
Supplemental material, sj-docx-1-usj-10.1177_00420980231156352 for Using natural language processing to construct a National Zoning and Land Use Database by Matthew Mleczko and Matthew Desmond in Urban Studies
Supplemental Material
sj-docx-2-usj-10.1177_00420980231156352 – Supplemental material for Using natural language processing to construct a National Zoning and Land Use Database
Supplemental material, sj-docx-2-usj-10.1177_00420980231156352 for Using natural language processing to construct a National Zoning and Land Use Database by Matthew Mleczko and Matthew Desmond in Urban Studies
Footnotes
Declaration of conflicting interests
Funding
Supplemental material
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
