Abstract

Agent-based microsimulation modeling techniques are adopted for urban system modeling mainly because of their capacity to address the complex interactions among individuals, households, and other urban elements. The performance of urban simulation models is largely dependent on the quality of the input data, which is generated through a population synthesis procedure. This study proposes a Bayesian network and generalized raking techniques for population synthesis. The Bayesian network is used to generate the synthetic population pool from the microsample, and generalized raking is used to fit the synthetic population with the control total. Some of the key features of the proposed population synthesis are as follows: accommodating heterogeneity based on both household and individual attributes; tackling missing/incomplete observations in the microsample; and generating a true synthesis of the population from the microsamples. A data-driven structure learning technique is adopted to generate effective and optimal structures among the heterogeneous households and individuals. This Bayesian network + generalized raking procedure is implemented to generate a 100% synthetic population at the smallest zonal level of dissemination area for the Central Okanagan region of British Columbia. The results suggest that capturing heterogeneity within the Bayesian network has tremendously benefitted the reconstruction process to efficiently and accurately generate a synthetic population from the available microsample. Finally, this population synthesis is developed as a component of the agent-based integrated urban model, currently under development at The University of British Columbia’s Okanagan campus.

Keywords

population synthesis, Bayesian network, generalized raking, agent-based model

The agent-based microsimulation (ABM) technique has emerged as a popular method to capture the complex interactions among transportation, urban form, and the environment within large-scale integrated urban modeling systems. Unlike traditional aggregated four-step models, these models simulate multifaceted decisions and their interactions while accommodating the heterogeneous behavior of individuals and households ( 1 , 2 ). However, these models are data hungry, primarily requiring highly disaggregate-level population data such as the age and gender of individuals and income and dwelling type of households for an entire region as input information. Although census data includes such information, they are not accessible because of privacy and confidentiality issues ( 3 ). Undertaking surveys to collect microsamples of an entire region is costly and often the survey samples represent a small portion of the population. For example, the travel survey for the Greater Vancouver Area represents 2.5% of its population ( 4 ). One of the alternative methods is to generate synthetic populations, which can be used as the input of the ABMs ( 5 , 6 ). This process is commonly known as population synthesis, which typically utilizes two sets of data: microsamples of the population and the aggregated marginal total at the spatial resolution of interest ( 7 ). However, a key challenge in generating a synthetic population is to combine a disaggregate sample to aggregated marginal information at smaller spatial units controlling for both individual- and household-level attributes.

A wide variety of approaches have been proposed to construct a synthetic population, such as iterative proportional fitting (IPF), iterative proportional updating (IPU), combinatorial optimization (CO), and fitness-based synthesis (FBS) ( 8 – 12 ). In general, the process of population synthesis consists of two stages: the fitting stage and the generation stage ( 13 ). In the fitting stage, a disaggregate microsample is weighted via an iterative reweighting procedure to match a set of target marginal distributions. In the generation stage, the actual population is derived in proportion to these weights ( 13 , 14 ). One of the conventional approaches for generating synthetic population is IPF, developed by Beckman et al. ( 8 ). The traditional IPF method focuses on fitting multiway contingency tables from the microsamples to marginal controls ( 7 ). A Monte Carlo simulation method is used to generate the synthetic population. Since its development, IPF has undergone significant evolution to improve the efficiency of the traditional IPF as well as to address limitations, including high-dimensional problems, categorization detail, convergence, the zero-cell problem, and memory consumption ( 5 , 9 , 13 , 15 ). In this line of research, Ye et al. ( 9 ) introduced an IPU technique—an advanced and modified version of IPF to control both individual- and household-level attributes. The IPU method experiences a few discrepancies observed in matching the contingency table to the marginal totals, especially in matching individual-level attributes ( 16 ). Similar to the IPU method, CO controls both household- and individual-level attributes. In this method, optimization techniques are used to minimize the difference between the synthetic population and marginal totals ( 10 , 11 ). Ma and Srinivasan ( 12 ) proposed FBS to allow multilevel controls by selecting variables based on maximum fitness. 
However, these conventional methods replicate the microsample to generate the synthetic population, resulting in over-dependence on the accuracy and availability of the microsamples ( 7 , 17 ).
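The fitting stage described above can be illustrated with a minimal sketch of two-way IPF. This is not the implementation of any cited study: the function name `ipf_2d`, the seed table, and the target marginals are all made-up illustrative values.

```python
def ipf_2d(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # copy so the seed is left untouched
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            if s > 0:
                table[i] = [x * target / s for x in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            if s > 0:
                for row in table:
                    row[j] *= target / s
        # The column step disturbs the row sums; stop once they match again.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical 2x2 cross-tabulation from a microsample and hypothetical zonal marginals.
seed = [[10.0, 20.0], [30.0, 40.0]]
fitted = ipf_2d(seed, row_targets=[40.0, 60.0], col_targets=[45.0, 55.0])
```

Because every step only rescales existing cells, a cell that is zero in the seed stays zero in the fitted table, which is one way to see the zero-cell problem and the dependence on the microsample noted above.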

In recent years, simulation-based approaches to generate synthetic populations have received considerable attention because of their advantages over deterministic models in handling high-dimensional problems and less dependency on the available microsamples ( 18 ). The simulation-based approach uses statistical learning techniques to learn the inherent joint distribution among the variables of interest and generate the synthetic population by sampling from the joint distribution. This approach integrates the fitting and generation stages together and does not require replication of the microsample to generate the synthetic population ( 7 , 17 ). For example, Farooq et al. ( 17 ) proposed a Markov chain Monte Carlo (MCMC) algorithm to generate the sample with Gibbs sampling. They used discrete choice models to estimate partial conditional distributions instead of full conditionals. Further, Saadi et al. ( 3 ) presented a hidden Markov model (HMM) approach for population synthesis, where a Markovian process was applied to sample all attributes sequentially. However, these probabilistic approaches were unable to capture the complex interactions among a set of variables without extensive expert knowledge ( 7 ).

Sun and Erath ( 7 ) introduced a Bayesian network (BN) approach for population synthesis. The idea is to use a data-driven framework to capture probabilistic relationships among the variables of interest. They proposed machine learning algorithms for learning BN structures to identify inherent joint distribution in a simple graphical model. A similar modeling approach was used by Zhang et al. ( 19 ) and Ilahi and Axhausen ( 20 ). Sun and Erath ( 7 ) used hierarchical structures for different members for two types of households: single-member households and multiple member households; however, it does not represent the heterogeneity that may exist across many different household types, such as couples with children and couples without children. More recently, Sun et al. ( 21 ) proposed an alternative hierarchical mixture modeling framework, where different household and individual hierarchies are defined as latent variables. However, the selection of latent variables remains unclear in this modeling approach. Further, Sun and Erath ( 7 ) used a data-driven approach only to learn the BN structure, which may overlook the fundamental inter-relationships and create unrealistic relationships among variables. For example, age cannot influence the gender of a person and gender cannot influence the age of a person ( 22 ). Zhang et al. ( 19 ) improved the efficiency of BN structure learning by integrating both expert experience and data-driven structure learning to maintain fundamental inter-relationships and ensure that unrealistic relationships are not created. However, most of these statistical models did not tackle the challenges related to incomplete/missing observations in the microsample ( 23 ).

Although statistical learning approaches can replicate the joint distribution of the original microsample and generate a synthetic population pool, it is challenging to match the marginal distributions of all variables at a micro-spatial resolution, such as census tracts or dissemination areas (DAs), for an entire region at the same time ( 19 , 20 ). As such, a fitting adjustment by using various IPF-based algorithms needs to be used ( 20 ). Saadi et al. ( 3 ) applied the HMM and IPF to generate a synthetic population. Ilahi and Axhausen ( 20 ) reported that the generalized raking (GR) algorithm outperforms the other IPF-based algorithms. Casati et al. ( 24 ) used the GR algorithm after the MCMC synthesis approach to match the marginal controls.

The contributions of this study in the population synthesis literature are three-fold: (1) accommodating heterogeneity based on both household and individual attributes; (2) tackling missing/incomplete observations in the microsample; and (3) generating a true synthesis of the population from the microsamples. As the households are heterogeneous because of diversity and complexity in different household structures and the living arrangements of individuals, this study accommodates heterogeneous inter-relationships within different household compositions capturing the complex structure at the household and individual levels. Heterogeneity is accommodated by considering several compositions of households, such as single-member, couple with children, couple without children, lone parent with children, and other types of households. The hierarchical structure of an individual’s role within a household is captured based on income and age. This study applies the BN to generate a synthetic population pool accounting for heterogeneity at the household and individual levels. While generating the synthetic population pool, the incomplete/missing observations in the microsample are treated by adopting a structural expectation–maximization (SEM) algorithm to perform a maximum likelihood (ML) estimation on the parameters from the incomplete dataset. The synthetic population pool is reconstructed from the available microsample instead of replicating the existing samples. A data-driven technique is adopted to generate effective and optimal structures, which is further validated by logical relationships. This process confirms that the BN structure retains the fundamental relationships and does not create unrealistic relationships among variables. A GR algorithm is used as the post-processing step to match the marginal total of household- and individual-level attributes at a micro-spatial resolution for an entire region. 
Finally, the proposed population synthesis procedure is implemented for the Okanagan region of British Columbia (BC), Canada. A 100% synthetic population is generated for this region using the public use microdata file (PUMF) as the microsample and zonal marginal total from the Canadian Census. The spatial resolution for this population synthesis procedure is the DA—that is, the smallest spatial unit for which the marginal total is available from Census Canada. This population synthesis will be used as the input for an integrated urban transportation, land use, and energy model, which is currently under development at The University of British Columbia’s Okanagan campus.

Method

The process of synthetic population generation in this paper is twofold: first, the BN is adopted as a probabilistic model to create a synthetic population pool by accommodating heterogeneity among the different household types; second, the GR algorithm is adopted as a post-processing step to match the marginal totals of both household- and individual-level attributes at a micro-spatial resolution. Figure 1 illustrates the framework for population synthesis used in this paper.

Figure 1.

Population synthesis framework.

A BN is a probabilistic graphical model in which conditional dependencies of a set of random variables are represented by a directed acyclic graph (DAG), $G = (V, A)$ ( 25 – 27 ). The graph $G$ is a mathematical object with a set of nodes $V = \{v_{1}, v_{2}, \dots, v_{n}\}$ and a set of arcs $A$, which are identified by pairs of nodes in $V$ and represent dependencies between a parent node and a child node. The dependencies between nodes are described by using conditional probability tables (CPTs). In the example shown in Figure 2, an arc from node $v_{1}$ to $v_{2}$ indicates that node $v_{2}$ is dependent on node $v_{1}$, in which $v_{1}$ is called a parent of node $v_{2}$, and $v_{2}$ is called a child of node $v_{1}$. Comprehensive detail on the theory and application of BNs can be found in Buntine ( 27 ) and Scutari ( 28 ).
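As a brief illustration of how the CPTs combine, the DAG implies the standard BN factorization of the joint distribution, in which each node depends only on its parents. The symbol $\mathrm{pa}(v_{i})$ for the parent set of $v_{i}$ is notation assumed here for illustration and is not taken from the paper:

$P(v_{1}, v_{2}, \dots, v_{n}) = \prod_{i = 1}^{n} P(v_{i} \mid \mathrm{pa}(v_{i}))$

For a single arc from $v_{1}$ to $v_{2}$, this reduces to $P(v_{1}, v_{2}) = P(v_{1}) P(v_{2} \mid v_{1})$, so the CPT of $v_{2}$ holds one probability distribution for each value of $v_{1}$.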

Figure 2.

An example of a Bayesian network.

The construction of a BN consists of two main steps ( 19 , 23 ): (a) structure learning to determine the dependencies of a set of random variables to construct a DAG; and (b) parameter estimation based on this DAG structure to learn the CPT at each node. To construct an optimal and effective network structure, the following techniques are used for BN structure learning ( 23 , 29 ): (i) expert knowledge; (ii) a machine learning algorithm; and (iii) a hybrid of expert experience and a machine learning algorithm. Commonly used machine learning algorithms for learning BN structures are the constraint-based algorithm, the score-based algorithm, and the hybrid algorithm ( 30 ). This paper uses a combination of a heuristic algorithm and score-based criteria to generate effective and optimal structures, which are further validated against logical relationships. This is critical to capture the fundamental relationships among variables of interest; for example, age cannot influence the gender of a person and vice versa. On the other hand, age can potentially cause or influence the marital status of a person. For the machine learning algorithm, a combination of the tabu search algorithm and the Akaike information criterion (AIC) score is used. In this process, the score-based algorithm starts with an empty structure and searches by adding, deleting, or reversing arcs; the final structure is the one with the highest score ( 19 , 29 ):

$AIC (G^{h} | D) = \log P (G^{h}, Θ | D) - d$ (1)

where $Θ$ is the ML estimate of the parameters given a hypothetical structure $G^{h}$; $D$ is the given set of observations; and $d$ is the number of degrees of freedom in $Θ$.
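To make the score concrete, the following is a minimal sketch of an AIC-style score for a discrete network. It assumes that the first term of Equation 1 is evaluated as the log-likelihood of the data at the ML estimates of the CPTs, which is the usual convention in score-based learning; the function name `aic_score` and the toy data are hypothetical.

```python
import math
from collections import Counter


def aic_score(data, parents, levels):
    """Log-likelihood at the ML estimates minus the number of free parameters."""
    loglik, n_params = 0.0, 0
    for node, pa in parents.items():
        # Counts of (parent configuration, node value) and of parent configuration.
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in data)
        marg = Counter(tuple(r[p] for p in pa) for r in data)
        # The ML estimate of P(node | parents) is joint / marg.
        loglik += sum(n * math.log(n / marg[cfg]) for (cfg, _), n in joint.items())
        # Free parameters: (levels - 1) for every parent configuration.
        n_cfg = math.prod(levels[p] for p in pa)
        n_params += (levels[node] - 1) * n_cfg
    return loglik - n_params


# Made-up records in which marital status clearly depends on age group.
data = (
    [{"age": "young", "mar": "single"}] * 40
    + [{"age": "young", "mar": "married"}] * 10
    + [{"age": "old", "mar": "single"}] * 10
    + [{"age": "old", "mar": "married"}] * 40
)
levels = {"age": 2, "mar": 2}
empty_score = aic_score(data, {"age": (), "mar": ()}, levels)
arc_score = aic_score(data, {"age": (), "mar": ("age",)}, levels)
```

On this toy data the structure with the arc from age to marital status scores higher than the empty structure, so a search that adds arcs one at a time would keep it. The tabu search itself is not reproduced here.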

The second step of a BN construction is the parameter learning from the BN structure. However, the microsamples often contain missing/incomplete observations, which require treatment. Most of the statistical models failed to directly analyze the data with missing/incomplete values ( 23 ). In this study, the missing/incomplete observations are handled by using the expectation–maximization (EM) algorithm ( 7 ). The EM algorithm mainly involves two steps: the expectation step (E-Step) to calculate the expected sufficient statistic of the missing value, and maximization step (M-Step) to calculate the new ML or maximum a posteriori (MAP) values ( 23 ). The current work implements a variant of the standard EM algorithm, the SEM algorithm, proposed by Friedman ( 30 ). The SEM algorithm combines the standard EM algorithm with a structure search by using various Bayesian scoring criteria for model selection. Further details on the theory of the EM and SEM algorithms can be found in Zou and Yue ( 23 ) and Friedman ( 30 ), respectively.
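The E-step and M-step can be sketched for the smallest possible case: a two-node network A → B with binary variables, where A is missing in some records. This is plain parameter EM on a fixed structure, not the SEM algorithm used in the study, and the function name `em_two_node`, the starting values, and the records are hypothetical.

```python
def em_two_node(records, n_iter=200):
    """EM for a network A -> B (both binary) when A is sometimes missing (None)."""
    p_a = 0.5                # initial guess for P(A=1)
    p_b = {0: 0.4, 1: 0.6}   # initial guess for P(B=1 | A=a)
    for _ in range(n_iter):
        n_a = {0: 0.0, 1: 0.0}    # expected number of records with A=a
        n_ab = {0: 0.0, 1: 0.0}   # expected number of records with A=a and B=1
        for a, b in records:
            if a is None:
                # E-step: posterior of the missing A given the observed B.
                like = {x: (p_b[x] if b == 1 else 1 - p_b[x]) for x in (0, 1)}
                w1 = p_a * like[1]
                w0 = (1 - p_a) * like[0]
                post = {1: w1 / (w1 + w0), 0: w0 / (w1 + w0)}
            else:
                post = {a: 1.0, 1 - a: 0.0}
            for x in (0, 1):
                n_a[x] += post[x]
                n_ab[x] += post[x] * b
        # M-step: re-estimate the parameters from the expected counts.
        p_a = n_a[1] / len(records)
        p_b = {x: n_ab[x] / n_a[x] for x in (0, 1)}
    return p_a, p_b


# Made-up records (a, b); the last 20 have A missing.
records = (
    [(1, 1)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 30
    + [(None, 1)] * 10 + [(None, 0)] * 10
)
p_a_hat, p_b_hat = em_two_node(records)
```

The incomplete records are not discarded: each contributes fractional counts to both values of A, weighted by the current posterior.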

In the post-processing stage, an IPF-based algorithm, specifically, a GR algorithm, is adopted to fit the pool of synthetic population with the marginal total at the micro-spatial resolution of an entire region. This study adopts a GR algorithm as the previous studies suggested that the GR algorithm is faster and it outperforms the other IPF-based algorithms ( 20 ).
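As a rough illustration of this post-processing step, the sketch below reweights a pool of units so that the weighted category totals match the zonal controls. It implements only the multiplicative (raking-ratio) case with categorical controls at a single level; the distance function and solver used in the study are not stated here, and the function name `rake_weights`, the pooled units, and the control totals are hypothetical.

```python
def rake_weights(units, controls, n_iter=100):
    """Adjust one weight per pooled unit so weighted category totals match the controls."""
    weights = [1.0] * len(units)
    for _ in range(n_iter):
        for attr, targets in controls.items():
            # Current weighted total of every category of this attribute.
            totals = {cat: 0.0 for cat in targets}
            for w, u in zip(weights, units):
                totals[u[attr]] += w
            # Multiplicative update towards the control total of each category.
            weights = [
                w * targets[u[attr]] / totals[u[attr]]
                for w, u in zip(weights, units)
            ]
    return weights


# Hypothetical pooled households and hypothetical control totals for one zone.
units = [
    {"dtype": "house", "tenure": "owned"},
    {"dtype": "house", "tenure": "rented"},
    {"dtype": "apartment", "tenure": "owned"},
    {"dtype": "apartment", "tenure": "rented"},
]
controls = {
    "dtype": {"house": 60.0, "apartment": 40.0},
    "tenure": {"owned": 70.0, "rented": 30.0},
}
weights = rake_weights(units, controls)
```

Controlling household- and individual-level totals at the same time would additionally require linking each household weight to the persons it contains, which this sketch does not attempt.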

Study Area and Data

Study Area

The study area for this exercise is the Okanagan region, located in the southern interior of the province of BC, Canada. The region includes the following five major cities: Kelowna, West Kelowna, Lake Country, Vernon, and Peachland (Figure 3). For interior BC, this is the most densely populated region, with approximately 384 people/square kilometer and consisting of over 60% of the population of the Okanagan region. While the population synthesis framework used in this paper can be applied to any other region, this paper uses these five cities as a case study. The synthetic population is generated at the DA level to capture micro-spatial resolution. The study area is divided into 293 DAs, and each DA represents “a small geographic unit with an average population of 400 to 700 persons” ( 31 ).

Figure 3.

Study area.

Data

In this paper, the data used for generating the synthetic population includes the 2016 Canadian Census hierarchical PUMF as the microsample, and the 2016 Canadian Census information at the DA level as the control total. However, the microsample for the Okanagan region is not available in the PUMF data. As such, the microsamples outside of the Vancouver Census metropolitan area (CMA) within BC are considered the microsample for the Okanagan region, which includes 21,270 individuals and 9210 households. The aggregated marginal total from the 2016 Canadian Census information for the study area includes 218,501 individuals and 91,690 households. The spatial resolution of the Census information considered in this study is at the smallest zonal level of the DA. The household- and individual-level attributes and their categories used in this study are presented in Table 1.

Table 1.

Individual- and Household-Level Attributes

Variable name	Definition	Categories
Individual-level attributes
Gen	Gender (2 categories)	Female; male
Age	Age (13 categories)	Under 9; 10–14; 15–19; 20–24; 25–29; 30–34; 35–39; 40–44; 45–49; 50–54; 55–64; 65–74; 75 and above
Mar	Marital status (4 categories)	Single; Married; Common law; Separated, divorced, or widowed (and not living common law)
Edu	Education (9 categories)	No certificate; Secondary (high) school; Trades certificate; Apprenticeship certificate; College; University below bachelor level; Bachelor; University above bachelor level; Not applicable^a
Emp	Employment (4 categories)	Employed; Unemployed; Not in the labor force; Not applicable^a
PInc	Person income × C$1000 (12 categories)	Under C$10; C$10–C$19; C$20–C$29; C$30–C$39; C$40–C$49; C$50–C$59; C$60–C$69; C$70–C$79; C$80–C$89; C$90–C$99; C$100 and over; Not applicable^a
Occ	Occupation (11 categories)	Management; Business, finance, and administration; Natural and applied sciences and related; Health; Education, law, and social, community and government services; Art, culture, recreation, and sport; Sales and service; Trades, transport, and equipment operators; Natural resources, agriculture, and related production; Manufacturing and utilities; Not applicable^a
Household-level attributes
Hhs	Household size (5 categories)	1; 2; 3; 4; 5 or more
HInc	Household income × C$1000 (14 categories)	Under C$10; C$10–C$19; C$20–C$29; C$30–C$39; C$40–C$49; C$50–C$59; C$60–C$69; C$70–C$79; C$80–C$89; C$90–C$99; C$100–C$124; C$125–C$149; C$150–C$199; C$200 and over
Tenure	Tenure type (2 categories)	Owned; Rented and band housing
Built	Period of dwelling construction (7 categories)	1960 or before; 1961–1980; 1981–1990; 1991–2000; 2001–2005; 2006–2010; 2011–2016
Rooms	Number of rooms (5 categories)	1–4; 5; 6; 7; 8 or more
Dtype	Dwelling type (3 categories)	Single-detached house; Apartment; Other dwelling

^a Persons less than 15 years of age.

Households are heterogeneous because of diversity and complexity in different household structures and the living arrangements of individuals. The heterogeneous inter-relationships at both household and individual levels could be unique across household structures. As such, to model heterogeneous inter-relationships within different household compositions to capture the complex structure at both household and individual levels, the following pre-processing is done.

(i) Household structures: the household compositions/structures used in this study are consistent with the household structures identified in the 2016 PUMF data. The following five household compositions are considered in this study to capture heterogeneous inter-relationships at the household level: single-member household, couple household without children, couple household with children, lone-parent household with children, and other household. The “other household” category is defined as a household that does not fall into any specific category. The data consists of 13 variables (i.e., 7 individual-level and 6 household-level variables) with 2665 observations for single-member households, 2870 observations for couple households without children, 1956 observations for couple households with children, 730 observations for lone-parent households with children, and 989 observations for other households.

(ii) Living arrangements of individuals: although there is one reference person per household in the PUMF data, the Canadian Census identifies this reference person based on who was listed first on the questionnaire ( 32 ). As such, it does not reflect the actual living arrangements of individuals within a household. Following Casati et al. ( 24 ), the reference individuals are identified based on income and age. Specifically, the individual with the highest income (or the one with the highest age, if there was a conflict) is identified as the reference person of the household, and the individual with the minimum age difference from the reference person is identified as the spouse of that household ( 7 ).
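The role assignment in (ii) can be sketched as follows for a couple household. The function name `assign_roles` and the record fields `id`, `income`, and `age` are hypothetical, and the numeric values are made up; for household types without a spouse, the caller is assumed to ignore the second return value.

```python
def assign_roles(members):
    """Return (reference person, spouse) for one household."""
    # Reference person: highest income, ties broken by highest age.
    ref = max(members, key=lambda m: (m["income"], m["age"]))
    others = [m for m in members if m is not ref]
    # Spouse: the remaining member closest in age to the reference person.
    spouse = min(others, key=lambda m: abs(m["age"] - ref["age"])) if others else None
    return ref, spouse


# Made-up household: two adults with equal income and one child.
household = [
    {"id": 1, "income": 50, "age": 40},
    {"id": 2, "income": 50, "age": 45},
    {"id": 3, "income": 0, "age": 12},
]
ref, spouse = assign_roles(household)
```

Here the two adults tie on income, so the older one (id 2) becomes the reference person and the other adult (id 1) is identified as the spouse.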

Results

This section presents the results of the synthetic population. As indicated earlier, this study implemented the BN + GR procedure to generate a 100% synthetic population at the smallest zonal level of DA for the Okanagan region of BC, Canada. Firstly, the BN structures learned from the different household compositions are discussed. Then, the performance of the simulated synthetic population pool generated from the BNs is evaluated based on various metrics. Finally, this section is concluded with a quantitative assessment of the synthetic population during the post-processing step of the proposed framework.

All tests were performed on a personal computer with Intel (R) Xeon (R) W-1270 CPU (8 cores, 3.40 GHz) and 16.0 GB RAM in the Operating System of Microsoft Windows 10 Enterprise (x64).

Bayesian Network Structures for Different Household Types

To capture the unique heterogeneous inter-relationships at both household and individual levels, BN structures were developed for five different household compositions. In addition, following Casati et al. ( 24 ), the hierarchical structure of an individual within a household was captured in the BN. For example, within a couple household without children, the first agent or reference person was identified based on the highest income or the highest age, followed by the second agent who was a spouse of the first agent. On the other hand, within a couple household with children, children were added as a subsequent agent to the BN based on their age. Following Sun and Erath ( 7 ), a series of BNs was developed to integrate household and individual attributes to the model. For instance, the BN for single-member households was developed based on the household attributes and the reference person (in this case, the only member was defined as the reference person) of the household. On the other hand, household attributes, reference person, and spouse attributes were used to develop the BN for couple households without children. For the remaining households (couple household with children, lone-parent household with children, and other household), the number of household members was added to the BN with the household attributes and reference person and/or spouse attributes. Furthermore, a separate BN was developed to characterize the inter-relationships of the reference person and spouse attributes to the rest of the members of that household based on the number of household members. Figure 4 illustrates the BNs for the reference person only (within a single-member household, lone-parent household with children, and other household) and the reference person and their spouse only (within a couple household without children and couple household with children) along with the household attributes. BN learning was implemented using the R-project and “bnlearn” package ( 33 , 34 ). 
For the logical relationships, the “whitelist” and “blacklist” features within the “bnlearn” package were used. Finally, a pool of synthetic population was drawn from the BN using the forward sampling technique, which reproduces a joint distribution that resembles the original microsample.
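Forward sampling can be sketched in a few lines: each node is drawn after its parents, using the row of its CPT that matches the parents' sampled values. The study itself used R and “bnlearn”; the Python code below is only an illustration, and the function name `forward_sample`, the two-node network, and the CPT values are hypothetical.

```python
import random


def forward_sample(order, parents, cpts, rng):
    """Draw one record by sampling each node after its parents."""
    record = {}
    for node in order:  # order must list parents before their children
        key = tuple(record[p] for p in parents[node])
        dist = cpts[node][key]  # dict: value -> probability
        values = list(dist)
        record[node] = rng.choices(values, weights=[dist[v] for v in values])[0]
    return record


# Hypothetical network: age group -> marital status, with made-up CPTs.
order = ["age", "mar"]
parents = {"age": (), "mar": ("age",)}
cpts = {
    "age": {(): {"young": 0.5, "old": 0.5}},
    "mar": {
        ("young",): {"single": 0.8, "married": 0.2},
        ("old",): {"single": 0.2, "married": 0.8},
    },
}
rng = random.Random(0)
pool = [forward_sample(order, parents, cpts, rng) for _ in range(20000)]
```

Every sampled record is a new combination drawn from the learned distribution rather than a copy of a microsample record, which is what distinguishes this step from replication-based synthesis.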

Figure 4.

Partial Bayesian network structures: (a) single-member household, (b) couple household without children, (c) couple household with children, (d) lone-parent household with children, and (e) other household.

This study assumes the relationships implied by the estimated hierarchical structures apply to all households of a particular type (e.g., couple household without children). This is, however, not uncommon in the literature. Sun and Erath ( 7 ), Zhang et al. ( 19 ), and Ilahi and Axhausen ( 20 ) applied the estimated hierarchical structures to all households of that particular type. The Bayesian structures developed in this study capture local conditional probabilities to identify the optimal structure. During the BN structure learning process, a data-driven technique was adopted. The developed BN structures were further validated against fundamental relationships and/or “common sense” to avoid unrealistic relationships among individual and household attributes. For example, an individual’s age cannot influence their gender, and an individual’s gender cannot influence their age; on the other hand, age can potentially cause or influence the marital status, education, and occupation of a person. The developed BN structures suggest both common and heterogeneous relationships across the different household types. For example, in general, age influences marital status, education, and occupation across the different household types. This is expected, as education level is directly related to the age of an individual. For example, a six-year-old is more likely to be in elementary school. In the case of a couple household without children and a couple household with children, there is a strong relationship between the reference person’s age and spouse’s age for both groups. It is also noticed that the relationship between the reference person’s gender and spouse’s gender is the same for those two household compositions. These phenomena validate the retention of logical relationships within each BN structure.

In contrast, significant differences have been observed among different household types, which indicates the existence of heterogeneity across the different household structures and the hierarchical structure of an individual’s role within a household. For example, households with children (i.e., couple household with children and lone-parent household with children) tend to have a higher number of rooms, suggesting the dwelling unit is influenced by household size. This reflects the likelihood that the number of rooms in a dwelling unit is linked to the number of persons living in that household. The relationships retained in the developed BN structures are also consistent with the BNs developed by other researchers. For example, the BN structures for multiple member households developed by Sun and Erath ( 7 ) establish that household size is determined by spouse age and, further, spouse age is determined by owner age. This study also retains similar relationships for multiple member households (e.g., couple household with children).

Assessing the Synthetic Population Pool Sampled From the BNs

A pool of synthetic population (a total of approximately 1,000,000 individuals and approximately 440,000 households) was drawn from the BNs using forward sampling to recreate the original microsample. The same proportions of different household compositions in the PUMF data were used to create the pool. For example, the couple household without children category represented the highest proportion of households (31.2%) in the microsample. To keep the same proportion in the synthetic population pool, 135,000 households were sampled from this category. During this process, the reference person (i.e., single-member household and other households), the reference person and the spouse (i.e., couple household without children and couple household with children) and the lone parent (i.e., lone-parent household with children) were sampled at the same time. To create the rest of the members within the household (i.e., couple household with children, lone-parent household with children, and other household), the sampled household size attribute along with the BN for the other members were used to generate the full synthetic household. This process preserves the attributes of household members corresponding to their household attributes.

This study used graphical analysis to visualize the performance of the BNs in preserving the joint distribution among household- and individual-level attributes in generating the synthetic population pool. The analysis suggests that the majority of the household attributes in the synthetic pool are a very good fit with the microsample, as shown in Figure 5. For example, the difference between the synthetic pool and the PUMF data is within −1% to 1% for all categories within dwelling type, tenure type, period of dwelling construction, number of rooms, and household size attributes. It was also observed that the difference between the synthetic pool and the PUMF data increases with the number of categories within an attribute. The marginal distributions of individual-level attributes also demonstrate a similar conclusion, as shown in Figure 6. However, household-level attributes perform better than individual-level attributes. This could be attributed to the individual-level attributes consisting of more categories than household-level attributes, which adds more intricacy to the BNs. To examine further, different multivariate joint distributions among household-level attributes (dwelling type × household size, dwelling type × household income, dwelling type × tenure type, dwelling type × period of dwelling construction, dwelling type × number of rooms, and dwelling type × household size × household income × tenure type × period of dwelling construction × number of rooms) were plotted to analyze the scalability. It is evident from Figure 7 that the multivariate joint distributions of household-level attributes in the synthetic pool are highly acceptable. For instance, the observed $R^{2}$ value for most of the multivariate joint distributions is very close to 1. 
It also demonstrates that capturing heterogeneity within the BN has tremendously benefitted the reconstruction process to efficiently and accurately generate a matching population pool from the available microsample.
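The marginal and joint comparisons described above amount to comparing category shares between the microsample and the synthetic pool. A minimal sketch of that comparison is given below; the function names `shares` and `max_abs_diff_pp` and the records are hypothetical, and the study's own figures are not reproduced.

```python
from collections import Counter


def shares(records, attrs):
    """Share of records in each observed combination of the given attributes."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    n = len(records)
    return {cell: c / n for cell, c in counts.items()}


def max_abs_diff_pp(sample, pool, attrs):
    """Largest absolute gap between pool and sample shares, in percentage points."""
    s, p = shares(sample, attrs), shares(pool, attrs)
    return max(abs(p.get(c, 0.0) - s.get(c, 0.0)) * 100 for c in set(s) | set(p))


# Made-up microsample and synthetic pool with one attribute.
sample = [{"dtype": "house"}] * 60 + [{"dtype": "apartment"}] * 40
synthetic = [{"dtype": "house"}] * 610 + [{"dtype": "apartment"}] * 390
gap = max_abs_diff_pp(sample, synthetic, ["dtype"])
```

Passing several attribute names in `attrs` compares a joint distribution instead of a marginal one; as the number of cells grows, each cell holds fewer records, which is consistent with the larger differences reported for attributes with many categories.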

Figure 5.

Household-level marginal distribution comparison: (a) dwelling type, (b) household size, (c) household income, (d) tenure type, (e) period of dwelling construction, and (f) number of rooms.

Figure 6.

Individual-level marginal distribution comparison: (a) gender, (b) age, (c) marital status, (d) education, (e) employment, (f) person income, and (g) occupation.

Figure 7.

Household-level joint distributions: (a) dwelling type × household size, (b) dwelling type × household income, (c) dwelling type × tenure type, (d) dwelling type × period of dwelling construction, (e) dwelling type × number of rooms, and (f) dwelling type × household size × household income × tenure type × period of dwelling construction × number of rooms.

To investigate the underlying association between household- and individual-level attributes, the model-based Cramér’s V was estimated (Equation 2), which was adopted from Sun et al. ( 21 ). This indicator measures the strength of the pairwise association among variables. Figure 8 illustrates the pairwise Cramér’s V among both household-level attributes (dwelling type, household size, household income, tenure type, period of dwelling construction, and number of rooms) and individual-level attributes (gender, age, marital status, education, employment, person income, and occupation). The comparative analysis suggests that the majority of the pairwise values computed from the synthetic pool closely match those computed using the microsample. For example, the $ρ$ values between household size and age are 0.31 and 0.30 when using the microsample and synthetic pool, respectively. This indicates that the synthetic population pool generated by the BNs was able to capture the underlying inter-relationships between individuals and the households they belong to. However, a few pairwise associations show a higher error percentage because of their higher number of categories:

$ρ_{ij} = \sqrt{\frac{1}{\min \{d_{i}, d_{j}\} - 1} \sum_{c_{i} = 1}^{d_{i}} \sum_{c_{j} = 1}^{d_{j}} \frac{{(ϕ_{c_{i} c_{j}} - ϕ_{c_{i}}^{- (i)} ϕ_{c_{j}}^{- (j)})}^{2}}{ϕ_{c_{i}}^{- (i)} ϕ_{c_{j}}^{- (j)}}}$ (2)

where $ρ_{ij}$ represents a model-based version of Cramér’s V, which ranges from 0 to 1. The two variables are independent when $ρ_{ij} \approx 0$.
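The calculation in Equation 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation. It assumes that $ϕ_{c_{i} c_{j}}$ is the joint probability of a category pair and that the two single-index terms are the corresponding marginal probabilities; the function name `model_based_cramers_v` and the example tables are hypothetical.

```python
import numpy as np

def model_based_cramers_v(joint):
    """Pairwise association in [0, 1] from a two-way table of joint
    probabilities (or counts, which are normalized here).

    Assumption: the single-index terms of Equation 2 are the row and
    column marginals of the joint table.
    """
    phi = np.asarray(joint, dtype=float)
    if phi.ndim != 2 or min(phi.shape) < 2:
        raise ValueError("need a two-way table with at least 2 categories per variable")
    phi = phi / phi.sum()                      # normalize to probabilities
    row = phi.sum(axis=1, keepdims=True)       # marginal of variable i
    col = phi.sum(axis=0, keepdims=True)       # marginal of variable j
    expected = row * col                       # joint under independence
    mask = expected > 0                        # skip empty categories
    chi = ((phi - expected)[mask] ** 2 / expected[mask]).sum()
    return float(np.sqrt(chi / (min(phi.shape) - 1)))

# Independent variables: the joint table is the outer product of the marginals.
independent = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])
# Perfectly associated variables: all mass on the diagonal.
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])

print(model_based_cramers_v(independent))
print(model_based_cramers_v(dependent))
```

Applying the same function to a table built from the microsample and to one built from the synthetic pool would give the kind of paired values compared in Figure 8.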

Figure 8.

Joint distribution of both household- and individual-level attributes: (a) public use microdata file and (b) synthetic pool.

Assessing the Results of the Synthetic Population

In the post-processing stage, an IPF-based algorithm, specifically a GR algorithm, was adopted to fit the synthetic population pool to the marginal totals at the micro-spatial resolution across the entire region. The GR algorithm was chosen because previous studies suggest that it is faster than, and outperforms, the other IPF-based algorithms ( 20 ). It was implemented using the R package mlfit ( 35 ). One of the challenges with these IPF-based algorithms is that they produce fractional weights for each household and individual. This paper therefore used the truncate, replicate, sample (TRS) method ( 36 ) for integerization and expansion: the integerization component converts fractional weights to integers, and the expansion component replicates the individuals, households, or both, based on the integer weights. A 100% synthetic population was generated for the year 2016 by using the 2016 Canadian Census hierarchical PUMF as the microsample and the 2016 Canadian Census information as the control totals. To assess the performance of the BN + GR, the synthetic population was generated at two spatial resolutions: (a) the region level and (b) the DA level.
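The integerization and expansion steps described above can be illustrated with a short Python sketch. It is a simplified reading of the truncate, replicate, sample idea and not the implementation used in this study, which was carried out in R. It assumes that the "sample" step draws the remaining records without replacement, with probability proportional to the fractional remainders; the function names `trs_integerize` and `expand_records` and the example weights are hypothetical.

```python
import numpy as np

def trs_integerize(weights, seed=None):
    """Convert fractional weights to integers whose sum equals the
    rounded sum of the fractional weights."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    int_w = np.floor(w).astype(int)            # truncate: keep the integer part
    remainder = w - int_w                      # fractional part left over
    deficit = int(round(w.sum())) - int(int_w.sum())
    if deficit > 0:
        # sample: top up `deficit` records, favoring large remainders
        chosen = rng.choice(len(w), size=deficit, replace=False,
                            p=remainder / remainder.sum())
        int_w[chosen] += 1
    return int_w

def expand_records(int_weights):
    """Replicate: repeat each record index as many times as its integer weight."""
    return np.repeat(np.arange(len(int_weights)), int_weights)

fractional = [1.2, 0.8, 2.5, 0.5]              # hypothetical fitted weights
integer_weights = trs_integerize(fractional, seed=0)
synthetic_index = expand_records(integer_weights)
print(integer_weights.sum(), len(synthetic_index))
```

Each integer weight differs from its fractional weight by less than one, which is why the resulting error is small for a whole region but can matter in zones with few households.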

At the region level, six household-level attributes (i.e., dwelling type, household size, household income, tenure type, period of dwelling construction, and number of rooms) and three individual-level attributes (i.e., gender, age, and marital status) were used to control the marginal totals. The results suggest that the BN + GR achieved high accuracy in matching the synthetic population with the aggregated marginal totals. For example, a population of 218,981 was generated by using the proposed population synthesis approach, in comparison to the Census total population of 218,501 in the study area. This is a slight over-representation of the Census total population by 0.22%. Similarly, a total of 91,745 households were generated, which is also a slight over-representation of the Census total households by 0.06%. Furthermore, the distributions of the household- and individual-level attributes in the synthetic population and Census totals were investigated. The analysis suggests that the majority of the household attributes in the synthetic population fit the Census total very well, as shown in Figure 9. For example, the difference between the synthetic population and the Census total is within −1% to 1% for most of the categories within the dwelling type, household size, tenure type, period of dwelling construction, and number of rooms. However, the difference increases with the number of categories within an attribute, as is the case for household income. Similar results were observed for the individual-level attributes, as shown in Figure 10. For example, the difference between the synthetic population and the Census total is within −1% to 1% for most of the categories within gender and marital status. However, one of the categories within the individual age attribute shows a difference greater than 4.5%.
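As a worked check of the population over-representation, using only the two population totals reported above:

$\frac{218{,}981 - 218{,}501}{218{,}501} \times 100 = \frac{480}{218{,}501} \times 100 \approx 0.22\%$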

Figure 9.

Comparison between Census and synthetic population for household-level attributes: (a) dwelling type, (b) household size, (c) household income, (d) tenure type, (e) period of dwelling construction, and (f) number of rooms.

Figure 10.

Comparison between Census and synthetic population for individual-level attributes: (a) gender, (b) marital status, and (c) age.

To assess the performance of the proposed population synthesis approach to capture the micro-spatial resolution, the synthetic population was also generated at a DA level within the study area, which consists of 293 DAs. Dwelling type and gender were used to control marginal totals at the DA level. In this study, absolute error percentage (Equation 3) between the generated synthetic population and Census total was chosen to investigate the performance of the estimated synthesized population.

$\text{Absolute error percentage (\%)} = \frac{\sum_{i = 1}^{N} | x_{i} - {\hat{x}}_{i} |}{\sum_{i = 1}^{N} x_{i}} \times 100$ (3)

where $N$ represents the number of categories within an attribute of interest; $x_{i}$ represents the known marginal total for category $i$; and ${\hat{x}}_{i}$ represents the estimated marginal total for category $i$.
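Equation 3 can be written as a small Python function. This is a minimal sketch for illustration; the function name `absolute_error_percentage` and the example counts are hypothetical and are not taken from the study data.

```python
def absolute_error_percentage(known, estimated):
    """Equation 3: summed absolute category errors as a percentage
    of the summed known marginal totals for one attribute."""
    if len(known) != len(estimated):
        raise ValueError("both inputs need one value per category")
    total = sum(known)
    if total == 0:
        raise ValueError("known marginal totals sum to zero")
    abs_error = sum(abs(x - x_hat) for x, x_hat in zip(known, estimated))
    return 100.0 * abs_error / total

# Hypothetical zone with two categories of one attribute.
census_counts = [60, 40]       # known marginal totals
synthetic_counts = [57, 43]    # estimated marginal totals
error = absolute_error_percentage(census_counts, synthetic_counts)
print(error)
```

Because the denominator is the zone's own total, the same absolute miscount produces a larger percentage in a small zone than in a large one.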

Figure 11 shows the absolute error percentages between the generated synthetic population and Census total at the DA level for the gender and dwelling type attributes. It is apparent from Figure 11 that the estimates for the gender categories are satisfactory within most of the DAs. For example, approximately 74% of the DAs showed an error percentage within a range of 0%–4.5% for the gender categories. Only 2% of the DAs showed a difference greater than 10%. However, dwelling type showed higher error percentages than the gender attribute, which is in line with the above-mentioned finding that the difference increases with the number of categories within an attribute. Overall, in comparison with the region-level marginal totals, relatively higher error percentages were observed at the DA level. This may be because of the zero-cell problem in the DAs, which negatively affects the accuracy of the results. In addition, the use of the integerization procedure to convert fractional weights to integers at each DA contributes to the higher error percentage.

Figure 11.

Difference between Census and synthetic population at the dissemination area level: (a) absolute error percentage (gender) and (b) absolute error percentage (dwelling type).

Furthermore, a comparison between the Census and the synthetic population by DA size, based on total population, is presented in Figure 12. It is evident from the box-and-whisker plots that the absolute difference increases with DA size. This is consistent with the explanation above that the integerization procedure contributes to the error, with its contribution to the absolute difference growing as the size of the DA increases.

Figure 12.

Absolute difference by dissemination area size: (a) gender and (b) dwelling type.

Finally, the generated synthetic population and households were distributed in the five major cities within the study area, as shown in Figure 13. This demonstrates that the population density and household density vary dramatically within the region, and the City of Kelowna accounts for the majority of the population (and households) compared to the other cities. For example, a total of 53,960 households and 127,078 individuals were generated in the City of Kelowna, which represents approximately 37% of the households and approximately 41% of the population within the study area, respectively. In addition, it is evident from the figure that the urban core within each city represents the densely populated areas. This indicates that the synthetic population, generated from the BN + GR, resembles the original density in the study area.

Figure 13.

Comparison between population and household distribution at the dissemination area (DA) level: (a) observed population at the DA level, (b) synthetic population at the DA level, (c) observed household at the DA level, and (d) synthetic household at the DA level.

Conclusions

This study adopts a BN approach and GR technique to generate a 100% synthetic population at the smallest zonal level of DA for the Central Okanagan region of BC, Canada. The BN technique is adopted to tackle the following challenges in a population synthesis procedure: incorporate heterogeneity among different population groups based on both household and individual attributes; handle the missing/incomplete observations in the microsample; and generate a synthetic population pool resembling the original microsample. A data-driven technique is adopted during the BN structure learning process, which is further validated to confirm that the BN retains the fundamental relationships among household and individual attributes. Following the generation of the synthetic population pool, a GR algorithm is adopted to fit the population with the control total at the DA level.

The implementation results suggest that capturing heterogeneity within the BN has tremendously benefitted the reconstruction process to efficiently and accurately generate a matching population pool from the available microsample. The synthetic population pool generated from a series of BNs is also able to capture the underlying inter-relationships between individuals and the households they belong to. Finally, the post-processing GR step of the BN + GR procedure achieves high accuracy in matching the synthetic population pool with the aggregated marginal totals.

The developed population synthesis procedure will be used to generate baseline populations for an agent-based integrated urban model, which is currently under development at The University of British Columbia’s Okanagan campus. The socio-demographic attributes generated through this population synthesis procedure will then serve as the attributes of the individual and household classes of the respective agents. The integrated urban model will microsimulate individual and household agent changes, such as demographic events, and decisions related to land use, vehicle ownership, and travel activities—largely based on these socio-demographic characteristics of the synthetic population.

This study provides promising insights toward the data-driven simulation-based approach to generate a synthetic population, in particular incorporating heterogeneity among different population groups based on both household and individual attributes. There are several directions for future research. This study assumes the relationships implied by the estimated hierarchical structures apply to all households of a particular type (e.g., couple household without children). Although this is not uncommon in the literature, future research should focus on investigating the existence of heterogeneity within a specific household type. Future studies should also focus on probabilistic methods to identify heterogeneous groups. One of the methods could be using a clustering technique to allocate households and individuals into different groups. Another direction for future research is to explore the use of the BN technique to predict demographic events (i.e., birth, death, aging, marriage, and divorce) within the simulation procedure and test whether these relationships hold in future years.

The model presented in this study is developed for the Central Okanagan region in BC. Future research should focus on data transferability to compare the applicability of the BN hierarchies for different datasets. This study adopts a GR algorithm as a post-processing step to match the marginal totals of both household- and individual-level attributes at a micro-spatial resolution. Comparing the other IPF-based algorithms was not within the scope of this study. Future research can consider comparing these algorithms using a synthetic population pool from the BN to investigate model performance with respect to accuracy and run time.

Footnotes

The authors would like to thank Prof. Johan W. Joubert for providing the initial understanding on the use of BN models. The authors would also like to thank Trevor Nikodym for proofreading this paper.

Author Contributions

The authors confirm their contribution to the paper as follows: study conception and design: M. N. Rahman, M. R. Fatmi; data collection: M. N. Rahman; analysis and interpretation of results: M. N. Rahman, M. R. Fatmi; draft manuscript preparation: M. N. Rahman, M. R. Fatmi. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Natural Sciences and Engineering Research Council (NSERC) Discovery Grant; and the Canada Foundation for Innovation (CFI - JELF).

ORCID iDs

Md Nobinur Rahman

Mahmudur Rahman Fatmi

References

1. Hörl, Balac. Open Data Travel Demand Synthesis for Agent-Based Transport Simulation: A Case Study of Paris and Île-de-France. Arbeitsberichte Verkehrs-und Raumplanung, Vol. 1499, 2020. https://doi.org/10.3929/ethz-b-000412979

2. Hafezi M. H., Habib M. A. Synthesizing Population for Microsimulation-Based Integrated Transport Models Using Atlantic Canada Micro-Data. Procedia Computer Science, Vol. 37, 2014, pp. 410–415.

3. Saadi, Mustafa, Teller, Farooq, Cools. Hidden Markov Model-Based Population Synthesis. Transportation Research Part B: Methodological, Vol. 90, 2016, pp. 1–21.

4. TransLink. 2017 Trip Diary Survey Data. Metro Vancouver: TransLink. October 10, 2019. https://public.tableau.com/app/profile/translink/viz/Trip_Diary_2017/TripDiary2017. Accessed July 18, 2021.

5. Pritchard D. R., Miller E. J. Advances in Population Synthesis: Fitting Many Attributes Per Agent and Fitting to Household and Person Margins Simultaneously. Transportation, Vol. 39, No. 3, 2012, pp. 685–704.

6. B. M., Birkin M. H., Rees P. H. A Dynamic MSM With Agent Elements for Spatial Demographic Forecasting. Social Science Computer Review, Vol. 29, No. 1, 2011, pp. 145–160.

7. Sun, Erath. A Bayesian Network Approach for Population Synthesis. Transportation Research Part C: Emerging Technologies, Vol. 61, 2015, pp. 49–62.

8. Beckman R. J., Baggerly K. A., McKay M. D. Creating Synthetic Baseline Populations. Transportation Research Part A: Policy and Practice, Vol. 30, No. 6, 1996, pp. 415–429.

9. Konduri, Pendyala R. M., Sana, Waddell. A Methodology to Match Distributions of Both Household and Person Attributes in the Generation of Synthetic Populations. Presented at 88th Annual Meeting of the Transportation Research Board, Washington, D.C., 2009.

10. Voas, Williamson. An Evaluation of the Combinatorial Optimization Approach to the Creation of Synthetic Microdata. International Journal of Population Geography, Vol. 6, No. 5, 2000, pp. 349–366.

11. Voas, Williamson. Evaluating Goodness-of-Fit Measures for Synthetic Microdata. Geographical and Environmental Modeling, Vol. 5, No. 2, 2001, pp. 177–200.

12. Srinivasan. Synthetic Population Generation With Multilevel Controls: A Fitness-Based Synthesis Approach and Validations. Computer-Aided Civil and Infrastructure Engineering, Vol. 30, No. 2, 2015, pp. 135–150.

13. Müller, Axhausen K. W. Population Synthesis for Microsimulation: State of the Art. Presented at 90th Annual Meeting of the Transportation Research Board, Washington, D.C., 2011.

14. Saadi, Farooq, Mustafa, Teller, Cools. An Efficient Hierarchical Model for Multi-Source Information Fusion. Expert Systems With Applications, Vol. 110, 2018, pp. 352–362.

15. Guo J. Y., Bhat C. R. Population Synthesis for Microsimulating Travel Behavior. Transportation Research Record: Journal of the Transportation Research Board, 2007. 2014: 92–101.

16. Lim P. P., Gargett. Population Synthesis for Travel Demand Forecasting. Proc., 36th Australasian Transport Research Forum (ATRF), Brisbane, Australia, 2013.

17. Farooq, Bierlaire, Hurtubia, Flötteröd. Simulation Based Population Synthesis. Transportation Research Part B: Methodological, Vol. 58, 2013, pp. 243–263.

18. Borysov S. S., Rich, Pereira F. C. How to Generate Micro-Agents? A Deep Generative Modeling Approach to Population Synthesis. Transportation Research Part C: Emerging Technologies, Vol. 106, 2019, pp. 73–97.

19. Zhang, Cao, Feygin, Tang, Shen Z. J., Pozdnoukhov. Connected Population Synthesis for Transportation Simulation. Transportation Research Part C: Emerging Technologies, Vol. 103, 2019, pp. 1–6.

20. Ilahi, Axhausen K. W. Integrating Bayesian Network and Generalized Raking for Population Synthesis in Greater Jakarta. Regional Studies, Regional Science, Vol. 6, No. 1, 2019, pp. 623–636.

21. Sun, Erath, Cai. A Hierarchical Mixture Modeling Framework for Population Synthesis. Transportation Research Part B: Methodological, Vol. 114, 2018, pp. 199–212.

22. Joubert J. W. Synthetic Populations of South African Urban Areas. Data in Brief, Vol. 19, 2018, pp. 1012–1020.

23. Zou, Yue W. L. A Bayesian Network Approach to Causation Analysis of Road Accidents Using Netica. Journal of Advanced Transportation, Vol. 2017, 2017, pp. 1–18.

24. Casati, Müller, Fourie P. J., Erath, Axhausen K. W. Synthetic Population Generation by Combining a Hierarchical, Simulation-Based Approach With Reweighting by Generalized Raking. Transportation Research Record: Journal of the Transportation Research Board, 2015. 2493: 107–116.

25. Ulak M. B., Yazici, Zhang. Analyzing Network-Wide Patterns of Rail Transit Delays Using Bayesian Network Learning. Transportation Research Part C: Emerging Technologies, Vol. 119, 2020, pp. 102749.

26. Charniak. Bayesian Networks Without Tears. AI Magazine, Vol. 12, No. 4, 1991, pp. 50–63.

27. Buntine. A Guide to the Literature on Learning Probabilistic Networks From Data. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 2, 1996, pp. 195–210.

28. Scutari. Learning Bayesian Networks With the bnlearn R Package. Journal of Statistical Software, Vol. 35, 2010, pp. 1–22.

29. T.-Y., Chow J. Y. Causal Structure Learning for Travel Mode Choice Using Structural Restrictions and Model Averaging Algorithm. Transportmetrica A: Transport Science, Vol. 13, No. 4, 2017, pp. 299–325.

30. Friedman. The Bayesian Structural EM Algorithm. Proc., 14th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, 1998, pp. 129–138.

31. Statistics Canada. Dictionary, Census of Population in 2016: Dissemination area (DA). Ottawa: Statistics Canada. November 16, 2016. https://www12.statcan.gc.ca/census-recensement/2016/ref/dict/geo021-eng.cfm. Accessed July 18, 2021.

32. Statistics Canada. 2016 Census public use microdata file (PUMF): Hierarchical File. Ottawa: Statistics Canada. 2016. https://www150.statcan.gc.ca/n1/en/catalogue/98M0002X2016001. Accessed July 18, 2021.

33. Scutari, Ness. Package ‘bnlearn’, 2020. https://cran.r-project.org/web/packages/bnlearn/bnlearn.pdf. Accessed July 18, 2021.

34. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, 2013.

35. Müller, Siripanich. mlfit: Iterative Proportional Fitting Algorithms for Nested Structures. https://mlfit.github.io/mlfit/; https://github.com/mlfit/mlfit.

36. Lovelace, Ballas. ‘Truncate, Replicate, Sample’: A Method for Creating Integer Weights for Spatial Microsimulation. Computers, Environment and Urban Systems, Vol. 41, 2013, pp. 1–11.

Population Synthesis Accommodating Heterogeneity: A Bayesian Network and Generalized Raking Technique
