Abstract
The agent-based microsimulation (ABM) technique has emerged as a popular method to capture the complex interactions among transportation, urban form, and the environment within large-scale integrated urban modeling systems. Unlike traditional aggregated four-step models, these models simulate multifaceted decisions and their interactions while accommodating the heterogeneous behavior of individuals and households (
A wide variety of approaches have been proposed to construct a synthetic population, such as iterative proportional fitting (IPF), iterative proportional updating (IPU), combinatorial optimization (CO), and fitness-based synthesis (FBS) (
In recent years, simulation-based approaches to generate synthetic populations have received considerable attention because of their advantages over deterministic models in handling high-dimensional problems and less dependency on the available microsamples (
Sun and Erath (
Although statistical learning approaches can replicate the joint distribution of the original microsample and generate a synthetic population pool, it is challenging to match the marginal distributions of all variables at a micro-spatial resolution, such as census tracts or dissemination areas (DAs), for an entire region at the same time (
The contributions of this study in the population synthesis literature are three-fold: (1) accommodating heterogeneity based on both household and individual attributes; (2) tackling missing/incomplete observations in the microsample; and (3) generating a true synthesis of the population from the microsamples. As the households are heterogeneous because of diversity and complexity in different household structures and the living arrangements of individuals, this study accommodates heterogeneous inter-relationships within different household compositions capturing the complex structure at the household and individual levels. Heterogeneity is accommodated by considering several compositions of households, such as single-member, couple with children, couple without children, lone parent with children, and other types of households. The hierarchical structure of an individual’s role within a household is captured based on income and age. This study applies the BN to generate a synthetic population pool accounting for heterogeneity at the household and individual levels. While generating the synthetic population pool, the incomplete/missing observations in the microsample are treated by adopting a structural expectation–maximization (SEM) algorithm to perform a maximum likelihood (ML) estimation on the parameters from the incomplete dataset. The synthetic population pool is reconstructed from the available microsample instead of replicating the existing samples. A data-driven technique is adopted to generate effective and optimal structures, which is further validated by logical relationships. This process confirms that the BN structure retains the fundamental relationships and does not create unrealistic relationships among variables. A GR algorithm is used as the post-processing step to match the marginal total of household- and individual-level attributes at a micro-spatial resolution for an entire region. 
Finally, the proposed population synthesis procedure is implemented for the Okanagan region of British Columbia (BC), Canada. A 100% synthetic population is generated for this region using the public use microdata file (PUMF) as the microsample and zonal marginal total from the Canadian Census. The spatial resolution for this population synthesis procedure is the DA—that is, the smallest spatial unit for which the marginal total is available from Census Canada. This population synthesis will be used as the input for an integrated urban transportation, land use, and energy model, which is currently under development at The University of British Columbia’s Okanagan campus.
Method
The process of synthetic population generation in this paper is twofold: first, the BN is adopted as a probabilistic model to create a synthetic population pool that accommodates heterogeneity among the different household types; second, a GR algorithm is adopted as a post-processing step to match the marginal totals of both household- and individual-level attributes at a micro-spatial resolution. Figure 1 illustrates the framework for population synthesis used in this paper.

Population synthesis framework.
A BN is a probabilistic graphical model in which conditional dependencies of a set of random variables are represented by a directed acyclic graph (DAG),

An example of a Bayesian network.
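For reference, the DAG structure of a BN corresponds to the standard factorization of the joint distribution into one conditional distribution per node. The symbols $X_i$ (the $i$-th attribute) and $\mathrm{Pa}(X_i)$ (the parents of $X_i$ in the DAG) are assumed notation for this sketch and may differ from the notation used in the rest of the paper.

```latex
P(X_1, X_2, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

A node without parents contributes its marginal distribution $P(X_i)$ to the product.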
The construction of a BN consists of two main steps (
where
The second step of BN construction is parameter learning given the BN structure. However, the microsamples often contain missing/incomplete observations, which require treatment. Most statistical models cannot directly analyze data with missing/incomplete values (
In the post-processing stage, an IPF-based algorithm, specifically a GR algorithm, is adopted to fit the synthetic population pool to the marginal totals at the micro-spatial resolution for an entire region. This study adopts a GR algorithm because previous studies suggested that it is faster than, and outperforms, the other IPF-based algorithms (
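To make the idea of fitting a pool to marginal totals concrete, the following is a minimal sketch of classic two-dimensional IPF (raking). It is not the GR algorithm used in this study, which fits household- and individual-level controls together; the function name `rake`, the seed table, and the target margins are all hypothetical values chosen for illustration.

```python
import numpy as np

def rake(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Classic two-way IPF: alternately rescale rows and columns of a seed table."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so that row sums match the row targets.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column so that column sums match the column targets.
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns match exactly after the last step; stop once rows also match.
        if np.allclose(table.sum(axis=1), row_targets, rtol=0.0, atol=tol):
            break
    return table

# Hypothetical seed (e.g., counts from a sampled pool) and hypothetical zonal margins.
seed = np.array([[40.0, 10.0], [20.0, 30.0]])
fitted = rake(seed, row_targets=[60, 40], col_targets=[55, 45])
```

The sketch assumes a strictly positive seed and margins with equal grand totals; zero cells in the seed stay at zero under IPF, which relates to the zero-cell problem discussed in the results.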
Study Area and Data
Study Area
The study area for this exercise is the Okanagan region, located in the southern interior of the province of BC, Canada. The region includes the following five major cities: Kelowna, West Kelowna, Lake Country, Vernon, and Peachland (Figure 3). This is the most densely populated area in interior BC, with approximately 384 people per square kilometer, and it accounts for over 60% of the population of the Okanagan region. While the population synthesis framework used in this paper can be applied to any other region, this paper uses these five cities as a case study. The synthetic population is generated at the DA level to capture micro-spatial resolution. The study area is divided into 293 DAs, and each DA represents “a small geographic unit with an average population of 400 to 700 persons” (

Study area.
Data
In this paper, the data used for generating the synthetic population include the 2016 Canadian Census hierarchical PUMF as the microsample and the 2016 Canadian Census information at the DA level as the control total. However, a microsample specific to the Okanagan region is not available in the PUMF data. As such, the PUMF records within BC but outside of the Vancouver Census metropolitan area (CMA) are used as the microsample for the Okanagan region, which includes 21,270 individuals and 9210 households. The aggregated marginal total from the 2016 Canadian Census information for the study area includes 218,501 individuals and 91,690 households. The spatial resolution of the Census information considered in this study is the smallest zonal level, the DA. The household- and individual-level attributes and their categories used in this study are presented in Table 1.
Individual- and Household-Level Attributes
Persons less than 15 years of age.
Households are heterogeneous because of diversity and complexity in different household structures and the living arrangements of individuals. The heterogeneous inter-relationships at both household and individual levels could be unique across household structures. As such, the following pre-processing is done to model these inter-relationships within different household compositions and to capture the complex structure at both levels.
(i) Household structures: the household compositions/structures used in this study are consistent with the household structures identified in the 2016 PUMF data. The following five household compositions are considered in this study to capture heterogeneous inter-relationships at the household level: single-member household, couple household without children, couple household with children, lone-parent household with children, and other household. The “other household” category is defined as a household that does not fall into any specific category. The data consists of 13 variables (i.e., 7 individual-level and 6 household-level variables) with 2665 observations for single-member households, 2870 observations for couple households without children, 1956 observations for couple households with children, 730 observations for lone-parent households with children, and 989 observations for other households.
(ii) Living arrangements of individuals: although there is one reference person per household in the PUMF data, the Canadian Census identifies this reference person based on who was listed first on the questionnaire (
Results
This section presents the results of the synthetic population. As indicated earlier, this study implemented the BN + GR procedure to generate a 100% synthetic population at the smallest zonal level of DA for the Okanagan region of BC, Canada. First, the BN structures learned from the different household compositions are discussed. Then, the performance of the simulated synthetic population pool generated from the BNs is evaluated based on various metrics. Finally, this section concludes with a quantitative assessment of the synthetic population during the post-processing step of the proposed framework.
All tests were performed on a personal computer with Intel (R) Xeon (R) W-1270 CPU (8 cores, 3.40 GHz) and 16.0 GB RAM in the Operating System of Microsoft Windows 10 Enterprise (x64).
Bayesian Network Structures for Different Household Types
To capture the unique heterogeneous inter-relationships at both household and individual levels, BN structures were developed for five different household compositions. In addition, following Casati et al. (

Partial Bayesian network structures: (
This study assumes the relationships implied by the estimated hierarchical structures apply to all households of a particular type (e.g., couple household without children). This is, however, not uncommon in the literature. Sun and Erath (
In contrast, significant differences have been observed among different household types, which indicates the existence of heterogeneity across the different household structures and the hierarchical structure of an individual’s role within a household. For example, households with children (i.e., couple household with children and lone-parent household with children) tend to have a higher number of rooms, suggesting the dwelling unit is influenced by household size. This reflects the likelihood that the number of rooms in a dwelling unit is linked to the number of persons living in that household. The relationships retained in the developed BN structures are also consistent with the BNs developed by other researchers. For example, the BN structures for multiple member households developed by Sun and Erath (
Assessing the Synthetic Population Pool Sampled From the BNs
A synthetic population pool (a total of approximately 1,000,000 individuals and approximately 440,000 households) was drawn from the BNs using forward sampling to recreate the original microsample. The same proportions of different household compositions in the PUMF data were used to create the pool. For example, the couple household without children category represented the highest proportion of households (31.2%) in the microsample. To keep the same proportion in the synthetic population pool, 135,000 households were sampled from this category. During this process, the reference person (for single-member households and other households), the reference person and the spouse (for couple households without children and couple households with children), and the lone parent (for lone-parent households with children) were sampled at the same time. To create the rest of the members within the household (for couple households with children, lone-parent households with children, and other households), the sampled household size attribute along with the BN for the other members were used to generate the full synthetic household. This process preserves the attributes of household members corresponding to their household attributes.
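Forward sampling visits the nodes of a BN in topological order and draws each attribute conditional on the values already drawn for its parents. The sketch below illustrates this on a hypothetical two-node network (household size → number of rooms); the node names, categories, and all probabilities are invented for illustration and are not the structures or parameters estimated in this study.

```python
import random

# Hypothetical toy network: household_size -> rooms. All probabilities are invented.
NODES = ["household_size", "rooms"]  # topological order: parents come first
PARENTS = {"household_size": [], "rooms": ["household_size"]}
CPT = {
    # Conditional probability tables keyed by the tuple of parent values.
    "household_size": {(): {"1": 0.3, "2": 0.4, "3+": 0.3}},
    "rooms": {
        ("1",): {"1-4": 0.7, "5+": 0.3},
        ("2",): {"1-4": 0.5, "5+": 0.5},
        ("3+",): {"1-4": 0.2, "5+": 0.8},
    },
}

def forward_sample(rng):
    """Draw one synthetic record by sampling each node given its sampled parents."""
    record = {}
    for node in NODES:
        parent_values = tuple(record[p] for p in PARENTS[node])
        dist = CPT[node][parent_values]
        record[node] = rng.choices(list(dist), weights=list(dist.values()))[0]
    return record

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool = [forward_sample(rng) for _ in range(10_000)]
share_single = sum(r["household_size"] == "1" for r in pool) / len(pool)
```

With a large enough pool, the sampled share of each category approaches the probability encoded in the network, which is why the pool can be compared against the microsample distributions.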
This study used graphical analysis to visualize the performance of the BN in preserving the joint distribution among household- and individual-level attributes in generating the synthetic population pool. The analysis suggests that the majority of the household attributes in the synthetic pool are a very good fit with the microsample, as shown in Figure 5. For example, the difference between the synthetic pool and the PUMF data is within −1% to 1% for all categories within dwelling type, tenure type, period of dwelling construction, number of rooms, and household size attributes. It was also observed that the difference between the synthetic pool and the PUMF data increases with the number of categories within an attribute. The marginal distributions of individual-level attributes lead to a similar conclusion, as shown in Figure 6. However, household-level attributes perform better than individual-level attributes. This could be attributed to the individual-level attributes consisting of more categories than household-level attributes, which adds more intricacy to the BNs. To examine further, different multivariate joint distributions among household-level attributes (dwelling type × household size, dwelling type × household income, dwelling type × tenure type, dwelling type × period of dwelling construction, dwelling type × number of rooms, and dwelling type × household size × household income × tenure type × period of dwelling construction × number of rooms) were plotted to analyze the scalability. It is evident from Figure 7 that the multivariate joint distributions of household-level attributes in the synthetic pool are highly acceptable. For instance, the observed

Household-level marginal distribution comparison: (

Individual-level marginal distribution comparison: (

Household-level joint distributions: (
To investigate the underlying association between household- and individual-level attributes, the model-based Cramér’s V was estimated (Equation 2), which was adopted from Sun et al. (
where

Joint distribution of both household- and individual-level attributes: (
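As a point of reference for the association measure discussed above, the following sketch computes the standard Cramér's V from a two-way contingency table. This is the textbook definition, offered under the assumption that it conveys the general idea; the model-based variant in Equation 2 may differ in detail. The function name `cramers_v` and the example tables are hypothetical.

```python
import numpy as np

def cramers_v(table):
    """Standard Cramér's V for a two-way contingency table of counts."""
    observed = np.asarray(table, dtype=float)
    n = observed.sum()
    # Expected counts under independence of the row and column attributes.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical tables: one with independent attributes, one perfectly associated.
v_independent = cramers_v([[10, 20], [20, 40]])
v_perfect = cramers_v([[50, 0], [0, 50]])
```

Values near 0 indicate little association between the two attributes and values near 1 indicate strong association; the sketch assumes every row and column has a nonzero total.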
Assessing the Results of the Synthetic Population
In the post-processing stage, an IPF-based algorithm, specifically a GR algorithm, was adopted to fit the synthetic population pool to the marginal totals at the micro-spatial resolution for an entire region. This study adopts a GR algorithm because previous studies suggested that the GR is faster than, and outperforms, the other IPF-based algorithms (
At the region level, six household-level attributes (i.e., dwelling type, household size, household income, tenure type, period of dwelling construction, and number of rooms) and three individual-level attributes (i.e., gender, age, and marital status) were used to control the marginal totals. The results suggest that the BN + GR procedure matches the synthetic population closely to the aggregated marginal totals. For example, a population of 218,981 was generated by using the proposed population synthesis approach, in comparison to the Census total population of 218,501 in the study area. This is a slight over-representation of the Census total population by 0.22%. Similarly, a total of 91,745 households were generated, which is a slight over-representation of the Census total households by 0.06%. Furthermore, the distributions of the household- and individual-level attributes in the synthetic population and Census totals were investigated. The analysis suggests that the majority of the household attributes in the synthetic population are a very good fit with the Census total, as shown in Figure 9. For example, the difference between the synthetic population and the Census total is within −1% to +1% for most of the categories within the dwelling type, household size, tenure type, period of dwelling construction, and number of rooms. However, the difference increases with the number of categories within an attribute, as seen for household income. Similar results were observed for the individual-level attributes, as shown in Figure 10. For example, the difference between the synthetic population and the Census total is within −1% to +1% for most of the categories within gender and marital status. However, one of the categories within the individual age attribute shows a difference greater than 4.5%.
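The region-level over-representation figures can be checked directly from the totals reported above. The short sketch below uses only those four totals; the helper name `over_representation_pct` is hypothetical.

```python
def over_representation_pct(synthetic, census):
    """Signed percentage difference of a synthetic total relative to the Census total."""
    return 100.0 * (synthetic - census) / census

# Totals as reported in the text.
population_pct = over_representation_pct(218_981, 218_501)
household_pct = over_representation_pct(91_745, 91_690)

print(f"population: {population_pct:.2f}%")
print(f"households: {household_pct:.2f}%")
```

A positive value means the synthetic total exceeds the Census total.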

Comparison between Census and synthetic population for household-level attributes: (

Comparison between Census and synthetic population for individual-level attributes: (
To assess the performance of the proposed population synthesis approach to capture the micro-spatial resolution, the synthetic population was also generated at a DA level within the study area, which consists of 293 DAs. Dwelling type and gender were used to control marginal totals at the DA level. In this study, absolute error percentage (Equation 3) between the generated synthetic population and Census total was chosen to investigate the performance of the estimated synthesized population.
where
Figure 11 shows the absolute error percentages between the generated synthetic population and Census total at the DA level for gender and dwelling type attributes. It is apparent from Figure 11 that the error percentages of the gender categories are satisfactory within most of the DAs. For example, approximately 74% of the DAs showed an error percentage within a range of 0%–4.5% for different gender categories. Only 2% of the DAs showed a difference of greater than 10%. However, dwelling type showed higher error percentages than the gender attribute, which is in line with the finding above that the difference increases with the number of categories within an attribute. Overall, in comparison with the region-level marginal totals, a relatively higher percentage of errors has been observed at the DA level. This may be because of the zero-cell problem of the DAs, which negatively affects the accuracy of the results. In addition, the use of the integerization procedure to convert fractional weights to integers at each DA contributes to the higher error percentage.
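The DA-level summary above can be illustrated with a short sketch. It assumes the usual form of absolute error percentage, |synthetic − Census| / Census × 100, which may not match Equation 3 exactly, and it uses invented counts for five hypothetical DAs rather than any data from this study.

```python
import numpy as np

# Hypothetical Census and synthetic counts for one category across five DAs.
census = np.array([250, 310, 0, 280, 400])
synthetic = np.array([260, 305, 3, 281, 452])

# Assumed form of the absolute error percentage; DAs with a zero Census
# count (zero cells) are left undefined rather than divided by zero.
with np.errstate(divide="ignore", invalid="ignore"):
    ape = np.where(census > 0, 100.0 * np.abs(synthetic - census) / census, np.nan)

# Share of DAs (with a defined error) whose error is at most 4.5%.
valid = ~np.isnan(ape)
share_within = float((ape[valid] <= 4.5).mean())
```

How zero-cell DAs are treated is a choice made for this sketch; excluding them is only one option.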

Difference between Census and synthetic population at the dissemination area level: (
Furthermore, a comparison between Census and synthetic population by DA size based on total population is presented in Figure 12. It is evident from the box-and-whisker plot that the absolute difference increases with the DA size. This further supports the explanation that the integerization procedure contributes to the higher absolute difference as the size of the DA increases.
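To show how integerization can introduce per-category differences, the sketch below applies largest-remainder rounding to a set of fractional weights. The text does not state which integerization procedure was used, so this is only one possible approach, and the function name `integerize` and the example weights are hypothetical.

```python
import math

def integerize(weights):
    """Largest-remainder rounding: floor every weight, then hand out the leftover units."""
    floors = [math.floor(w) for w in weights]
    # Units still to assign so that the integer total matches the rounded weight total.
    leftover = round(sum(weights)) - sum(floors)
    # Rank entries by fractional part, largest first.
    order = sorted(range(len(weights)), key=lambda i: weights[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors

# Hypothetical fractional weights for four records in one DA.
weights = [2.6, 1.7, 0.4, 3.3]
counts = integerize(weights)
```

The total is preserved, but individual entries move by up to one unit, so category counts within a DA can drift from their fitted fractional values.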

Absolute difference by dissemination area size: (
Finally, the generated synthetic population and households were distributed in the five major cities within the study area, as shown in Figure 13. This demonstrates that the population density and household density vary dramatically within the region, and the City of Kelowna accounts for the majority of the population (and households) compared to the other cities. For example, a total of 53,960 households and 127,078 individuals were generated in the City of Kelowna, which represents approximately 37% of the households and approximately 41% of the population within the study area, respectively. In addition, it is evident from the figure that the urban core within each city represents the densely populated areas. This indicates that the synthetic population, generated from the BN + GR, resembles the original density in the study area.

Comparison between population and household distribution at the dissemination area (DA) level: (
Conclusions
This study adopts a BN approach and GR technique to generate a 100% synthetic population at the smallest zonal level of DA for the Central Okanagan region of BC, Canada. The BN technique is adopted to tackle the following challenges in a population synthesis procedure: incorporate heterogeneity among different population groups based on both household and individual attributes; handle the missing/incomplete observations in the microsample; and generate a synthetic population pool resembling the original microsample. A data-driven technique is adopted during the BN structure learning process, which is further validated to confirm that the BN retains the fundamental relationships among household and individual attributes. Following the generation of the synthetic population pool, a GR algorithm is adopted to fit the population with the control total at the DA level.
The implementation results suggest that capturing heterogeneity within the BN has benefitted the reconstruction process, which generates a matching population pool from the available microsample efficiently and accurately. The synthetic population pool generated from a series of BNs is also able to capture the underlying inter-relationships among the individuals corresponding to the household they belong to. Finally, applying the GR algorithm as a post-processing step to the BN-generated pool matches the synthetic population closely to the aggregated marginal totals.
The developed population synthesis procedure will be used to generate baseline populations for an agent-based integrated urban model, which is currently under development at The University of British Columbia’s Okanagan campus. The socio-demographic attributes generated through this population synthesis procedure will then serve as the attributes of the individual and household classes of the respective agents. The integrated urban model will microsimulate individual and household agent changes, such as demographic events, and decisions related to land use, vehicle ownership, and travel activities—largely based on these socio-demographic characteristics of the synthetic population.
This study provides promising insights toward the data-driven simulation-based approach to generate a synthetic population, in particular incorporating heterogeneity among different population groups based on both household and individual attributes. There are several directions for future research. This study assumes the relationships implied by the estimated hierarchical structures apply to all households of a particular type (e.g., couple household without children). Although this is not uncommon in the literature, future research should focus on investigating the existence of heterogeneity within a specific household type. Future studies should also focus on probabilistic methods to identify heterogeneous groups. One such method could be a clustering technique that allocates households and individuals into different groups. Another direction for future research is to explore the use of the BN technique to predict demographic events (i.e., birth, death, aging, marriage, and divorce) within the simulation procedure and test whether these relationships hold in future years.
The model presented in this study is developed for the Central Okanagan region in BC. Future research should focus on data transferability to compare the applicability of the BN hierarchies for different datasets. This study adopts a GR algorithm as a post-processing step to match the marginal totals of both household- and individual-level attributes at a micro-spatial resolution; comparing it with the other IPF-based algorithms was outside the scope of this study. Future research can compare these algorithms using a synthetic population pool from the BN to investigate model performance with respect to accuracy and run time.
