Introduction
Spatial interaction is essential for urban activity and is ultimately afforded by the transportation network. Can the geographical distribution of urban activity thereby be inferred directly from some measure of centrality derived from the transportation system? In this paper, we combine theories from spatial interaction modelling (e.g. Wilson, 2000) and network centrality (e.g. Newman, 2008) to develop a model to test this hypothesis with encouraging results. As a framing, we begin by subdividing the problems faced by planners and theorists into:
Computational models are attractive as tools for studying these dependencies, which leads us to
However, even if we were to solve the modelling problem, we would still be left with a
Our intention is to strike at the modelling and data problems simultaneously by exploring an alternative approach. We aim to infer the distribution of urban activity, by modelling only the physical characteristics of geographical zones and their interactions, i.e. without reliance on
The first part of the paper concerns theoretical background and derivation of centrality models for predicting urban activity. We then present our data sources, followed by ‘Methods’ and ‘Results’ sections where the model implementation and empirical validation processes are described.
Theory
Background
From the common wisdom that cities tended, from early on, to be established on trade routes, natural ports or river crossings, stems the fundamental assumption of all spatial economic theories: a location with good accessibility is more attractive than a location with poor accessibility. This assumption goes back to Von Thünen (1826). A breakthrough study by Hansen (1959) demonstrated that locations with high accessibility were developed earlier and more densely than less accessible locations. Along the same lines, Alonso (1964) formulated a theory linking accessibility and land use. According to Krugman (1996) and Fujita et al. (1999), a great part of spatial development can be explained by the interplay between two major driving forces: (i) economies of scale and (ii) spatial factors such as transport costs and land prices.
To take the leap from these concepts towards an urban centrality measure, we propose to use a simplified model of urban economic activity in combination with a much more detailed spatial representation. This makes it possible to view the urban system as a network of interacting locations (Andersson et al., 2006; Barthélemy, 2011; De Montis et al., 2013).
Urban activity
A central concept in this paper is the notion of urban activity (denoted
Local characteristics
A fundamental property of a location is its capacity to be adapted to human activity, determined by basic usability such as local access to buildable land and infrastructure. These local characteristics (denoted
Accessibility and centrality
Consider the accessibility to attractions as defined by Hansen (1959);
This concept is powerful and forms the basis for the measures that we elaborate in this paper. One outcome of such a centrality concept is the famous PageRank algorithm used by Google (Brin and Page, 1998), which ranks web documents by importance. Documents on the internet are given a higher ranking if they are linked to from other pages with high ranking. Notably, at no point does the search engine have to analyse the semantic content of the documents, although that content is exactly what it seeks to rank by importance. This approach has also been applied to physical road networks by, e.g. Jiang (2006) and Chin and Wen (2015), with the main objective of describing human movement. El-Geneidy and Levinson (2011) have tackled the centrality calculation from a different direction, using data on actual flows as a starting point. Our proposed centrality measures are also based on flows of interactions, but do not require specific travel data. Instead, the computations are performed by modelling these flows using a general interaction function with infrastructure network data as input (although modelling accuracy could likely be improved by using detailed empirical interaction data).
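The recursive idea described above, where a node is important if important nodes link to it, can be illustrated with a minimal power-iteration sketch. This is plain eigenvector centrality on a small hypothetical network, not PageRank (there is no damping factor) and not the measure developed in this paper; the function name, the four-node adjacency matrix and the tolerance are all illustrative assumptions.

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration on a non-negative adjacency matrix (list of lists).

    Returns scores normalised to sum to 1. Assumes a connected,
    non-bipartite network so that the iteration converges.
    """
    n = len(adjacency)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # A node's new score is the sum of the scores of the nodes linking to it.
        updated = [
            sum(adjacency[j][i] * scores[j] for j in range(n)) for i in range(n)
        ]
        total = sum(updated)
        updated = [value / total for value in updated]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Hypothetical undirected network: node 0 is linked to every other node,
# nodes 1 and 2 are also linked to each other, node 3 only reaches node 0.
links = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
centrality = eigenvector_centrality(links)
```

In this toy network the hub (node 0) receives the highest score and the peripheral node 3 the lowest, without any information about the nodes other than their connections.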
Using centrality measures based on the road network to predict urban flows and activities is not a new idea; see, for example, Hillier and Hanson (1989), Porta et al. (2009), Sevtsuk and Mekonnen (2012) and Gao et al. (2013). However, the measures that have received most attention (closeness and betweenness centrality) cannot easily be incorporated into a spatial interaction modelling framework, which is our main reason for instead exploring extensions of eigenvector centrality.
Closing the loop from activities to flows and back again to activities
Our modelling approach departs from classical spatial interaction modelling (Batty, 2013; Wilson, 2000), where local activity levels
From activity to spatial interaction
Spatial interaction models arise by subjecting the logic of the gravity model to local constraints on the size of flows in the system. Flows of interactions between zones can then be estimated, by distributing economic flows from origins to destinations in proportion to their relative attractions (see Figure 1). As noted by Wilson (2000) such a model formulation will take into account the competition between different locations for attracting incoming flows.

Figure 1. Deriving flows from activity and attractivity. The flow is shown as unidirectional, but a flow in the opposite direction is also present and can be computed analogously. See supplemental material for a detailed derivation of the interaction model.
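As a rough illustration of distributing flows from origins to destinations in proportion to relative attraction, the sketch below implements a generic origin-constrained interaction model. It is not the derivation given in the supplemental material: the exponential deterrence function, the parameter `beta`, the inclusion of within-zone flows and all numerical values are assumptions made for the example.

```python
import math


def constrained_flows(activity, attraction, cost, beta=0.1):
    """Origin-constrained flows T[i][j].

    Each origin i distributes its activity over destinations j in
    proportion to attraction[j] * exp(-beta * cost[i][j]), so that the
    outflows from every origin sum to that origin's activity.
    """
    n = len(activity)
    flows = []
    for i in range(n):
        # Pull of every destination as seen from origin i.
        pull = [attraction[j] * math.exp(-beta * cost[i][j]) for j in range(n)]
        total_pull = sum(pull)
        # Destinations compete: each one gets its share of the total pull.
        flows.append([activity[i] * p / total_pull for p in pull])
    return flows


# Hypothetical three-zone system (activity, attraction and travel costs are made up).
activity = [100.0, 50.0, 20.0]
attraction = [1.0, 2.0, 2.0]
cost = [
    [0.0, 5.0, 20.0],
    [5.0, 0.0, 10.0],
    [20.0, 10.0, 0.0],
]
flows = constrained_flows(activity, attraction, cost)
```

Because the shares are normalised per origin, a destination gains flow only at the expense of the others, which is the competition effect noted by Wilson (2000).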
From spatial interaction back to activity
In many cases, the distribution of activities in the system is of interest in itself. Salient questions include how infrastructural change affects things like urban extent, patterns of interaction, housing, jobs, and so on. Infrastructural data are considerably more widely available, complete and consistent than demographic and economic data on the nebulous concepts of activity and attraction, which we must approach via their rich flora of expressions such as buildings, land value and population. If we can tease most of the information we need out of the infrastructure of interactions, we are in much better shape with regard to both data supply and model design. We may then circumvent the need to figure out how various sub-models interact, and we are at least less exposed to the ontological mismatch between models and reality.
In Figure 2 we outline the logical sequence in which we develop our preferential centrality model by using a ‘quasi-growth model’ –

Figure 2. From spatial interaction to activity modelling.
independently of the quasi-growth constants
The model may be substantially improved by positing that activity in itself stimulates attractivity,
We call this new non-linear measure
Interaction function
The most common choices for interaction functions are the exponential function
Data
The data used for this study are of three kinds: road network, property polygons and land taxation values. The road network is used for three purposes: finding accessible areas within the polygons, finding connections from the polygons onto the road network and, finally, performing the distance calculations between zones. The property polygons are assigned a land taxation value from the taxation database according to a common identifier. They are thereafter aggregated into zones based on area and type code. In this study, the municipality of Gothenburg is chosen as a prototype area to develop, test and validate the model.
Roads and streets are imported with preserved topology and attributes from OpenStreetMap (OSM). OSM has been subject to questions about its quality, but studies have found that the data quality is on par with that of other data sources (Dhanani et al., 2012; Haklay, 2010). The reasons for choosing OSM are several: it is readily available to download, it contains the necessary attributes for the calculation, it has worldwide coverage for future expansions of the model and the data are open.
The entire extent of Sweden is partitioned into ‘properties’. Properties are owned either by individuals or legal entities, or jointly in the form of associations. The precision and quality of these data are high, since their purpose is to establish and prove ownership (which needs to be precise and just). Properties are of different types and usages; therefore, they are classified and assigned a type code based on usage by the Swedish taxation authority. The extent and borders of these properties are obtained from the Swedish land survey.
The Swedish taxation authority assigns to all properties a taxation value that should represent about 75% of the market value. This value is arrived at by a procedure that takes several characteristics into consideration, such as area, closeness to water, building type, sales values of the neighbouring properties, etc. The quality of these data is also very good in the sense that the valuation is done according to a legal criterion, although the values for industries are somewhat uncertain because industrial properties are seldom sold. These few sales therefore have a disproportionately large impact on the taxation values of industrial properties. This has to be taken into account in the regression analysis. All the taxation values and type codes are acquired from the Swedish taxation authority.
Methods
The procedure for model exploration and validation is roughly composed of three steps: (1) data preparation in order to create the input for the activity model as well as preparing the empirical data used in the last step, (2) running the activity model and (3) finally, using the results from the models in a multiple spatial regression analysis with the empirical values.
For the activity model, we compare four different versions: the local model, the monocentric model, the iterative eigenvector model and the iterative preferential model. Our aim is to assess whether or not the more elaborate iterative models provide any additional predictive capabilities compared to the simpler versions. To find out whether the models are capable of capturing all of the spatial dependencies, we have performed spatial testing (Anselin, 1988) in the regression analysis.
Data preparation
Spatial entities
The spatial entities used in the activity model and the multiple regression analysis are realised as zones, defined as one or more aggregated properties. All properties smaller than 3000 square metres are aggregated into zones by dissolving common borders, if they have the same taxation type code.
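The aggregation rule can be sketched without any GIS library by treating properties as nodes and shared borders as edges, then merging with a union-find structure. This is only an illustration, not the FME workflow used in the study: the property identifiers, areas, type codes and adjacency list are hypothetical, and the sketch assumes that two neighbours are merged only when both are below the area threshold and share a type code.

```python
def aggregate_into_zones(properties, neighbours, max_area=3000.0):
    """Group small, same-type, bordering properties into zones.

    properties: dict mapping property id -> (area in square metres, type code)
    neighbours: iterable of (id_a, id_b) pairs that share a border
    Returns a sorted list of zones, each a sorted list of property ids.
    """
    parent = {pid: pid for pid in properties}

    def find(pid):
        # Follow parent links to the zone representative (with path halving).
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    for a, b in neighbours:
        area_a, type_a = properties[a]
        area_b, type_b = properties[b]
        # Assumed reading of the rule: both properties must be small and of the same type.
        if area_a < max_area and area_b < max_area and type_a == type_b:
            parent[find(a)] = find(b)

    zones = {}
    for pid in properties:
        zones.setdefault(find(pid), []).append(pid)
    return sorted(sorted(members) for members in zones.values())


# Hypothetical properties: id -> (area, type code).
properties = {
    "A": (800.0, 220),
    "B": (1200.0, 220),
    "C": (5000.0, 220),
    "D": (900.0, 320),
    "E": (700.0, 220),
}
neighbours = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
zones = aggregate_into_zones(properties, neighbours)
```

Here only A and B are merged: C is too large, D has a different type code, and E borders only D.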
Geographical analysis of polygon features is subject to the modifiable areal unit problem (MAUP) (Openshaw and Taylor, 1979). The spatial partitioning of land must therefore be carefully chosen. The justifications for using zones as spatial units are that properties are readily available, have a designated usage and can provide useful output in planning applications. Property-based zones also simplify the empirical comparisons, since model and data will have the same spatial representation.
Connection between road network and zones
We do not use detailed data about physical connections between zones and the road network. Instead, approximate ‘virtual’ connections are created in the road network model by choosing the shortest Euclidean line between each zonal centroid and the connection-permissible roads. Motorways, trunk roads and other roads with high speed limits are not considered permissible for these virtual connections.
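A virtual connection of this kind amounts to projecting the centroid onto each permissible road segment and keeping the closest projection. The sketch below shows this in planar coordinates; the road classes reuse OSM `highway` values, but the coordinates and the helper names are hypothetical, and the speed-limit criterion mentioned above is not modelled.

```python
import math


def nearest_point_on_segment(point, start, end):
    """Closest point to `point` on the straight segment from `start` to `end`."""
    px, py = point
    ax, ay = start
    bx, by = end
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return start  # Degenerate segment: both ends coincide.
    # Projection parameter, clamped so the result stays on the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def virtual_connection(centroid, roads, excluded=("motorway", "trunk")):
    """Return (distance, connection point, road class) for the nearest permissible road."""
    best = None
    for road_class, start, end in roads:
        if road_class in excluded:
            continue  # Not permissible for virtual connections.
        candidate = nearest_point_on_segment(centroid, start, end)
        distance = math.dist(centroid, candidate)
        if best is None or distance < best[0]:
            best = (distance, candidate, road_class)
    return best


# Hypothetical road segments around a zone centroid at the origin (metres).
roads = [
    ("motorway", (-10.0, 1.0), (10.0, 1.0)),      # closest, but excluded
    ("residential", (-10.0, -5.0), (10.0, -5.0)),
    ("tertiary", (8.0, -10.0), (8.0, 10.0)),
]
connection = virtual_connection((0.0, 0.0), roads)
```

The motorway is the nearest road in this example but is skipped, so the centroid connects to the residential street 5 metres away.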
Zonal weights – Local characteristics
A zonal weight (
Accessible areas are here stipulated as land that can be accessed from roads. The assumption in the model is therefore that only the area within a certain distance from a road can be developed. These areas are created by buffering the roads (30 metres in the baseline case) and performing a union overlay onto the properties.
Buildable areas are here defined as firm ground suitable for buildings. Areas used by (or very close to) road or rail infrastructure are not considered buildable.
Permitted areas are those that, according to planning restrictions, are allowed for development. In our current model implementation, productive forestry, agricultural land and areas used for special purpose buildings are considered as not permitted.
A basic attractivity factor is closeness to open water, which can have a large effect on land value and land taxation. Since our study area (Gothenburg) is situated by the coast, we must include some approximation for this effect. We have chosen to include the water attraction as a multiplicative factor of 1.5 applied to the zonal weights of zones with centroids within 500 metres of the coastline.
Implementation of the activity model
To arrive at zone-to-zone impedances
Compared to the iterative models,
Zonal weights are mainly used as input to the iterative activity models. However, for comparative purposes we also investigate a
Spatial regression
Preparation of the spatial regression analysis data
The two independent variables are the prediction from the activity model and the amount of industrial area per zone. The reason to include the amount of industrial area in the regression model is that industrial properties have on average a lower taxation value due to the taxation process.
The dependent variable is the property taxation value. For some records in the taxation database, there is not a 1:1 relationship to property polygons. We handle this by aggregation, de-aggregation and filtering. We start from 60,137 property polygons and arrive at 27,628 zones after aggregation. Out of these, we have empirical taxation values for 12,062 zones, hence only they are used in the regression.
Weight matrix creation
In order to specify a regression model with spatial diagnostics, a spatial weights matrix has to be created. The weights matrix in this study is created by computing the road network impedance between all places and then applying a cut-off value to determine which zones are to be treated as adjacent. We have chosen a cut-off value of 3000 metres. To examine the robustness of the model, a weights matrix based on a Euclidean distance of 600 metres is also tested in the regression.
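The cut-off construction can be sketched as follows: zones whose mutual impedance is at or below the threshold are treated as neighbours, and everything else gets weight zero. The impedance values below are invented, and the row-standardisation step is an assumption made for the example (it is a common convention in spatial regression software), not something stated in the text.

```python
def cutoff_weights(impedance, cutoff=3000.0):
    """Binary neighbour weights from an impedance matrix, row-standardised.

    Zone j is a neighbour of zone i if i != j and impedance[i][j] <= cutoff.
    Rows with neighbours are scaled to sum to 1; isolated zones keep a zero row.
    """
    n = len(impedance)
    weights = []
    for i in range(n):
        row = [
            1.0 if i != j and impedance[i][j] <= cutoff else 0.0
            for j in range(n)
        ]
        total = sum(row)
        weights.append([w / total for w in row] if total > 0 else row)
    return weights


# Hypothetical network impedances in metres between four zones.
impedance = [
    [0.0, 1200.0, 2800.0, 9000.0],
    [1200.0, 0.0, 3500.0, 8000.0],
    [2800.0, 3500.0, 0.0, 7000.0],
    [9000.0, 8000.0, 7000.0, 0.0],
]
weights = cutoff_weights(impedance)
```

Zone 3 lies beyond the cut-off from every other zone and ends up without neighbours, a case that should be checked for when choosing the threshold.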
Investigating spatial dependencies
To examine the presence of spatial dependence, an analysis of Moran’s I for the model values and empirical values is made (Haining, 2003; Moran, 1950). This test (see Table 1) shows that both preferential model values and taxation values are subject to a rather strong spatial autocorrelation while the local weights are not.
Table 1. Indicators for spatial autocorrelation.
This finding indicates that spatial diagnostics need to be evaluated in the regression analysis, to make sure that all spatial autocorrelation is taken care of. The finding that local weights show hardly any spatial autocorrelation tells us that they cannot sufficiently explain the variation in the empirical property taxation values.
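For reference, Moran's I can be computed directly from its standard definition, I = (n / S0) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where S0 is the sum of all weights. The sketch below is a minimal illustration on a hypothetical chain of four zones, not the GeoDa computation used in the study; it omits the significance testing that accompanies the statistic in practice.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_sum = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between neighbouring zones.
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared = sum(d * d for d in deviations)
    return (n / weight_sum) * (cross / squared)


# Hypothetical chain of four zones: each zone neighbours the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
gradient = morans_i([1.0, 2.0, 3.0, 4.0], chain)      # similar values cluster
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)   # neighbours differ
```

A smooth gradient gives a positive value and an alternating pattern a negative one, which is the contrast between clustered and dispersed spatial patterns that the statistic is designed to detect.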
Ordinary least squares (OLS) with spatial diagnostics
An OLS with both spatial and non-spatial diagnostics is performed in order to know whether the dependent variable’s spatial autocorrelation is captured by the independent variables (which would mean that an ordinary OLS is sufficient). If not, the diagnostics are used as guidance for the next steps in order to take care of the spatial autocorrelation (Anselin, 1988). This results in a collection of diagnostics that need to be analysed:
Diagnostics for non-normal error distribution: Jarque–Bera (JB) test.
Diagnostics for heteroskedasticity: Breusch–Pagan and Koenker–Bassett tests (B–P and K–B).
Diagnostics for spatial autocorrelation: Lagrange multiplier (LM) tests and Moran’s I on the residuals.
Comparative indicators for model fitness and validity
To evaluate and compare models,
When spatial autocorrelation is present in the residuals, the observations are not independent of each other, and hence the regression model is not valid. This is investigated with the LM tests; if they are significant, a remedy such as a spatial lag or spatial error model is needed to handle the remaining spatial autocorrelation (Anselin, 1988). If the LM (or robust LM) test for the spatial error model is significant while the tests for the lag model are not, a spatial error model is probably appropriate, and vice versa. If both tests are significant, the regression analysis is not valid and there is no indication of any spatial model that can make it valid. In that case the model has to be respecified (Anselin and Rey, 2014). This procedure has been used in this study for guidance in the search for a good and valid model.
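The decision rule described above can be written compactly as a function of the two p-values (for the error and lag variants of the LM test, standard or robust). The function name and the 5% significance level are assumptions made for this sketch; the case where neither test is significant is mapped to plain OLS, in line with the preceding section.

```python
def choose_spatial_model(p_error, p_lag, alpha=0.05):
    """Pick a model specification from the LM test p-values.

    p_error: p-value of the LM (or robust LM) test for the spatial error model
    p_lag:   p-value of the LM (or robust LM) test for the spatial lag model
    alpha:   significance level (assumed here, not taken from the study)
    """
    error_significant = p_error < alpha
    lag_significant = p_lag < alpha
    if error_significant and lag_significant:
        return "respecify"      # No spatial model is indicated as valid.
    if error_significant:
        return "spatial error"
    if lag_significant:
        return "spatial lag"
    return "ols"                # No remaining spatial autocorrelation detected.


# Robust LM p-values reported for the preferential model in the Results section.
decision = choose_spatial_model(p_error=0.00, p_lag=0.83)
```

With the robust LM values reported for the preferential model (0.00 for error, 0.83 for lag), the rule selects the spatial error model.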
Software
For the data preparation, cleaning and aggregation, FME was used. The activity models were implemented in Python, using the packages OSMnx (Boeing, 2017) and NetworkX (Hagberg et al., 2008). The spatial statistical analysis was performed in GeoDa (Anselin et al., 2006).
Results
Model validity and fitness
For all models except the preferential models, all the LM tests are significant, which invalidates them due to untreated spatial autocorrelation. The local and industrial models are included only as controls, to confirm that it is actually the activity model prediction that is responsible for the good results. The other indicators of model fitness shown in Table 2 imply that the preferential model is the best choice, even before considering and applying the spatial error model.
Results from the spatial regression. A better fit is indicated by a lower Schwarz and a higher
For the preferential model, the robust version of the LM test for the error model was significant (0.00) while the robust version of the LM test for the lag model was not (0.83). This suggests that a spatial error model is the correct approach (Anselin and Rey, 2014). Therefore, only the preferential spatial error model is usable for inference and predictions, although its spatially clustered errors (Anselin, 1995) are hiding some unknown spatial factors (see Figure 3).

Figure 3. Preferential spatial error model: predictions (top left), empirical land value (top right) and local weights (bottom left) are normalised with regard to zone area. Spatial residuals (bottom right) show the remaining spatially autocorrelated error term.
Other statistical tests on the preferential spatial error model
The low multicollinearity condition number (12) indicates that there is no problematic multicollinearity among the explanatory variables. Values below 30 are usually considered unproblematic (Anselin and Rey, 2014).
The JB test is significant, which indicates a non-normal distribution of error terms. However, this test is less relevant, since this dataset is large (Anselin and Rey, 2014).
According to the B–P and K–B tests, there is significant heteroskedasticity in the model results. There can be multiple reasons for this, one possible cause being the aggregation of properties (Haining, 2003). The effects are modest in these specific models, since the standard errors are very low on their own. Heteroskedasticity is therefore not considered crucial for the conclusions of this study.
Sensitivity analysis
We have explored many variations of the key parameters, such as the preferentiality parameter
Discussion of results
Comparing the model versions
The eigenvector and monocentric models have decent performance; therefore, the interpretation of their results has been used as steps in the search for a valid model. The preferential spatial error model, besides being the only valid model, also performs well in absolute numbers with a pseudo
Remaining challenges
In this paper, we have not aimed to present a full predictive model. Some improvements for moving in that direction are as follows:
To reduce uncertainty in the regression coefficients, heteroskedasticity should be sufficiently taken care of. Some more parameter variations, as well as trying different levels of aggregation into zones, might give some clues on how to handle this problem.
The preferential spatial error model still contains unknown spatial variables that are handled as a spatial error term together with standard residuals. Understanding those errors can be helpful for further development of the model. Some ideas and suggestions for further investigation are as follows:
○ Different kinds of properties (e.g. commercial versus residential) might not be fully comparable in taxation terms.
○ Other transportation modes, such as pedestrian, bicycle and public transport, are not captured in the current car-oriented implementation of the model.
○ Truncation effects: this study only investigates areas within the Gothenburg municipality, although the city also acts as a regional centre for a larger surrounding region. In the preferential model, we have a parameter
Conclusions and ways forward
By using a theoretical concept of interaction-based centrality, we have demonstrated that it is possible to create an urban activity model with empirical validity, using only two data sources – road networks and property polygons. The empirical validation is based upon using land taxation values as a proxy for urban activity.
According to the comparative results from the spatial regression, local characteristics are far from enough to explain the geographical variation of land values. The activity intensity is also affected by the geographical ranking of the location: in the city and in the region. Including the distance to the city centre in a monocentric interaction model gives a seemingly better fit, but the spatial statistical tests show this model to be invalid for the geographical area that we study, indicating that a more elaborate model is warranted. With the introduction of our concept of preferential centrality, where initial concentrations of activity are assumed to ignite local feedback mechanisms that attract even more activity, we finally arrive at a valid regression model.
The preferential centrality model has several additional advantages compared to a monocentric approach. First, we avoid the requirement of having to manually identify the most central location. Instead the centrality model will endogenously determine central places and their relative importance. In a polycentric setting this is a crucial model feature. Second, in a planning context it can often be an important question in itself how the location and strength of urban centres are affected by planning interventions, such as new infrastructure. For example, the preferential model can be used to analyse the robustness of a city centre under the influence of suggested new road investments. Such an analysis is clearly not possible within a monocentric model framework.
Regarding data requirements, our approach is somewhat more demanding than a basic monocentric model, since travel times must be computed between all zones and not only to the predefined centre. The number of zones needed (i.e. the spatial resolution) depends on context, and further studies are needed to determine what levels of resolution are adequate for different planning applications.
Our current model implementation is technically complicated and requires different pieces of software. This is, however, not a fundamental property of the approach and we aim in future work to achieve a workflow within a single open source framework, to open up for broader testing and practical application.
Before using our modelling approach in a practical planning context, further validation is needed: both cross-sectional, by studying other and larger areas, and longitudinal, by investigating changes in urban activity over a time period where the road network also has changed. For the purpose of this validation, we cannot escape the need to use empirical activity data, such as taxation values or night light data. However, since our sensitivity analyses show that model outcomes are fairly robust, a validated preferential centrality model should be transferable to applications in different geographical settings, without any need for local economic or demographic data.
Supplemental Material
Supplemental Material for Preferential centrality – A new measure unifying urban activity, attraction and accessibility by Alexander Hellervik, Leonard Nilsson and Claes Andersson in EPB: Urban Analytics and City Science