Probit Models for Grouped-data Migration Flows: A Theoretical Note

In this theoretical note, we propose the GProbit model as an alternative to gravity models for estimating grouped-data flows. The model is based on random utility theory, which is consistent with the principle of population behavior. Instead of migrant counts, the dependent variable of the GProbit model of flows consists of a number of observed proportions. It explains the propensity to migrate from any origin to a destination, a relative concept that is not affected by the size effect. For this reason, it is expected to fit better and to suffer fewer problems of non-normality, as illustrated by an application to the internal migration flows of the Spanish regions.
Article History: Received: February 28 2019 / Revised: May 2


Introduction
In many areas of economics, choices made by individuals are costly to collect or inaccessible. However, analysts may have access to the choice data aggregated across groups of individuals in the form of counts or shares. Regression-based spatial interaction macro-models have frequently been used to estimate grouped-data choices. This is the case of many applications of gravity models to migration or other spatial interaction processes (LeSage and Fischer, 2010). In these models, the (aggregated) observations are treated as if they were single entities. They specify the dependent variable as mere (log-transformed) aggregations of individual data, which can produce, among other issues, severe problems of non-normality and heteroskedasticity, and can enhance spatial autocorrelation in the error terms. Moreover, a simple aggregation of individual choices does not necessarily lead to grouped or "herd" behavior (Schelling, 2006; Sen and Smith, 2012). Aggregation must result in models consistent with theory, which should be capable of identifying overall regularities in collective population behavior (Kanaroglou et al., 1996).
For this reason, we recommend following a different strand of the literature based on a choice-theoretic perspective. Although typically concerned with the identification of individual behavior, choice models have also been specified for grouped data, where observations no longer consist of single individuals but of sets of several persons who share similar characteristics (e.g. living in the same region). In these grouped-data choice models, the dependent variable consists of a number of observed proportions or relative frequencies (Gourieroux, 2000), and the models are estimated by a nonlinear weighted least squares method (Berkson, 1944, 1955, 1957; Amemiya, 1985). Grouped-data choice models can be easily generalized to spatial interaction models of migration or trade flows, as in Borjas (2006) and Aroca and Hewings (2002).
In this note, we briefly review the specification and estimation methods of the standard probit model for grouped-data flows.

From Individual to Grouped-data Choice Models
Random utility theory provides the framework for dealing with individuals' decisions (McFadden, 2001). Let $U_{od}$ be the utility that an individual gets from moving from region $o$ to region $d$, where $o$ is the origin region and $d$ represents any of the potential destination regions. An individual will move from region $o$ to $d$ if $U_{od} \geq U_{oo}$, that is, when the utility of moving ($U_{od}$) is at least as high for this individual as the utility of staying ($U_{oo}$). The utility function of moving ($U_{od}$) has a non-stochastic part ($V_{od}$) and a random error term ($\varepsilon_{od}$):

$$U_{od} = V_{od} + \varepsilon_{od} \qquad (1)$$

This model states that:

$$y = \begin{cases} 1 & \text{if } y^{*} \geq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $y^{*} = U_{od} - U_{oo}$ is a latent variable for which there is no direct measure, but only an indicator ($y$) that takes the value 1 if the individual moves or 0 if that individual decides to stay, conditional on a set of variables $x$ that explain the migration decision. In probabilistic terms:

$$\Pr(y = 1 \mid x) = \Pr(y^{*} \geq 0 \mid x) = F(x'\beta) \qquad (3)$$

where $F$ is the distribution function of the random term and $x'\beta$ is the non-stochastic part of the indirect utility function ($V_{od}$), which is generally assumed to be linear; the parameters $\beta$ can be estimated by maximum likelihood (ML).

Choice data can be aggregated across groups of individuals in the form of counts or shares. Unlike the individual setup presented above, here we use the behavior of the whole population. Grouped data are obtained by observing the response of the individuals belonging to the same region or 'group', provided that they share similar characteristics (e.g. spatial location, age, income class, etc.). In a theoretical model with no individual (spatial) interaction, adding up the independent probabilities for all the individuals who move from region $o$ to $d$ gives the probability that a generic individual of region $o$ ends up in $d$. This definition might change slightly depending upon the denominator of the share, which can be the total population in the origin region at the beginning of the period or the number of migrants departing from the origin region.$^{1}$
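The link between the individual decision rule and the aggregate share can be illustrated with a small simulation. This is only a sketch under stated assumptions: the utilities `v_move` and `v_stay` are hypothetical values, and each random term is assumed to be normal with variance 1/2, so that the difference of the two errors is standard normal and $F$ is the probit CDF.

```python
import random
from statistics import NormalDist


def simulate_move_share(v_move, v_stay, n_individuals, seed=0):
    """Share of simulated individuals for whom U_od >= U_oo."""
    rng = random.Random(seed)
    sd = 0.5 ** 0.5  # assumption: each error is N(0, 1/2)
    movers = 0
    for _ in range(n_individuals):
        u_move = v_move + rng.gauss(0.0, sd)  # U_od = V_od + eps_od
        u_stay = v_stay + rng.gauss(0.0, sd)  # U_oo = V_oo + eps_oo
        movers += u_move >= u_stay            # indicator y = 1 if y* >= 0
    return movers / n_individuals


v_move, v_stay = -0.6, 0.4  # hypothetical deterministic utilities
share = simulate_move_share(v_move, v_stay, n_individuals=20000)
theoretical = NormalDist().cdf(v_move - v_stay)  # Phi(V_od - V_oo)
print(round(share, 3), round(theoretical, 3))
```

With many simulated individuals, the observed share of movers should lie close to the theoretical probability, which is the idea behind treating regional shares as estimates of $\pi_{od}$.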
Assuming that each group is large, the law of large numbers implies that the observed proportion ($P$) is close to, or an estimate of, the population or theoretical proportion ($\pi$). Hence, we can treat this problem as a simple one of sampling from a Bernoulli population, in which the observed proportion is equal to the population proportion plus an error term ($\varepsilon_{od}$):

$$P_{od} = \pi_{od} + \varepsilon_{od} \qquad (4)$$

where the dependent variable consists of the $n$ observed proportions of people moving from an origin $o$ to a destination region $d$ ($M_{od}$) over the total group of migrants moving out from $o$ ($M_o$), that is, $P_{od} = M_{od}/M_o$. By the central limit theorem, the error term $\varepsilon_{od}$ is approximately normally distributed with $E(\varepsilon_{od}) = 0$ and $\mathrm{Var}(\varepsilon_{od}) = \pi_{od}(1 - \pi_{od})/M_o$, where $M_o$ is the total number of migrants in region $o$.
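As a rough numerical illustration of the sampling argument, the following sketch computes an observed share and a plug-in standard error. The counts are hypothetical, and replacing the unknown $\pi_{od}$ by the observed share in the variance formula is an assumption made here for illustration.

```python
from math import sqrt


def observed_share(m_od, m_o):
    """Return P_od = M_od / M_o and its approximate standard error."""
    p = m_od / m_o
    # Plug-in version of Var(eps_od) = pi_od * (1 - pi_od) / M_o,
    # using the observed share in place of the unknown pi_od.
    variance = p * (1.0 - p) / m_o
    return p, sqrt(variance)


# Hypothetical counts: 200 of the 10,000 movers from o go to d.
p_od, se_od = observed_share(m_od=200, m_o=10_000)
print(p_od, se_od)
```

The standard error shrinks as $M_o$ grows, which is why the approximation relies on each origin group being large.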

The Probit Model of Migration Flows
The population proportion can be expressed as a function of the indirect utility, $\pi_{od} = F(x'_{od}\beta)$, where $x_{od}$ is a vector gathering a set of $k$ factors that explain the migration decision and $\beta$ is the corresponding vector of parameters. One of the functional forms most frequently used in applications for $F$ is the probit model, which, by means of Slutsky's theorem on convergence in probability, can be linearized (Gourieroux, 2000, section 4.2). The cumulative distribution function (CDF) of the standard normal distribution is expressed as $\Phi(x'_{od}\beta)$. Since the CDF is strictly monotonic, it has an inverse, $Z_{od} = \Phi^{-1}(P_{od})$, which, by means of a Taylor series approximation around $\varepsilon_{od} = 0$ (that is, $P_{od} = \pi_{od}$) (Greene, 2003, section 21.4.6), leads to the probit model for grouped-data flows or "GProbit model of flows":

$$Z_{od} = \Phi^{-1}(P_{od}) = x'_{od}\beta + u_{od} \qquad (5)$$

where $x_{od} = [X_d \; X_o]$, with $X_o$ and $X_d$ the characteristics of spatial units $o$ and $d$, respectively, and $u_{od}$ is the error term.
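The linearization can be made concrete with a short sketch. A first-order expansion gives $\Phi^{-1}(\pi_{od} + \varepsilon_{od}) \approx \Phi^{-1}(\pi_{od}) + \varepsilon_{od}/\phi(\Phi^{-1}(\pi_{od}))$, so the transformed error is approximately $\varepsilon_{od}$ divided by the normal density. The function name `probit_transform` and the input values are hypothetical, and the observed share is used in place of $\pi_{od}$ as an illustrative assumption.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def probit_transform(p_od, m_o):
    """Return Z_od = Phi^{-1}(P_od) and the delta-method variance of u_od."""
    z = STD_NORMAL.inv_cdf(p_od)  # only defined for 0 < p_od < 1
    density = STD_NORMAL.pdf(z)   # phi evaluated at Z_od
    # Var(u_od) ~ P (1 - P) / (M_o * phi(Z)^2), with P standing in for pi_od.
    variance = p_od * (1.0 - p_od) / (m_o * density ** 2)
    return z, variance


z_half, var_half = probit_transform(0.5, 100)        # hypothetical inputs
z_small, var_small = probit_transform(0.02, 10_000)  # a small flow share
print(z_half, var_half, z_small, var_small)
```

Shares of exactly 0 or 1 have no finite probit transform, so in practice empty or exhaustive flows would need separate handling.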
Since the number of migrants moving from each origin region ($M_o$) is large, the random variable of the GProbit model of flows, $u_{od}$, is approximately normally distributed with $E[u_{od} \mid x_{od}] = 0$ and non-constant variance defined as:

$$\mathrm{Var}(u_{od} \mid x_{od}) = \frac{\Phi(x'_{od}\beta)\,[1 - \Phi(x'_{od}\beta)]}{M_o\,[\phi(x'_{od}\beta)]^{2}} \qquad (6)$$

where $\phi$ is the probability density function (PDF) of the standard normal distribution. Therefore, the GProbit model of flows is heteroskedastic by construction, due to the different values adopted by the denominator of the ratio $P_{od} = M_{od}/M_o$, which is the flow rate of people living in an origin $o$ who move to any destination $d$, including intra-regional flows, $M_{oo}$ (for $d = o$).

Berkson (1944, 1955, 1957) proposed a simpler way to estimate grouped-data choice models by nonlinear Weighted Least Squares (WLS), which is a variation, for qualitative response models, of the MCSE or MIN $\chi^{2}$ test of goodness of fit proposed in the literature (Amemiya, 1985, sections 9.2.5 and 9.2.6). This method, which can also be applied to the GProbit model of flows, consists of finding the parameter values that minimize a measure of the distance between the observed proportions ($P_{od}$) and the theoretical ones ($\pi_{od}$). It is solved in a two-step procedure because the weights are functions of the unknown parameters:

1. In the first step, the $\beta$ parameters are estimated by Ordinary Least Squares (OLS), which produces consistent but inefficient estimates. This step provides the estimates of the dependent variable, $\widehat{\Phi^{-1}(P_{od})} = \hat{Z}_{od}$, and of the error variances, $\hat{\sigma}^{2}_{od}$:

$$\hat{\sigma}^{2}_{od} = \frac{\Phi(\hat{Z}_{od})\,[1 - \Phi(\hat{Z}_{od})]}{M_o\,[\phi(\hat{Z}_{od})]^{2}} \qquad (7)$$

2. In the second step, the variances $\hat{\sigma}^{2}_{od}$ based on the first-step estimates are used as weights for the WLS. The MIN $\chi^{2}$ estimator $\hat{\beta}$ is defined as:

$$\hat{\beta} = \left( \sum_{o,d} \hat{\sigma}^{-2}_{od}\, x_{od} x'_{od} \right)^{-1} \sum_{o,d} \hat{\sigma}^{-2}_{od}\, x_{od}\, \Phi^{-1}(P_{od}) \qquad (8)$$

Hence, Berkson's probit model of grouped-data flows can be expressed as follows:

$$Z^{*}_{od} = x^{*\prime}_{od}\beta + u^{*}_{od} \qquad (9)$$

where $Z^{*}_{od} = Z_{od}/\sigma_{od}$, $x^{*}_{od} = x_{od}/\sigma_{od}$, and $u^{*}_{od} = u_{od}/\sigma_{od}$. However, since other forms of heteroskedasticity are usually present in the error terms of spatial cross-section models, e.g. spatial group-wise heteroskedasticity (Chasco et al., 2018), the basic GProbit model of expression (5) should be estimated by OLS with robust inference on the parameters (Anselin and Rey, 2014). Models (5) and (9) are linear models that can be estimated by standard methods such as Ordinary Least Squares (OLS) or Maximum Likelihood (ML).
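The two-step procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `berkson_two_step`, the synthetic design matrix, and the "true" coefficients are hypothetical, and the weights are assumed to be built from the first-step fitted values. The synthetic shares are generated without sampling noise, so both steps should recover the assumed coefficients.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def berkson_two_step(P, X, M_o):
    """Two-step (OLS, then WLS) estimation of Z_od = x_od' beta + u_od."""
    P = np.asarray(P, dtype=float)
    X = np.asarray(X, dtype=float)
    M_o = np.asarray(M_o, dtype=float)
    # Dependent variable: Z_od = Phi^{-1}(P_od); needs 0 < P_od < 1.
    Z = np.array([_STD_NORMAL.inv_cdf(float(p)) for p in P])
    # Step 1: OLS gives consistent but inefficient estimates.
    beta_ols, *_ = np.linalg.lstsq(X, Z, rcond=None)
    Z_hat = X @ beta_ols
    cdf = np.array([_STD_NORMAL.cdf(float(z)) for z in Z_hat])
    pdf = np.array([_STD_NORMAL.pdf(float(z)) for z in Z_hat])
    sigma2 = cdf * (1.0 - cdf) / (M_o * pdf ** 2)  # estimated variances
    # Step 2: divide each observation by sigma_od and re-run least squares.
    w = 1.0 / np.sqrt(sigma2)
    beta_wls, *_ = np.linalg.lstsq(X * w[:, None], Z * w, rcond=None)
    return beta_ols, beta_wls, sigma2


# Synthetic, noise-free example with hypothetical coefficients.
x = np.linspace(-1.0, 1.0, 12)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([-1.5, 0.4])
P = np.array([_STD_NORMAL.cdf(float(v)) for v in X @ beta_true])
M_o = np.full(len(P), 5000.0)
beta_ols, beta_wls, sigma2 = berkson_two_step(P, X, M_o)
print(beta_ols, beta_wls)
```

With real data the two estimates would differ, and the second step would give less weight to flows with a large estimated variance.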

An Empirical Illustration for Interregional Flows in Spain
We illustrate the performance of a GProbit model to estimate internal migration flows for the 17 NUTS 2 regions ("Autonomous Communities") in Spain, taken from the EVR register ("Estadística de Variaciones Residenciales") of the Spanish National Statistics Office (INE). Flows were constructed as the rate of emigrants moving from an origin region $o$ to a destination region $d$ over the total number of people from region $o$ who changed their residence during this period (including the intra-regional movements). We compare the performance and results of this model with those of the gravity model, using the conventional log transformation of flows for the dependent variable.
The distance matrix was formed using the log-transformed distance between the capital cities of the Spanish regions. We use six additional explanatory variables, which are the most significant from a set of more than 60 classical 'push' and 'pull' factors. They are: population, R&D expenditure per capita, housing price, average altitude, annual maximum temperature and annual atmospheric precipitation. All of them were defined as the ratio of the destination over the origin (D/O) values. We would expect a priori that flows are directly proportional to the D/O ratios of population and R&D expenditure and inversely proportional to the D/O ratios of housing price, altitude, maximum temperature and atmospheric precipitation. The data were ordered according to the origin-centric scheme described by Pace (2009) and Wang (2011).
Ordinary least-squares (OLS) estimates are shown in Table 1. Regressions (1) and (2) model the interregional flows as specified by the GProbit and gravity models, that is, as observed proportions of people moving from $o$ to $d$, $M_{od}/M_o$, and as migrant counts, $M_{od}$, respectively. Since the total group of migrants moving out from an origin region is the sum of the interregional plus intra-regional flows departing from this region, $M_o = \sum_{d \neq o} M_{od} + M_{oo}$, the GProbit model (regression 1) allows estimating intra-regional migration rates directly as $M_{oo}/M_o = 1 - \sum_{d=1}^{n-1} M_{od}/M_o$. Hence, these proportions can be interpreted as a probability or 'propensity to migrate'. As regards the gravity model, the intra-regional flows ($M_{oo}$) must be estimated in a separate model with different explanatory variables, due to the different nature of inter- and intra-regional flows (LeSage and Pace, 2008). They are presented in Table 1, regression (3).
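The accounting identity behind the intra-regional rate can be checked on a toy example. The 3-region flow matrix below is entirely hypothetical and is only meant to show how the shares $M_{od}/M_o$ in each origin row add up to one.

```python
import numpy as np

# Hypothetical flow matrix: rows are origins, columns are destinations,
# and the diagonal holds the intra-regional moves M_oo.
M = np.array([[900.0, 60.0, 40.0],
              [30.0, 450.0, 20.0],
              [50.0, 70.0, 880.0]])

M_o = M.sum(axis=1)                 # total movers departing from each origin
P = M / M_o[:, None]                # P_od = M_od / M_o
inter = P.sum(axis=1) - np.diag(P)  # sum of interregional shares per origin
intra = 1.0 - inter                 # implied intra-regional share M_oo / M_o
print(intra)
```

Because each row of shares sums to one, the intra-regional share follows from the interregional ones without a separate model, which is the property the text attributes to the GProbit specification.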
All the coefficients are highly significant. However, the adjusted $R^2$ takes a very low value, particularly for the gravity model estimation, which is in line with previous analyses in the literature. Spanish interregional migration has long been resistant to traditional economic explanations: none of the considerable research on Spanish internal migration finds clear significance in even core variables of income and employment (Mulhern and Watson, 2009). The strong rigidity of the Spanish labor market, centrally controlled by the trade unions, and very high national unemployment discourage internal migration (Bover and Velilla, 1999) and instead promote migration to other countries.
Besides the adjusted $R^2$ of the OLS estimations, Table 1 also reports some traditional measures of prediction accuracy for the estimated proportions or propensity to migrate, $\hat{P}_{od} = \hat{M}_{od}/M_o$. First, we show the results of a bias indicator (RBIAS), which is the absolute difference between the observed and predicted values, divided by the predicted values. Positive values are indicative of overestimation, with zero representing perfect unbiasedness. Both models get positive values, though the gravity model has an RBIAS outcome (4.04) more than five times higher than the GProbit model (0.79). Second, the coefficient of variation (CV) is a standardized measure of dispersion that is defined as the ratio of the standard deviation to the mean. In this context, it can be interpreted as a measure of the efficiency of the estimates and of the homoskedasticity of the prediction errors. Hence, a completely efficient estimator will get a CV value of zero. As shown in Table 1, the GProbit estimation has a much lower CV (1.16) than the gravity model (311.03), a ratio of roughly 270. Hence, the error terms are more homoskedastic for the GProbit than for the gravity model estimation. Third, the relative root mean square error (RRMSE) constitutes a balance between bias and variability. It is the mean value of the square root of the squared difference between observed and predicted values, divided by the predicted values. Once again, zero is the best value and the GProbit model performs better (0.16) than the gravity model (0.35).

Table 1. Estimation results for the interregional migration models (columns: GProbit model; Gravity model).

Figure 1 illustrates the results obtained by the prediction accuracy measures. The line graph with the real and estimated values of the flow rates shows that both models perform better in estimating flow rates closer to the average. However, they tend to overestimate lower rates, while the higher ones are mainly underestimated. In fact, both models fail in estimating propensities to migrate above the average, particularly the gravity model. Additionally, the box plots for the difference between real and estimated migration rates show that this variable is closer to normality for the GProbit estimation, since its mean and median are closer to zero and it has fewer upper outliers (more homoskedasticity) than under the gravity model.

Conclusions
The intent of this theoretical note is to present an alternative to gravity models for modeling grouped-data flows of any kind (migration, transport, networks, etc.) based on random utility theory. Logit and probit models for grouped data are consistent with the theory of population behavior. Additionally, they have fewer problems of non-normality and heteroskedasticity, mainly because the dependent variable consists of a number of observed proportions (people moving from an origin to a destination region over the total group of migrants moving out from this origin) instead of migrant counts (as is the case in the standard gravity models).
That is, the GProbit model of flows explains the propensity to migrate from any origin to a destination, a relative concept that is not affected by the size effect. Since it is a linear model, it can be expanded, with some adjustments, to include spatial autocorrelation$^{2}$ and heterogeneity effects$^{3}$. This is left for future work.
2 Spatial autocorrelation arises when the aggregated flows from an origin to a destination are not independent of each other. As in the conventional spatial interaction model, the spatial GProbit model can adopt different specifications. For example, the spatial lag or SAR GProbit model can be expressed as follows:

$$Z_{od} = \rho_d W_d Z_{od} + \rho_o W_o Z_{od} + \rho_\omega W_\omega Z_{od} + \alpha\iota_N + X_d\beta_d + X_o\beta_o + \lambda D + \varepsilon_{od}$$

and the spatial error GProbit model is:

$$Z_{od} = \alpha\iota_N + X_d\beta_d + X_o\beta_o + \lambda D + \rho_d W_d u_{od} + \rho_o W_o u_{od} + \rho_\omega W_\omega u_{od} + \varepsilon_{od}$$

for $W_d = I_n \otimes W$, $W_o = W \otimes I_n$, $W_\omega = W \otimes W$, where $n$ is the number of regions, $W$ is the conventional (row-normalized) $n$-by-$n$ spatial weight matrix, and $\rho_o$, $\rho_d$, $\rho_\omega$ are the spatial autoregressive parameters (LeSage and Pace, 2008).
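The three flow weight matrices in the footnote can be built with Kronecker products. The sketch below follows the definitions as written there; the helper name `flow_weight_matrices` and the 3-region contiguity structure are hypothetical, and which matrix captures origin or destination dependence in practice depends on how the flow vector is stacked.

```python
import numpy as np


def flow_weight_matrices(W):
    """Build W_d, W_o and W_omega from an n-by-n weight matrix W."""
    n = W.shape[0]
    I_n = np.eye(n)
    W_d = np.kron(I_n, W)  # W_d = I_n (x) W
    W_o = np.kron(W, I_n)  # W_o = W (x) I_n
    W_w = np.kron(W, W)    # W_omega = W (x) W
    return W_d, W_o, W_w


# Hypothetical contiguity for 3 regions on a line: 1-2 and 2-3 are neighbors.
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W = C / C.sum(axis=1, keepdims=True)  # row-normalization
W_d, W_o, W_w = flow_weight_matrices(W)
print(W_d.shape, W_o.shape, W_w.shape)
```

Each matrix is $n^2$-by-$n^2$ and remains row-normalized, and $W_\omega$ equals the product $W_o W_d$ by the mixed-product property of the Kronecker product.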