Basic issues in spatial statistics

Since observations in spatial data are as unlikely to be independent as observations in a time series, it is perhaps surprising that more use has not been made of this source of information. With an adequate choice of explanatory variables, spatial dependence may be readily drawn into a model, and cease to be a nuisance. Spatial dependence is not necessarily just a nuisance, however, and may help us to capture important facets of the realities of economic processes (cf. Hendry and Mizon, 1978). The literature on spatial statistics is substantial (see Cliff and Ord, 1973, 1981; Ripley, 1981; Upton and Fingleton, 1985; Griffith, 1988; Anselin, 1988; Haining, 1990; and more recently Cressie, 1993, and Bailey and Gatrell, 1995). Here we give a brief introduction to some of the key issues.

In their now classic survey of problems in analysing spatial data, Duncan, Cuzzort, and Duncan express the focus of this study in the following way:

Interest in areal distributions merges more or less imperceptibly into a concern with the 'spatial structure' of communities, economies, and societies. At the present time it is difficult to appreciate the magnitude of effort which was required to establish the concept of an economy or a society as a territorially organised system (1961, p. 16).

They go on to identify four perspectives on spatial differentiation, which they describe as: (a) chorographic interest in areal differentiation; (b) interest in areal distribution; (c) interest in spatial structure; and (d) concern with the explanation of areal variation (p. 19). They deserve credit for taking up the problems which spatial data pose for analysts of society, and of change in society. Since spatial data neither are the outcome of controlled experiments nor result from random samples, it is clear that, beyond mapping and informal inference from patterns, specific spatial statistical methods are required.

Data from which statistical inferences are to be drawn ought to fulfil a number of criteria, key among which is that the observations are independent of each other. The founders of statistics were keenly aware of the difficulties of making inferences from spatial and time series data. Student described the problem in detail in a paper published as early as 1914, while the way in which Galton posed the problem is discussed below. In time series, we know that later observations may depend on earlier ones; this dependence is termed autocorrelation. First order autocorrelation is between the current observation and its immediate predecessor. The ordering of the data is clear, although the choice of temporal units does make a difference: hourly, daily, weekly or monthly data, for example, may display different forms of dependence. In the time series case, it is usual to speak of a temporal data generation process. This can be thought of as an unobservable curve, generated both in relation to its own previous values and in relation to the current and possibly previous values of other variables. If we observe it at discrete and regularly spaced intervals, we get time series data, from which we can try to estimate the underlying, unobservable curve.
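First order autocorrelation can be illustrated with a small simulation. The sketch below is only an illustration under assumptions of our own: the autoregressive form, the coefficient values, the series length, and the function names are all hypothetical choices and are not taken from the literature cited here. It generates one series in which each value depends on its immediate predecessor and one in which it does not, and computes the sample first order autocorrelation of each.

```python
import random


def simulate_ar1(n, phi, seed=0):
    """Simulate y[t] = phi * y[t-1] + e[t], with standard normal e[t].

    The first-order autoregressive form is an assumed, illustrative
    data generation process; phi = 0 gives independent observations.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y


def lag1_autocorrelation(y):
    """Sample correlation between each observation and its predecessor."""
    n = len(y)
    mean = sum(y) / n
    num = sum((y[t] - mean) * (y[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in y)
    return num / den


dependent = simulate_ar1(2000, phi=0.7)    # later values depend on earlier ones
independent = simulate_ar1(2000, phi=0.0)  # no temporal dependence

print("dependent series:  ", round(lag1_autocorrelation(dependent), 2))
print("independent series:", round(lag1_autocorrelation(independent), 2))
```

With a series of this length, the first coefficient should lie close to the assumed value of phi and the second close to zero, which is the contrast that tests assuming independence ignore.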

Spatial data may be viewed as observations taken at discrete points on a surface, rather than a curve, since we are in two dimensions, not just the single dimension of time series. It is in this sense that we can speak of underlying spatial data generation processes, which for a variety of reasons may not be directly observable, and about which we would like to make inferences. Using political behaviour as an example, we could seek to establish the identities of voters, hoping to link their ballot papers to their other characteristics, such as place of residence or birth, sex, age, occupation, etc. An exit poll could be used to achieve this, but then the focus would be on the individual level, rather than on the local, territorial, or ecological links. While we have to accept that we cannot make inferences about individual behaviour from ecological data (Langbein and Lichtman 1978), it is often both necessary and relevant to study spatial data generating processes at the aggregate level. Aggregation in itself should not be avoided, not least because it often returns in one form or another as classifications used as explanatory variables related to cleavages, be it socio-occupational class, organisation, or some other structuring variable above the level of the individual.

Spatial aggregation brings with it a number of specific problems. The boundary effects at the edges of the study area are often impossible to control for. If we are concerned with reconstructing unobservable surfaces, then we are faced with the hypothetical question of whether the surface extends outside the study area, even though we have no observations there (Haggett 1981). If we had possessed data from beyond the study area, would they have altered our inferences about the shape of the surface at the edges of our study area?

In addition, the often arbitrary nature of the assignment of observation units to aggregates, known as the modifiable areal unit problem, has to be recognised. It has been demonstrated that there is a relationship between the coherence or simplicity of the process generating the surface we are trying to make inferences about, and the way in which the observations are aggregated (Openshaw and Taylor 1979). Openshaw and Taylor separate the scale problem, where results change from less aggregated to more aggregated spatial units, from the aggregation problem caused by arbitrary choices made in zoning, that is, in assigning basic spatial units to contiguous zones. Zones in turn imply the contiguity of member units, while groups require no contiguity. They were able to demonstrate that the interaction between spatial autocorrelation and the zoning procedure directly affected resulting statistics (1979, p. 142). It is quite clear that the results of analysis are dependent on the particular lattice of areal unit boundaries chosen, and that different results may be yielded by analyses using different boundaries. For this reason, units of observation may be termed zones, to show that they have been subject to a process of aggregation from basic spatial units for which data may often not be available.
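The scale and aggregation problems can be made concrete with a toy example. The sketch below is not a reproduction of Openshaw and Taylor's experiments: the grid size, the north-south trend, the noise level, and all function names are assumptions introduced purely for illustration. Two variables share a trend across a square grid of basic spatial units, and their correlation is computed for the basic units and for three different zonings of the same data.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


def aggregate(grid, block_rows, block_cols):
    """Average basic units into contiguous rectangular zones.

    Assumes the grid dimensions are exact multiples of the block size.
    """
    zones = []
    for r0 in range(0, len(grid), block_rows):
        for c0 in range(0, len(grid[0]), block_cols):
            cells = [grid[r][c]
                     for r in range(r0, r0 + block_rows)
                     for c in range(c0, c0 + block_cols)]
            zones.append(sum(cells) / len(cells))
    return zones


def make_surfaces(size=16, noise_sd=4.0, seed=1):
    """Two grids sharing a north-south trend, each with independent noise.

    The trend and noise level are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    x = [[row + rng.gauss(0.0, noise_sd) for _ in range(size)]
         for row in range(size)]
    y = [[row + rng.gauss(0.0, noise_sd) for _ in range(size)]
         for row in range(size)]
    return x, y


def correlation_by_zoning(x, y, block_rows, block_cols):
    """Correlation between x and y after aggregating both to the same zones."""
    return pearson(aggregate(x, block_rows, block_cols),
                   aggregate(y, block_rows, block_cols))


x, y = make_surfaces()
r_units = correlation_by_zoning(x, y, 1, 1)    # 256 basic spatial units
r_rows = correlation_by_zoning(x, y, 1, 16)    # 16 zones: east-west strips
r_cols = correlation_by_zoning(x, y, 16, 1)    # 16 zones: north-south strips
r_blocks = correlation_by_zoning(x, y, 4, 4)   # 16 zones: square blocks

for label, r in [("basic units", r_units), ("east-west strips", r_rows),
                 ("north-south strips", r_cols), ("square blocks", r_blocks)]:
    print(f"{label:20s} r = {r:.2f}")
```

The three zonings each contain sixteen zones, so any differences among them arise from where the boundaries are drawn rather than from the level of aggregation. In this construction, strips that run across the trend average away the noise and preserve the trend, while strips that run along it average away the trend itself.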

Finally, the non-stationarity of variance across the study area is a problem analogous to that faced in many studies using statistical inference. We recall that regression, for example, assumes that the variance of the error term should be constant, and not vary with the independent variable. In the time series case, we can say that the series is stationary if it has a constant mean, and fluctuates about that mean with a constant variance. The mean may of course be a residual after the removal of estimated structural features of the curve underlying the observations. In order to make inferences about the curve, it is important that the variance about the estimate should not vary in time. In the same way, with spatial data we should be aware of problems that arise in inference if variance about the estimated surface is not constant over the whole study area.

Haining (1990, pp. 22-26) provides a useful discussion of many of the issues involved in inferring from spatial data. If it is possible that observations being treated as independent in fact derive from a shared ancestor, then they will not contribute separate degrees of freedom to the formal test used for inference, or to the judgement involved in the drawing of informal conclusions. Further light is thrown on the difficulties involved in reaching substantive conclusions by Haining (1991), in a discussion of the Clifford-Richardson adjustment of the "effective" sample size for bivariate correlation. There is a clear link between the method suggested by Haining (1991, p. 215) for the calculation of the relevant adjustment, and the family of distance statistics summarised by Getis and Ord (1996). Both the adjustment method and distance statistics rely on the explication of the correlation structure at varying distances.

Summing up, we are often faced by non-experimental data for sites or zones, which we would like to analyse. Abstracting from zones to simplify the argument, we are in an inherently multivariate situation, where each site stands in a relationship to every other one. We are faced with a set of probably non-independent random variables $\{Y(s_i),\ i = 1, \ldots, n\}$, commonly referred to as a spatial stochastic process, where $s_i$ are the point location coordinates. A typical data set then consists of observed $\{y(s_i),\ i = 1, \ldots, n\}$, and is referred to as a realization of the spatial process. It is only a single observation from the joint probability distribution of the random variables $\{Y(s_i)\}$, from which little can be gleaned about the relationships between these sites, even given that we accept that they are reasonably representative in some sense (Bailey and Gatrell, 1995, pp. 24-28).
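The sense in which a data set is a single realization can be sketched by simulation. Everything specific in the example below is an assumption made for illustration and does not come from the sources cited: the process is taken to be Gaussian with zero mean, the covariance between two sites is taken to decay exponentially with the distance between them, and the site locations, parameter values, and function names are hypothetical.

```python
import math
import random


def exponential_covariance(sites, variance=1.0, scale=0.3):
    """Covariance matrix with C(d) = variance * exp(-d / scale).

    The exponential form and its parameters are illustrative assumptions.
    """
    return [[variance * math.exp(-math.dist(p, q) / scale) for q in sites]
            for p in sites]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(a[i][i] - s)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def draw_realization(lower, rng):
    """One joint draw of all site values: y = L z, z independent N(0, 1)."""
    n = len(lower)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


rng = random.Random(42)
sites = [(rng.random(), rng.random()) for _ in range(25)]  # point locations
lower = cholesky(exponential_covariance(sites))

# In practice only one draw like this is observed: one value per site.
y_observed = draw_realization(lower, rng)

# A simulation can produce further draws; real data cannot.
y_other = draw_realization(lower, rng)

for site, value in list(zip(sites, y_observed))[:5]:
    print(f"site ({site[0]:.2f}, {site[1]:.2f})  observed value {value:6.2f}")
```

In the simulation the covariance matrix is known and further realizations are free, whereas the analyst holds only the equivalent of `y_observed` and must recover the dependence structure from that one draw, which is why additional assumptions about the process are needed.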

On this basis, we will now proceed to review the component areas of spatial statistics, dealing in turn with point pattern analysis, geostatistics, and the analysis of lattice data.



Roger Bivand
Fri Mar 5 08:30:34 CET 1999