LibGuides: GIS Services: Descriptive Analysis

Spatial Analysis

Statistics has several general components: descriptive statistics, which describes data in terms of counts, central tendency and dispersion; inferential statistics, which makes predictions about trends based on existing data; and other analytic techniques, such as regression, kernel density estimation, cluster analysis and hot spot analysis, which identify relationships between spatial phenomena.

Central Tendency

One of the basic concepts in descriptive statistics is the central tendency which refers to the most typical or most characteristic value in a dataset. There are three measures for central tendency: mean, median and mode.

Mean

The most widely used of these is the average or, more formally, the arithmetic mean. The mean is calculated by adding up all the values in a dataset and then dividing by the total number of observations.

Example:

2, 3, 5, 5, 7, 8, 10, 17, 23, 34, 100

Sum of observations = 214

Number of observations = 11

Mean is 19.45

So in this example we have 11 observations that add up to 214, and the mean value is 214 / 11, or about 19.45.
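The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch of the calculation described above; the variable names are chosen here for illustration.

```python
# The 11 observations from the example above.
values = [2, 3, 5, 5, 7, 8, 10, 17, 23, 34, 100]

total = sum(values)   # sum of observations
count = len(values)   # number of observations
mean = total / count  # arithmetic mean

print(total, count, round(mean, 2))  # 214 11 19.45
```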

Median

The next measure of central tendency is the median, which is the middle value in a dataset once the values are sorted. Finding it is simplest when you have an odd number of observations.

2, 3, 5, 5, 7, 8, 10, 17, 23, 34, 100

So in our dataset the median value is 8, because there are 5 values before it and 5 values after it, making it the middle value.

When you have an even number of observations the median is an average of the middle two values.

2, 3, 5, 5, 7, 8, 10, 17, 23, 34

Median is 7.5

In this example there is an even number of observations, and the middle two are 7 and 8, with 4 values before them and 4 values after them. The average of the middle two values is 7.5.
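Both cases can be handled by one small function. The sketch below is our own illustration of the rule described above (the function name `median` is chosen here, although Python's standard `statistics` module offers an equivalent): sort the values, then take the middle one or the average of the middle two.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: a single middle value exists.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [2, 3, 5, 5, 7, 8, 10, 17, 23, 34, 100]
even_set = [2, 3, 5, 5, 7, 8, 10, 17, 23, 34]

print(median(odd_set))   # 8
print(median(even_set))  # 7.5
```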

Mode

The third measure of central tendency is the mode, which is the value with the highest frequency of occurrence. In our set of observations the value 5 occurs twice and all other values occur once, making 5 the mode.

2, 3, 5, 5, 7, 8, 10, 17, 23, 34

If all values occur only once, then there is simply no mode.

On the other hand, if several values occur more than once with the same highest frequency, then the dataset is said to be multi-modal.
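The three situations (one mode, no mode, several modes) can be sketched with a frequency count. The function name `modes` and the two small test lists are made up for illustration; returning an empty list when every value occurs once is a convention chosen here to match the text.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs only once, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 5, 5, 7, 8, 10, 17, 23, 34]))  # [5]
print(modes([1, 2, 3]))                           # []
print(modes([1, 1, 2, 2, 3]))                     # [1, 2]
```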

By comparing the three measures we can gain some insight into the character of the dataset. In our first example the mean (19.45) is higher than the median (8) because the mean is susceptible to the influence of outliers. An outlier is a value that stands much higher or much lower than the other values. For example, if we have income data for 100 people in a census tract and 99 of them make $40,000 a year while the last one makes $1 million, then the average will be pulled toward the million. In a case like this the median or the mode would be a more appropriate measure of central tendency.

When the mean and median are close together or identical, the values in the dataset are distributed roughly symmetrically, meaning there are about the same number of observations on either side of the central tendency. If the mean is higher or lower than the median, then the dataset is said to be skewed in one direction or the other.

The mode is most useful for nominal data, that is, non-quantitative data. If we did a survey asking people their favorite color and most people reported green, then green would be the mode. Because it is mainly useful for nominal data, the mode is not widely used in spatial analyses, which rely more on explicitly numeric data.
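The income example can be worked through directly. This short sketch builds the hypothetical census tract described above (99 incomes of $40,000 and one of $1,000,000) and compares the three measures.

```python
from statistics import median, mode

# Hypothetical tract: 99 people earn $40,000 and one person earns $1,000,000.
incomes = [40_000] * 99 + [1_000_000]

mean_income = sum(incomes) / len(incomes)
median_income = median(incomes)
mode_income = mode(incomes)

print(mean_income)    # 49600.0
print(median_income)  # 40000.0
print(mode_income)    # 40000
```

A single outlier raises the mean by $9,600, while the median and the mode stay at the income that 99 of the 100 people actually earn.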

Dispersion

Dispersion describes how far the values are from the measure of central tendency, and the most common measure of dispersion is the standard deviation.

There is another measure of dispersion that is used with the median called the interquartile range and we will take a look at that in the sections on symbology.

When the standard deviation is small, the data are close to the central tendency, meaning that the values in the dataset don’t vary too much. If the standard deviation is large, the data are spread far from the mean and the dataset has widely varying values. The farther an observation is from the mean, the more standard deviations it lies away from it. For data that are approximately normally distributed, about 68 percent of values should fall within one standard deviation of the mean, about 95 percent within two standard deviations, and about 99.7 percent within three standard deviations.

In the example below we have 10 observations. In the first column the values range from 1 to 8 and have a mean of 4.1. Because the values don’t venture far from the mean, we see a lower standard deviation, which means that the data are more clustered around the mean. In the second column the values range from 1 to 84, with a mean of 31.6 and a standard deviation of 27.5.
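The percentages above can be checked on simulated data. The sketch below is an illustration only: it assumes normally distributed values (generated here with an arbitrary mean of 50 and standard deviation of 10), uses the population standard deviation, and counts how many observations fall within one, two and three standard deviations of the mean. Real datasets that are skewed will not follow these shares as closely.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so repeated runs use the same simulated sample
sample = [random.gauss(50, 10) for _ in range(50_000)]

center = fmean(sample)   # mean of the simulated values
spread = pstdev(sample)  # population standard deviation

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(v - center) <= k * spread for v in sample) / len(sample)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```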

Centroids

There are a few other concepts that we have to explore before beginning descriptive spatial statistics. The first is the concept of the centroid.

The centroid refers to the mean center of a polygon feature. An analogous concept is “center of gravity.” Some statistical techniques need a single point to represent each feature, and because polygons are two-dimensional it is difficult to decide which point to use. If we want to measure the distance between two block groups, we could pick the shortest distance between the two polygon boundaries, the distance between their furthermost boundaries, or any other combination. By using the centroid we can establish an average location for each polygon and use it to measure interactions between polygons and other characteristics of polygons.

The image above illustrates the problems encountered when trying to measure the distance between two polygon features. The thin brown lines show possible measurements between the shortest and longest distances between two polygons, while the red line shows the distance between the average locations of the two polygons.

The image above shows the centroid for each block group. The detail to the right, from near the Sparrows Point area, shows a block group whose centroid is located outside of its area due to its unusual shape. Because the block group is arc-shaped, the mean location is actually just outside of it.
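A centroid falling outside its own polygon can be reproduced with a small sketch. The code below is not the GIS software's implementation; it uses the standard area-weighted centroid formula for a simple polygon and a basic ray-casting test, applied to a made-up U-shaped polygon that stands in for the arc-shaped block group.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        area2 += cross              # accumulates twice the signed area
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    return cx / (3 * area2), cy / (3 * area2)

def contains(vertices, point):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical U-shaped polygon: a 6 x 6 square with a notch cut into the top.
u_shape = [(0, 0), (6, 0), (6, 6), (4, 6), (4, 2), (2, 2), (2, 6), (0, 6)]

center = polygon_centroid(u_shape)
print(center)                     # roughly (3.0, 2.71), inside the notch
print(contains(u_shape, center))  # False
```

Some GIS packages offer an option to force a representative point inside the polygon for cases like this.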

Weighting

Another concept is weighting. Many statistical applications include a field for weighting. The mean of a set of features could be the geographic center based only on the locations of the features, or, if you add a weight field, the mean location might change based on the magnitude and location of the weighted values.

Here we can see the mean central block group in Baltimore shown in blue. However, when we weight each block group by population density, the mean center changes to reflect the influence of the different densities in each block group. The weighted mean central block group is shown in red, just below the other mean center.
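The effect of a weight field can be sketched as a weighted average of coordinates. The coordinates and density values below are invented for illustration and are not the Baltimore data; the function name `mean_center` is also our own.

```python
def mean_center(points, weights=None):
    """Mean center of (x, y) points, optionally weighted by a value per point."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted: every feature counts equally
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y

# Hypothetical block-group centroids and population densities.
centroids = [(0, 0), (4, 0), (4, 4), (0, 4)]
densities = [1, 1, 1, 5]

print(mean_center(centroids))             # (2.0, 2.0)
print(mean_center(centroids, densities))  # (1.0, 3.0)
```

The weighted center moves toward the densest feature at (0, 4), just as the weighted block group in the map shifts away from the unweighted one.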

Euclidean and Manhattan Distance

Another concept is that of Euclidean and Manhattan Distance.

A Euclidean Distance is the shortest distance between two objects, which is a straight line. A Manhattan Distance is the distance between two objects measured along two lines that meet at a right angle, one passing through each object, as if travelling along a grid of streets.

Because real-world phenomena like traffic or pedestrian movement seldom occur in perfectly straight lines, the Manhattan distance is used as a way to approximate the effect of city blocks or other infrastructure in built environments on actual distances. Manhattan distance is not a perfect measure of distance especially when streets are winding or curvy like those around the Morgan Campus, but it does mitigate the errors that would arise if we used only the Euclidean distance.

If someone walks from the Maryland Convention Center to the National Aquarium, the distance they travel would best be approximated by the Manhattan Distance (in black, above), as opposed to the Euclidean Distance (in red, above), which is shorter but unrealistic because it passes over the Inner Harbor. The Manhattan Distance is so named because it alludes to the rectangular city blocks of Manhattan, and it is also sometimes called the "Taxi Cab" distance. Euclidean Distance is also known as the distance "as the crow flies."
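The two measures differ only in how the horizontal and vertical offsets are combined. A minimal sketch, using made-up planar coordinates rather than the actual locations in the map:

```python
import math

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance between two (x, y) points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])

def manhattan(a, b):
    """Distance travelled along a right-angled grid between two (x, y) points."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

# Hypothetical start and end points, 3 blocks east and 4 blocks north apart.
start, end = (0, 0), (3, 4)

print(euclidean(start, end))  # 5.0
print(manhattan(start, end))  # 7
```

The Manhattan distance is never shorter than the Euclidean distance, and the two are equal only when the points share an x or y coordinate.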

Mean and Median Centers

The Mean and Median Centers are two measures of spatial central tendency. The mean center is the average location (centroid) of all features in a shapefile, and the median center is the location that minimizes the total distance to all features.

In the image above left we can see the mean center of all states in green and the median center in yellow. In comparison, the mean and median centers of counties are shown in the image above right. The mean center of counties is slightly further east and south, because the number of counties is greater in those directions, while the median center of counties is slightly further south and west because the distances between individual counties are greater in those directions.

The mean center is shown below in green and is the same concept as the centroid that we saw before. In this case the mean center applies to every feature in the dataset.

Just as in non-spatial statistics, the spatial median center is less influenced by outliers:

In the selection of counties above, the presence of Maine is less influential on the median (yellow) than the mean (green), and for this reason the median is further west.
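The difference between the two centers can be sketched with a handful of points. The mean center is a plain coordinate average; for the median center this sketch assumes an iterative scheme known as Weiszfeld's algorithm, which is one common way to find the point of minimum total distance and is not necessarily how any particular GIS package does it. The points are invented, with one deliberate outlier.

```python
import math

def mean_center(points):
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n

def total_distance(center, points):
    return sum(math.hypot(x - center[0], y - center[1]) for x, y in points)

def median_center(points, iterations=200, tol=1e-9):
    """Approximate the point minimizing total distance (Weiszfeld iteration)."""
    cx, cy = mean_center(points)  # start from the mean center
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            if d < tol:
                continue  # skip a point sitting on the estimate (avoids dividing by zero)
            num_x += x / d
            num_y += y / d
            denom += 1 / d
        if denom == 0:
            break
        new_x, new_y = num_x / denom, num_y / denom
        moved = math.hypot(new_x - cx, new_y - cy)
        cx, cy = new_x, new_y
        if moved < tol:
            break
    return cx, cy

# Four clustered points plus one distant outlier (hypothetical coordinates).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]

mean_pt = mean_center(points)
median_pt = median_center(points)
print(mean_pt)    # (2.4, 2.4): pulled toward the outlier
print(median_pt)  # stays close to the cluster
```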

Central Feature Measure

The Central Feature tool is used to identify the feature in a shapefile that is most centrally located.

In the image above left we can see a census block group (in blue) that is the most central of all the block groups in Baltimore. The image above right shows a street grid in Baltimore (with detail) and highlights the most central street segment in Baltimore.

In contrast to the mean and median center tools, which identify central tendencies of features, the Central Feature tool identifies an actual feature within a shapefile, so the output will always be identical to one of the features in the shapefile.

The Central Feature tool has applications for planning initiatives. If you want to establish a health clinic based on income or other demographic data, you could use this tool to maximize accessibility. You can use the Central Feature tool for understanding distributions of spatial phenomena as well as for tasks like site selection. For example, if you are planning a conference in the state and have a shapefile of venues, you can find which one is most centrally located.
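The conference example can be sketched as a search for the venue with the smallest total distance to all the other venues. The venue names and coordinates are hypothetical, and straight-line distance is an assumption made here for simplicity.

```python
import math

def central_feature(features):
    """Return the name of the feature with the smallest total distance to all others."""
    def total_distance(name):
        origin = features[name]
        return sum(math.dist(origin, other) for other in features.values())
    return min(features, key=total_distance)

# Hypothetical venues with planar (x, y) coordinates.
venues = {
    "Venue A": (0, 0),
    "Venue B": (2, 1),
    "Venue C": (5, 0),
    "Venue D": (2, 6),
}

print(central_feature(venues))  # Venue B
```

Unlike the mean or median center, the result is always one of the input features.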

Linear Directional Mean

The Linear Directional Mean Tool (LDMT) measures the average direction of line segments. This tool can be used to measure the average direction of tornadoes or disease transmission and other phenomena that change location over time.

In the image above we can see tornado paths in the continental US from 1950 to 2019. The LDMT was used to find the average direction of all tornadoes in each year (right). Most of the arrows are in the state of Missouri because that is the average location of all tornadoes in the continental US. Each arrow indicates the compass direction, so we can see that most tornadoes move from southwest to northeast. The length of the line indicates the average length of a tornado path in that year.

The measurement adds together the compass degree directions and then calculates the average direction.

In the above image the average of all compass directions is 50.429 degrees.

If you have a dataset that shows direction of travel or the movement of diseases across a country, then it's important to know that the lines have a beginning and an end point, and the compass direction is based on where the end point is in relation to the beginning point. If you have a line shapefile and the lines were digitized or otherwise produced backwards, then the LDMT will produce direction lines backwards as well.
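Both points, averaging directions and the importance of line orientation, can be sketched in a few lines. This is an assumption-laden illustration rather than the tool's documented method: each line's compass bearing is computed from its start and end point, and the bearings are averaged through their sines and cosines, a common approach that avoids problems near north, where 359 degrees and 1 degree are almost the same direction. The line coordinates are invented.

```python
import math

def bearing(start, end):
    """Compass bearing in degrees (0 = north, 90 = east) from start to end."""
    dx = end[0] - start[0]  # eastward offset
    dy = end[1] - start[1]  # northward offset
    return math.degrees(math.atan2(dx, dy)) % 360

def directional_mean(lines):
    """Average compass direction of (start, end) line segments."""
    angles = [math.radians(bearing(s, e)) for s, e in lines]
    sin_sum = sum(math.sin(a) for a in angles)
    cos_sum = sum(math.cos(a) for a in angles)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360

# Hypothetical paths, all heading roughly north-east.
paths = [((0, 0), (1, 1)), ((0, 0), (1, 2)), ((0, 0), (2, 1))]
# The same paths digitized backwards (start and end swapped).
reversed_paths = [(end, start) for start, end in paths]

print(round(directional_mean(paths), 3))           # 45.0
print(round(directional_mean(reversed_paths), 3))  # 225.0
```

Swapping the start and end points turns a north-east mean (45 degrees) into a south-west one (225 degrees), which is why the digitizing direction matters.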

In transportation this tool has been used to plan transit services. Below is an image from a study that was done to identify potential public transportation usage by different income groups and mode choice. The straight blue lines represent links between trip begin and end points in different population centers and the black arrows show the average direction of trips originating in each population center.

The study revealed some behavioral patterns: higher-income people tended to travel to the east of the city, middle-class people (shown here) tended to travel northwards, and lower-income people tended to travel to the west of the city and its suburbs. Based on their analysis, the researchers were also able to identify competition and redundancy between different public transit modes, along with other inadequacies in how public transit supported travel behavior.

In public health epidemiology this measure is often used to track the spread of contagious disease. The image below was taken from a study of Hand, Foot and Mouth Disease in Thailand:

On the right we see the seasonal spread of the disease represented by red lines, and the linear directional mean in blue, which is the average direction of all the red lines. In this study the researchers were able to correlate the general direction of the disease’s spread with the annual monsoon cycle and implicate the monsoon in the spread of the disease.

Other applications of the Linear Directional Mean might include correlations between moving weather conditions, like dry air or particulate matter, and the appearance of respiratory disease, or possibly outbreaks of disease associated with holiday travel.

Directional Distribution

The Directional Distribution Ellipse tool creates a standard deviational ellipse to summarize the spatial characteristics of geographic features like central tendency, dispersion, and directional trends.

Above is an image of directional ellipses of the linear mean direction of tornado paths from 1950 to 2019. We can see how the average direction, average location and average length of tornado paths have changed over time, with each ellipse representing a single decade. The lines represent one standard deviation for each decade. The software can generate up to 3 standard deviations for a dataset by running the tool multiple times and changing the number of deviations.

This tool is useful for demonstrating trends quantitatively, as opposed to relying on the visualization of directional lines alone, which may be difficult to interpret:

The image above compares the output of the Linear Directional Mean Tool and the Directional Distribution Ellipse to illustrate how the ellipses can express several characteristics of the data at once.

A table detail of the directional distribution output.
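To give a feel for how an ellipse can summarize center, spread and orientation at once, here is a rough sketch. It is an assumption, not the tool's own formula: the ellipse axes are taken from the variances and covariance of the x and y coordinates, and the rotation is measured counter-clockwise from the x axis, so GIS software may report different axis lengths or use a different angle convention. The points are invented.

```python
import math

def deviational_ellipse(points, n_std=1):
    """Center, semi-major axis, semi-minor axis and rotation (degrees) of the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Variances and covariance of the coordinates about the mean center.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Spread along the two principal directions of the point cloud.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major = n_std * math.sqrt(half_trace + root)
    minor = n_std * math.sqrt(max(half_trace - root, 0.0))
    # Orientation of the major axis, counter-clockwise from the x axis.
    rotation = math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy))
    return (mx, my), major, minor, rotation

# Hypothetical points stretched along a south-west to north-east trend.
points = [(0, 0), (4, 4), (1, 3), (3, 1)]

center, major, minor, rotation = deviational_ellipse(points)
print(center, major, minor, round(rotation, 3))
```

For these points the ellipse is centered at (2, 2), is twice as long as it is wide, and is tilted 45 degrees from the x axis, which captures the directional trend in a single shape.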

Standard Distance

The Standard Distance tool produces a circle around the mean center of a shapefile. The circle represents one standard deviation (in general) of the features in the shapefile; that is, approximately 68% of all features will fall within the circle:

In the image above of counties in the continental US, we can see the 3 standard distance lines in purple. The table details indicate the number of features in each standard deviation. The first SD has 2100 out of 3108 observations, which corresponds closely to 68 percent of all observations, as anticipated by the empirical rule of the standard deviation (68%, 95%, 99.7%), while the second SD contains 95.4 percent of all observations. The third SD contains all of the observations, deviating slightly from the expected value of 99.7%.

The snippet above shows a detail of the attribute table for the standard distance line. The StdDist value represents the distance from the mean center to the line.
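The standard distance itself can be sketched as the root-mean-square distance of the features from their mean center. This is an unweighted, illustrative version using invented points; the share of features inside the circle depends on how the data are distributed, so it will not always be close to 68 percent for small or irregular datasets.

```python
import math

def standard_distance(points):
    """Mean center and standard distance (radius of the one-SD circle)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    squared = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points)
    return (mx, my), math.sqrt(squared / n)

def share_within(points, center, radius):
    """Fraction of points no farther than radius from the center."""
    inside = sum(math.dist(p, center) <= radius for p in points)
    return inside / len(points)

# Hypothetical feature locations.
points = [(0, 0), (4, 4), (1, 3), (3, 1)]

center, std_dist = standard_distance(points)
print(center, round(std_dist, 3))             # (2.0, 2.0) 2.236
print(share_within(points, center, std_dist))  # 0.5
```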