
GIS Services

A compendium of GIS data resources and techniques.

About Data Capture

  • Initial Questions to ask
  • Data Searching
  • The Data Capture CRAAP Test
    • Currency
    • Reliability
    • Authority
    • Accuracy
    • Relevance

Initial Questions to ask

Data capture is the process of searching for and retrieving, or creating, the data we need for our project. Before searching for data, we must make several decisions about the spatial and temporal scope of the data we are looking for.

Spatial Scope: Identify the precise area we are interested in studying, e.g., Maryland, Baltimore City, Baltimore City and Baltimore County, the Washington metropolitan region, etc.

Temporal Scope: Identify the precise time period we are looking for, e.g., the most recent data available, a comparison of 2000 and 2010, etc.

Data Type and Format:

The two principal spatial data types in GIS are vector and raster. The vector model is used primarily to represent discrete features like roads, streams, locations, and buildings, while raster data are used to model continuous phenomena like elevation, temperature, land cover, and so on.
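
As a concrete illustration, here is a minimal Python sketch of reading one dataset of each type, assuming the geopandas and rasterio libraries are installed; the file names roads.shp and elevation.tif are hypothetical placeholders.

```python
# A minimal sketch: reading vector and raster data in Python.
# Assumes geopandas and rasterio are installed; both file names are
# hypothetical placeholders for your own data.
import geopandas as gpd
import rasterio

# Vector: discrete features stored as geometries plus an attribute table.
roads = gpd.read_file("roads.shp")
print(roads.geom_type.unique())   # e.g., ['LineString']
print(roads.head())               # the attribute table

# Raster: a continuous surface stored as a grid of cell values.
with rasterio.open("elevation.tif") as dem:
    print(dem.width, dem.height)  # grid dimensions
    print(dem.res)                # cell size (resolution)
    elevation = dem.read(1)       # band 1 as a 2-D array
    print(elevation.min(), elevation.max())
```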


Data Searching

Once we have outlined the spatial and temporal scope of our project, we have to think about what kind of data we are going to use.

 

Furthermore, the data you need may not be available in the format you want, or it may not be available at all. For example, you may want a dataset of all food stores in Baltimore but only be able to find food store data for the entire state of Maryland; or you may want food store data and not find any at all.

Although there are many scenarios that can emerge in the data capture phase, we can generalize all of them into the following three scenarios:

  1. We have searched for and found the exact data we need.
  2. We have searched for and found data that is similar to what we need, but not an exact match, and we need to change it before using it in a GIS project.
  3. We cannot find the data we are looking for.

Examples:

Scenario 1:
We want to analyze the population of Maryland using the Index of Dissimilarity to measure the level of integration between African-Americans and Whites in Maryland, with American Community Survey data from 2013-2017.
In this case we can retrieve data directly from a source like NHGIS.org or data.census.gov. These sources provide census data by race as well as spatial data for the state of Maryland, making them an exact match for our needs.
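
As a sketch of the analysis itself: the Index of Dissimilarity is D = ½ Σ|bᵢ/B − wᵢ/W|, summed over tracts i, where B and W are the statewide totals for each group. A minimal pandas version, assuming a hypothetical CSV of tract-level counts with columns named black and white:

```python
# A minimal sketch of the Index of Dissimilarity with pandas.
# The file name and column names are hypothetical; adjust them to
# match your NHGIS or data.census.gov extract.
import pandas as pd

tracts = pd.read_csv("md_tracts_race.csv")

# Each tract's share of the statewide total for each group.
b = tracts["black"] / tracts["black"].sum()
w = tracts["white"] / tracts["white"].sum()

# D ranges from 0 (even distribution) to 1 (complete segregation).
dissimilarity = 0.5 * (b - w).abs().sum()
print(f"Index of Dissimilarity: {dissimilarity:.3f}")
```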

 

Scenario 2:
We want to analyze the population of Baltimore using the Index of Dissimilarity to measure the level of integration between African-Americans and Whites in Baltimore.
In this case we can find data from NHGIS or the Census Bureau as above, but we will have to process it to extract Baltimore from the wider Maryland dataset.
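
A minimal sketch of that extraction step with geopandas; the shapefile name is hypothetical, and the COUNTYFP field is an assumption based on common Census TIGER/Line conventions (Baltimore City's county FIPS code is 510):

```python
# A minimal sketch: extracting Baltimore City from a statewide layer.
# The file name is hypothetical; COUNTYFP follows Census TIGER/Line
# conventions, where Baltimore City is county FIPS 510.
import geopandas as gpd

md_tracts = gpd.read_file("maryland_tracts.shp")
baltimore = md_tracts[md_tracts["COUNTYFP"] == "510"]
baltimore.to_file("baltimore_tracts.shp")
```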

 

Scenario 3:
We want to create a map of police call boxes on Morgan's campus, but cannot find one on any website, and after contacting several campus offices we determine that there probably is no data about the locations of police call boxes on campus.
We will have to create the data from scratch.
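
A minimal sketch of creating such a layer from scratch with geopandas; the coordinates and attributes below are placeholders, not real call-box locations:

```python
# A minimal sketch: building a point layer from field-collected data.
# Coordinates and attributes are placeholders, not real call-box sites.
import geopandas as gpd
from shapely.geometry import Point

callboxes = gpd.GeoDataFrame(
    {"box_id": [1, 2, 3],
     "location": ["Library", "Student Center", "North Lot"]},
    geometry=[Point(-76.583, 39.344),
              Point(-76.585, 39.345),
              Point(-76.586, 39.346)],
    crs="EPSG:4326",  # WGS 84 longitude/latitude
)
callboxes.to_file("police_callboxes.shp")
```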

Data Capture CRAAP Test

Data capture is often the most time-consuming part of a GIS project. Not only are there many data sources that you can choose from, but there are other issues to consider, like the currency, reliability, and accuracy of the data.

When searching for data it is very important to consider the currency of the data, the reliability of the source, the accuracy of the data itself, and the purpose for which the data were created. The CRAAP test was developed to help students evaluate information found on websites, but in modified form it can be applied to data capture in GIS as well.

CRAAP is an acronym for:

  • Currency
  • Reliability
  • Authority
  • Accuracy
  • Purpose/Point-of-view.

While these concepts are very useful when searching for and evaluating online data sources, they are also useful to keep in mind when creating our own data, so that our projects will be sound and the data can later be used by others.


Currency

The timeliness of the data

  • When was the data published or posted?
  • Has the data been revised or updated?
  • How recent is the information, and is it current enough for your topic?
  • Are links to the data or within the web source functional?

In all but historical or longitudinal research we typically want to find the most current data available. For example, if we want to study the population characteristics of Baltimore, we probably want to use the latest American Community Survey (ACS) data from the U.S. Census Bureau instead of the decennial census, especially if the most recent 10-year census was conducted more than five years ago.

The ACS was developed by the Census Bureau because more responsive data were needed to understand population changes that occur between the 10-year censuses.

Sometimes it is not clear how current the data are. In the example below we can see three different years: 2008, 2011, and 2017. The title says "Last Updated 2008," and this is the year that corresponds to the currency of the data. The other years refer to the record about the data, not the data itself. In examples like this it is easy to get confused about how current the data are, and it may even be necessary to download the data and look at the attribute table for more information.

[Image: record page for the Baltimore City land use shapefile, showing "Last Updated 2008"]

See the original here: https://data.baltimorecity.gov/Geographic/Land-use-Shape/feax-3ycj
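
If you do download the data, a quick way to hunt for currency information is to scan the attribute table for date-like fields. A minimal geopandas sketch, assuming a hypothetical local copy of the shapefile:

```python
# A minimal sketch: inspecting an attribute table for date fields.
# The file name is hypothetical; field names vary by publisher.
import geopandas as gpd

landuse = gpd.read_file("Land_use_Shape.shp")
print(landuse.columns.tolist())

# Look for anything that might record when features were updated.
date_fields = [c for c in landuse.columns
               if "date" in c.lower() or "year" in c.lower()]
for field in date_fields:
    print(field, landuse[field].dropna().unique()[:5])
```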


Reliability

  • What kind of information is included in the resource?
  • Does the source provide references for the origins of the data?

We want data that are complete and well documented, and that have a known provenance. If we find data on someone's personal Google Map, it might not have any information about when it was created, how complete it is, or even what the data represent.

Let's say we want a shapefile of free health clinics in Baltimore. We might be able to find a dataset that someone created for a class project and uploaded to Google, but is the dataset complete, or are there clinics that were not included? Does the attribute table have enough information, like the street addresses of clinics, hours of operation, capacity, etc.? Another problem: maybe someone is running a clinic in their basement that is not exactly on the level. Is this something you would want to include in your project?

In these cases you may be able to compare the dataset with another resource, like a list of clinics provided by the health department, or you could generate a new shapefile by using geocoding to convert the address data into spatial data, as in the sketch below.
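
A minimal sketch of the geocoding route, assuming geopandas with the geopy package installed; the clinic addresses are hypothetical:

```python
# A minimal sketch: geocoding street addresses into a point layer.
# Assumes geopandas with geopy installed; the addresses are hypothetical.
import geopandas as gpd

addresses = [
    "1515 W North Ave, Baltimore, MD",
    "4940 Eastern Ave, Baltimore, MD",
]

# geocode() returns a GeoDataFrame with one point per address.
clinics = gpd.tools.geocode(addresses, provider="nominatim",
                            user_agent="gis-data-capture-guide")
clinics.to_file("clinics.shp")
```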

In the example below we can see a description of the data that comes with the Highway Performance Monitoring System produced by the Federal Highway Administration. This is a good example of a well-documented data source that explains in detail which variables are included and what the field names mean.

[Image: data documentation from the Highway Performance Monitoring System]


Authority

  • Who is the creator or author?
  • What are their credentials? Can you find any information about the creator’s background?
  • Who is the publisher or sponsor and are they reputable?
  • What is the publisher's interest (if any) in this information?
     

We want data that have been created by people who know what they are doing. While it is not often necessary to know the exact person who created the data, it is usually enough to retrieve data from a reliable source like a state or federal government agency. Institutionally generated data can also be regarded as authoritative.

Let's say we want a dataset of street crosswalks around the Morgan campus, and we find one on Google Maps and another on the City of Baltimore data portal. The one on Google Maps may be perfectly fine, and for all we know the person who uploaded it may have gotten it from the City of Baltimore website, but just to be safe we want to make sure that our data comes from a reliable source like a local government or other known agency; then we can assume that there is some quality control and that the data has a provenance.


Accuracy

Reliability, truthfulness, and correctness of the content

  • Where does the information come from?
  • Is the information supported by evidence?
  • Has the information been reviewed or refereed?
  • Can you verify any of the information from another source or from personal knowledge?
  • Does the language or tone seem unbiased and free of emotion?
  • Are there spelling, grammar, or typographical errors?

 

Accuracy is the most expansive criterion in the context of GIS. As Bolstad states, “An accurate observation reflects the true shape, location or characteristics of the phenomenon represented in GIS” 27 (p. 621), and he further defines four parameters of accuracy in GIS. The first is “Positional Accuracy,” or how close the GIS model is to the real location. The second is “Attribute Accuracy,” or statistical errors between the attribute data and the population, based on sampling. The third is “Logical Consistency,” or the presence or lack of paradoxes, such as a building site that is located in a water body. The last parameter is “Completeness,” or how well the data reflect the frequency of real-world phenomena. 28 The long process of collecting data employs quality controls to ensure data accuracy, but these also hold the potential for introducing and propagating errors. 29 As noted above, accuracy is in many ways implied by authority, and the end users searching for these data, students especially, often lack the hardware and technical skills necessary to fully evaluate the accuracy of data sources, leaving them the only option of assuming the data are accurate.
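
Of the four parameters, logical consistency is one that students can often check programmatically. Here is a minimal sketch that flags Bolstad's example paradox, buildings located in water bodies, using a spatial join in geopandas; both file names are hypothetical:

```python
# A minimal sketch of a logical-consistency check: flag buildings
# whose geometry falls inside a water body. Both file names are
# hypothetical placeholders for your own layers.
import geopandas as gpd

buildings = gpd.read_file("buildings.shp")
water = gpd.read_file("water_bodies.shp").to_crs(buildings.crs)

# Keep only buildings lying within water, which should be impossible
# in logically consistent data.
suspect = gpd.sjoin(buildings, water, how="inner", predicate="within")
print(f"{len(suspect)} buildings fall inside a water body")
```
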
Generalization is an implicit source of error in spatial data. Because vector and raster data are models, or generalizations, of real-world phenomena, there is naturally a loss of detail and precision between them and spatial reality, as when vector data are digitized at too small a scale or raster data are captured at too coarse a resolution. The selection of more generalized data is influenced by the computational constraints of hardware, like processing speed and hard-drive space, and by cartographic considerations: highly detailed data are difficult to interpret at small scales.
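
Generalization itself is easy to demonstrate in code. The sketch below applies Douglas-Peucker simplification to a hypothetical detailed county layer; the tolerance is arbitrary and is expressed in the layer's map units:

```python
# A minimal sketch of generalization: simplifying geometry with a
# Douglas-Peucker tolerance. The file name is hypothetical, and the
# tolerance (in map units) is arbitrary.
import geopandas as gpd

counties = gpd.read_file("us_counties_detailed.shp")

# Larger tolerances discard more vertices, trading positional
# accuracy for a cleaner look (and smaller files) at small scales.
generalized = counties.copy()
generalized["geometry"] = counties.geometry.simplify(tolerance=0.01)
generalized.to_file("us_counties_generalized.shp")
```
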
Figure 3 compares detailed (left) and generalized (right) shapefiles of U.S. counties on the East Coast. The detailed shapefile is not well suited to small scales because the coastal areas in particular appear cluttered. The generalized shapefile presents a much cleaner look.


Figure 3. East Coast U.S. counties represented by detailed (left) 30 and generalized (right) 31 shapefiles.

In contrast, Figure 4 shows how the same shapefiles appear at large scales. The generalized shapefile (left) makes the landforms and counties appear awkward and almost abstract, while the detailed shapefile (right) presents a more realistic representation of coastal areas.


Figure 4. Detail of the Upper Chesapeake Bay represented by generalized (left) and detailed (right) shapefiles at large scale.


Relevance

The importance of the data for your research needs

  • Does the information relate to your topic or (help to) answer your question?
  • Who is the information’s intended audience?
  • Is the information at an appropriate level (i.e., not too elementary or advanced for your needs)?
  • Have you looked at a variety of sources before determining this is one you will use?
  • Would you be comfortable citing this source in your research paper?

The importance of selecting data relevant to a GIS project is self-evident. The relevance of data is determined by the initial project research questions: the study area, the unit of analysis, the populations or phenomena being studied, and regulatory and legal obligations. In GIS, this criterion can be expanded to include selecting a data model, a scale, and cartographic considerations. Adding data like street lines and other infrastructure can give context to an analysis and make the map more effective at communicating spatial patterns. When selecting data to contextualize an analysis, assumptions about the audience's familiarity with cartography and the relevance of the data to their experience or expectations must also be considered.

Selecting relevant data also requires comprehensive familiarity with available data, including what the data represent and, equally important, what they do not represent. For example, in searching for ACS data about health insurance status and poverty level, we can find three tables and a sample of their attributes:

C27016. Health Insurance Coverage Status by Ratio of Income to Poverty Level in the Past 12 Months by Age
Under 1.00 of poverty threshold — 19 to 64 years — with health insurance coverage
Under 1.00 of poverty threshold — 19 to 64 years — no health insurance coverage

C27017. Private Health Insurance by Ratio of Income to Poverty Level in the Past 12 Months by Age
Under 1.00 of poverty threshold — 19 to 64 years — with private health insurance
Under 1.00 of poverty threshold — 19 to 64 years — no private health insurance

C27018. Public Health Insurance by Ratio of Income to Poverty Level in the Past 12 Months by Age
Under 1.00 of poverty threshold — 19 to 64 years — with public coverage
Under 1.00 of poverty threshold — 19 to 64 years — no public coverage

For effective analyses, we have to know the answers to the following questions:
1. Is the population with no private insurance equal to the population with no insurance plus the population with public insurance?
2. Is the population with no public coverage equal to the population with private insurance plus the population with no health insurance?

These questions might seem paradoxical at first glance and can cause confusion for students trying to navigate the wide variety of data in sources like the ACS. Taking adequate time to understand fully what the data are saying, and to find the data most responsive to their needs, will spare students much frustration and wasted time.
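
Once the three tables are downloaded for the same geographies, the two questions can be tested directly. A minimal pandas sketch, in which the file name and every column name are hypothetical stand-ins for the ACS variable codes:

```python
# A minimal sketch: testing the two questions above with pandas.
# The file and column names are hypothetical stand-ins for the
# estimates drawn from tables C27016, C27017, and C27018.
import pandas as pd

df = pd.read_csv("acs_insurance_by_poverty.csv")

# Question 1: does "no private" equal "no insurance" + "public"?
q1 = (df["no_private"] == df["no_insurance"] + df["public"]).all()

# Question 2: does "no public" equal "private" + "no insurance"?
q2 = (df["no_public"] == df["private"] + df["no_insurance"]).all()

# Neither identity is guaranteed: a person can hold both public and
# private coverage, so the categories overlap rather than partition
# the population.
print(q1, q2)
```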

