Data Management and Project Planning

A guide to data management and planning resources and practices

Communicate Data Quality

Information about quality control and quality assurance is an important component of the metadata:

  • Qualify (flag) data that have been identified as questionable by including a flagging column next to the column of data values. The two columns should be clearly associated through a naming convention such as Temperature, flag_Temperature (see the sketch following this list).
  • Describe the quality control methods applied and their assumptions in the metadata. Describe any software used when performing the quality analysis, including code where practical. Include in the metadata who did the quality control analysis, when it was done, and what changes were made to the dataset.
  • Describe standards or test data used for the quality analysis. For instance, include, when practical, the data used to make a calibration curve.
  • If data with qualifier flags are summarized to create a derived data set, include the percentages of flagged and missing data in the metadata of the derived data file. High-frequency observations are often downsampled, and it is critical to know how much of the data were rejected in the primary data.
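
As an illustration of the flagging convention and the derived-data percentages above, here is a minimal sketch in Python with pandas; the file name, the column names Temperature and flag_Temperature, and the flag value "questionable" are all hypothetical:

    import pandas as pd

    # Hypothetical high-frequency data with a flag column paired to its value column
    df = pd.read_csv("temperature_raw.csv")  # columns: Timestamp, Temperature, flag_Temperature

    # Fraction of records flagged as questionable, and fraction missing entirely
    pct_flagged = (df["flag_Temperature"] == "questionable").mean() * 100
    pct_missing = df["Temperature"].isna().mean() * 100

    # Report these percentages in the metadata of any derived (e.g., downsampled) file
    print(f"Flagged: {pct_flagged:.1f}%  Missing: {pct_missing:.1f}%")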

Confirm That the Data and Metadata Descriptions Match

To ensure that the metadata correctly describes what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and its format. This confirms that the metadata is sufficient to describe the data. For example, statistical software can be used to summarize the contents of a file and verify that data types, ranges, and, for categorical data, the set of values found are as described in the documentation/metadata.
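
For example, a brief sketch in Python with pandas (the file and column names are hypothetical) that summarizes a data file so its actual contents can be compared against the metadata:

    import pandas as pd

    df = pd.read_csv("site_observations.csv")  # hypothetical data file

    # Column data types, to compare against the types declared in the metadata
    print(df.dtypes)

    # Ranges of numeric columns, to compare against documented minima/maxima
    print(df.describe())

    # Distinct values of a categorical column, to compare against the allowed list
    print(df["habitat_type"].unique())  # "habitat_type" is a hypothetical column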

Consider Compatibility of Integrated Data

The integration of multiple data sets from different sources requires that they be compatible. Methods used to create the data should be considered early in the process, to avoid problems later during attempts to integrate data sets. Note that just because data can be integrated does not necessarily mean that they should be, or that the final product can meet the needs of the study. Where possible, clearly state situations or conditions where it is and is not appropriate to use your data, and provide information (such as software used and good metadata) to make integration easier.

Develop a Quality Assurance and Quality Control Plan

Just as data checking and review are important components of data management, so is the step of documenting how these tasks were accomplished. Creating a plan for how to review the data before it is collected or compiled allows a researcher to think systematically about the kinds of errors, conflicts, and other data problems they are likely to encounter in a given data set. When associated with the resulting data and metadata, these documented quality control procedures help provide a complete picture of the content of the dataset. A helpful approach to documenting data checking and review (often called Quality Assurance, Quality Control, or QA/QC) is to list the actions taken to evaluate the data, how decisions were made regarding problem resolution, and what actions were taken to resolve the problems at each step in the data life cycle. Quality control and assurance should include:

  • how to identify potentially erroneous data
  • how to deal with erroneous data
  • how problematic data will be marked (i.e., flagged)

For instance, a researcher may graph a series of observations and look for outliers, return to the original data source to confirm suspicions about certain values, and then make a change to the live dataset. In another case, researchers may wish to compare data streams from remote sensors, identifying discrepant data and retaining or dropping data sources accordingly. Recording how these steps were done can be invaluable for later understanding of the dataset, even by the original investigator.

Datasets that contain similar, consistently collected data can serve as baselines for comparison against one another.

  • Obtain data using similar techniques, processes, and environments to ensure comparable outcomes between datasets.
  • Provide mechanisms for comparing data sets against each other that give a measurable way to flag differences when they arise. Such differences can indicate a possible error condition, since one or more data sets are not exhibiting the outcome expected from similar data.

One efficient way to document data QA/QC as it is being performed is to use automation such as a script, macro, or stand-alone program. In addition to providing built-in documentation, automation makes error checking and review highly repeatable, which is helpful for researchers collecting similar data through time.
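
A minimal sketch of such a script in Python with pandas, assuming a CSV file with a numeric Temperature column (the file name, column name, and plausible range are all hypothetical); it performs a range check, flags the results, and writes a log that doubles as QA/QC documentation:

    import logging
    import pandas as pd

    logging.basicConfig(filename="qaqc_log.txt", level=logging.INFO,
                        format="%(asctime)s %(message)s")

    df = pd.read_csv("station_data.csv")  # hypothetical input file

    # Range check: flag temperatures outside a plausible range for the site
    bad = (df["Temperature"] < -40) | (df["Temperature"] > 60)
    df["flag_Temperature"] = bad.map({True: -1, False: 1})  # -1 = potential problem, 1 = good

    # The log itself becomes part of the QA/QC documentation
    logging.info("Range check on Temperature: %d of %d records flagged",
                 bad.sum(), len(df))
    df.to_csv("station_data_checked.csv", index=False)

Because the same script can be rerun on each new batch of data, the checks stay consistent through time and the log records exactly what was done and when.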

The plan should be reviewed by others to make sure it is comprehensive.

Double-Check Data Entry

Ensuring accuracy of your data is critical to any analysis that follows.

When transcribing data from paper records to a digital format, have at least two (preferably more) people transcribe the same data, and compare the resulting digital files. At a minimum, someone other than the person who originally entered the data should compare the paper records to the digital file. Disagreements can then be flagged and resolved.
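
The comparison step can be automated; here is a sketch in Python using pandas' compare() (available in pandas 1.1 and later), assuming the two transcriptions have identical shape and column labels (file names hypothetical):

    import pandas as pd

    # Two independent transcriptions of the same paper data sheets
    entry1 = pd.read_csv("transcription_entry1.csv")
    entry2 = pd.read_csv("transcription_entry2.csv")

    # compare() returns only the cells where the two transcriptions disagree;
    # each disagreement should be resolved against the original paper record
    differences = entry1.compare(entry2)
    print(differences)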

In addition to transcription accuracy, data compiled from multiple sources may need review or evaluation. For instance, citizen science records such as bird photographs may have taxonomic identification that an expert may need to review and potentially revise.

Basic Quality Control Assurances

Quality control practices are specific to the type of data being collected, but some generalities exist:

  • Data collected by instruments:
    • Values recorded by instruments should be checked to ensure they are within the sensible range of the instrument and the property being measured. Example: Concentrations cannot be < 0, and wind speed cannot exceed the maximum speed that the anemometer can record.
  • Analytical results:

    • Values measured in the laboratory should be checked to ensure that they are within the detection limit of the analytical method and are valid for what is being measured. If values are below the detection limit, they should be properly coded and qualified.
    • Any ancillary data used to assess data quality should be described and stored. Example: data used to compare instrument readings against known standards.
  • Observations (such as bird counts or plant cover):

    • Range checks and comparisons with historic maxima will help identify anomalous values that require further investigation.
    • Comparing current and past measurements helps identify highly unlikely events. For example, it is unlikely that the girth of a tree will decrease from one year to the next.
  • Codes should be used to indicate the quality of data:

    • Codes should be checked against the list of allowed values to validate code entries (see the sketch after this list).
    • When coded data are digitized, they should be re-checked against the original source. Double data entry, or having another person check and validate the data entered, is a good mechanism for identifying data entry errors.
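
As a sketch of the code checks above, allowed values can be validated in Python with pandas (the codes, file name, and column name are hypothetical):

    import pandas as pd

    allowed_codes = {"G", "Q", "M"}  # hypothetical codes: good, questionable, missing

    df = pd.read_csv("vegetation_survey.csv")  # hypothetical file with a quality_code column

    # Any code not in the allowed list indicates a data entry error
    invalid = ~df["quality_code"].isin(allowed_codes)
    print(df.loc[invalid])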

Dates and times:

  • Ensure that dates and times are valid
  • Time zones should be clearly indicated (UTC or local)

Data types:

  • Values should be consistent with the data type (integer, character, datetime) of the column in which they are entered. Example: 12-20-2000A should not be entered in a column of dates.
  • Use consistent data types in your data files. A database, for instance, will prevent entry of a string into a column identified as having integer data.
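
A short sketch covering both checks, date validity and type consistency, in Python with pandas (the file and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("field_log.csv")  # hypothetical columns: date, count

    # Invalid dates (e.g., "12-20-2000A") become NaT instead of passing silently
    dates = pd.to_datetime(df["date"], errors="coerce")
    print(df.loc[dates.isna(), "date"])

    # Entries that cannot be parsed as numbers reveal type inconsistencies
    counts = pd.to_numeric(df["count"], errors="coerce")
    print(df.loc[counts.isna() & df["count"].notna(), "count"])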

Geographic coordinates:

  • Plot the coordinates on a map to detect errors, such as points that fall outside the study area.
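
As a sketch, coordinates can also be checked against the expected bounding box of the study area before mapping (the bounds, file name, and column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("sample_sites.csv")  # hypothetical columns: latitude, longitude

    # Points outside the study area's bounding box warrant closer inspection
    out_of_bounds = (~df["latitude"].between(38.0, 40.0)
                     | ~df["longitude"].between(-77.5, -75.5))  # hypothetical bounds
    print(df.loc[out_of_bounds])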

Ensure that Data Are Reproducible

When searching for data, whether locally on one’s machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access the data. To reproduce the search and obtain similar, if not the same, results, it is necessary to document which terms and queries were used.

  • Note the location of the originating data set
  • Document which search terms were used
  • Document any additional parameters that were used, such as the settings of interface controls (pull-down boxes, radio buttons, text entry forms)
  • Document the exact query that was issued, where possible
  • Note the database version and/or date, so you can account for data sets added since the query was last performed
  • Note the name of the website and URL, if applicable
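
One lightweight way to capture all of the above is a small, structured provenance file saved alongside the downloaded data; a Python sketch (every field value here is hypothetical):

    import json
    from datetime import date

    query_record = {
        "website": "Example Data Clearinghouse",          # hypothetical source
        "url": "https://data.example.org/search",
        "search_terms": ["soil moisture", "Maryland"],
        "controls": {"year_range": "2010-2015", "format": "CSV"},
        "query": "soil moisture AND state:MD",
        "database_version": "v3.2",
        "date_performed": date.today().isoformat(),
    }

    with open("query_provenance.json", "w") as f:
        json.dump(query_record, f, indent=2)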

Identify Missing Values and Define Missing Value Codes

Missing values should be handled carefully to avoid their affecting analyses. The content and structure of data tables are best maintained when consistent codes are used to indicate that a value is missing in a data field. Commonly used approaches for coding missing values include:

  • Use a missing value code that matches the reporting format for the specific parameter. For example, use "-999.99" when the reporting format is a FORTRAN-like F7.2.
  • For character fields, it may be appropriate to use "Not applicable" or "None", depending upon the organization of the data file.
  • It might be useful to use a placeholder value such as "Pending assignment" when compiling draft information, to facilitate returning to incomplete fields.
  • Do not use character codes in an otherwise numeric field.

Whatever missing value code is chosen, it should be used consistently throughout all associated data files and identified in the metadata and/or data description files.
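
When the data are later read for analysis, the documented codes can be mapped back to true missing values; a pandas sketch (the file name and codes are illustrative):

    import pandas as pd

    # Treat the codes documented in the metadata as missing values on load
    df = pd.read_csv("stream_chemistry.csv",
                     na_values=["-999.99", "Not applicable", "None"])

    print(df.isna().sum())  # count of missing values per column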

Identify Outliers

Outliers may not be the result of actual observations, but rather the result of errors in data collection, data recording, or other parts of the data life cycle. The following can be used to identify outliers for closer examination:

Statistical determination:

  • Outliers may be detected by using Dixon’s test, Grubbs’ test, or the Tietjen-Moore test.
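
A minimal sketch of a two-sided Grubbs test for a single outlier, implemented in Python with NumPy and SciPy (the significance level is illustrative; the test assumes approximately normal data and at least three values):

    import numpy as np
    from scipy import stats

    def grubbs_test(values, alpha=0.05):
        """Return (is_outlier, index) for the most extreme value."""
        x = np.asarray(values, dtype=float)
        n = len(x)
        mean, sd = x.mean(), x.std(ddof=1)
        # Test statistic: largest absolute deviation from the mean, in SD units
        idx = int(np.argmax(np.abs(x - mean)))
        g = abs(x[idx] - mean) / sd
        # Critical value derived from the t-distribution (two-sided test)
        t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        return g > g_crit, idx

A flagged value should still be examined rather than removed automatically (see the note at the end of this section).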

Visual determination:

  • Box plots are useful for indicating outliers
  • Scatter plots help identify outliers when there is an expected pattern, such as a daily cycle
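
A short matplotlib sketch of both visual approaches (the file and column names are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("hourly_temperature.csv")  # hypothetical columns: hour, temperature

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.boxplot(df["temperature"].dropna())     # outliers appear beyond the whiskers
    ax2.scatter(df["hour"], df["temperature"])  # deviations from the daily cycle stand out
    plt.show()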

Comparison to related observations:

  • Difference plots for co-located data streams can show unreasonable variation between data sources. Example: difference plots can be constructed from weather stations in close proximity or from redundant sensors.
  • Comparing two parameters that should covary can reveal data contamination. Example: declining soil moisture and increasing temperature are likely to result in decreasing evapotranspiration.
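
A sketch of a difference plot for two co-located or redundant sensors (the file and column names are hypothetical):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("paired_sensors.csv")  # hypothetical columns: time, sensor_a, sensor_b

    # Differences between redundant sensors should stay near zero;
    # large excursions point to a problem in one of the data streams
    diff = df["sensor_a"] - df["sensor_b"]
    plt.plot(pd.to_datetime(df["time"]), diff)
    plt.axhline(0, color="gray")
    plt.ylabel("sensor_a - sensor_b")
    plt.show()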

No outliers should be removed without careful consideration and verification that they do not represent true phenomena.

Identify Values that Are Estimated

Data tables should ideally include values that were acquired in a consistent fashion. However, sometimes instruments fail and gaps appear in the records. For example, a data table representing a series of temperature measurements collected over time from a single sensor may include gaps due to power loss, sensor drift, or other factors. In such cases, it is important to document that a particular record was missing and replaced with an estimated or gap-filled value.

Specifically, whenever an original value is not available or is incorrect and is substituted with an estimated value, the method for arriving at the estimate needs to be documented at the record level. This is best done in a qualifier flag field. An example data table including a header row follows:

Day, Avg Temperature, Flag
1, 31.2, actual
2, 32.3, actual
3, 33.4, estimated
4, 35.8, actual
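
A sketch of how a gap might be filled and flagged at the record level, using the values from the table above; linear interpolation is only one possible estimation method (the table's estimate of 33.4 may have been derived differently):

    import pandas as pd

    df = pd.DataFrame({"Day": [1, 2, 3, 4],
                       "Avg Temperature": [31.2, 32.3, None, 35.8]})

    # Flag each record before filling, so estimates remain identifiable
    df["Flag"] = df["Avg Temperature"].isna().map({True: "estimated", False: "actual"})
    df["Avg Temperature"] = df["Avg Temperature"].interpolate()  # linear by default
    print(df)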

Mark Data with Quality Control Flags

As part of any review or quality assurance of data, potential problems can be categorized systematically. For example, data can be labeled 0 for unexamined, -1 for potential problems, and 1 for “good data.” Some research communities have developed standard protocols; check with others in your discipline to determine whether standards for data flagging already exist.

The marine community offers many examples of quality control flags that can be found on the web. There do not yet appear to be shared standards across the marine or terrestrial communities.

Provide Version Information for Use and Discovery

Provide versions of data products with defined identifiers to enable discovery and use.

Items to consider when versioning data products:

  • Develop a definition of what constitutes a new version of the data, for example:
    • New processing algorithms
    • Additions or removal of data points
    • Time or date range
    • Included parameters
    • Data format
    • Immutability of versions
  • Develop a standard naming convention for versions, with associated descriptive information
  • Associate metadata with each version, including a description of what differentiates it from other versions
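
As a sketch, a simple machine-readable version record could accompany each release of a data product (all names, values, and the file name below are hypothetical):

    import json

    # Hypothetical version record for a released data product
    version_record = {
        "product": "daily_streamflow",
        "version": "2.1",
        "date_released": "2018-06-01",
        "changes": "Reprocessed with updated algorithm; 14 data points removed",
        "parameters": ["discharge", "stage"],
        "format": "CSV",
        "immutable": True,  # released versions are never modified in place
    }

    with open("daily_streamflow_v2.1.json", "w") as f:
        json.dump(version_record, f, indent=2)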