
Data Management and Project Planning

A guide to data management and planning resources and practices

Backup Your Data

To avoid accidental loss of data, you should:

  • Back up your data at regular intervals
    • When you complete your data collection activity
    • After you make edits to your data
  • Streaming data should be backed up at regularly scheduled points in the collection process
    • High-value data should be backed up daily or more often
    • Automation simplifies frequent backups
  • Backup strategies (e.g., full, incremental, differential) should be optimized for the data collection process
  • Create, at a minimum, 2 copies of your data
  • Place one copy at an “off-site” and “trusted” location
    • Commercial storage facility
    • Campus file-server
    • Cloud file-server (e.g., Amazon S3, Carbonite)
  • Use a reliable device when making backups
    • External USB drive (avoid the use of “light-weight” devices e.g., floppy disks, USB stick-drive; avoid network drives that are intermittently accessible)
    • Managed network drive
    • Managed cloud file-server (e.g., Amazon S3, Carbonite)
  • Ensure backup copies are identical to the original copy (see the sketch after this list)
    • Perform differential checks
    • Perform a “checksum” comparison
  • Document all procedures to ensure a successful recovery from a backup copy
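
For example, a backup step can be scripted so that each copy is verified before the original is considered safe. The following is a minimal Python sketch, assuming hypothetical file and backup-directory paths; adapt it to your own storage locations and backup strategy.

```python
import hashlib
import shutil
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_file(original: Path, backup_dir: Path) -> Path:
    """Copy a file to the backup location and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = backup_dir / original.name
    shutil.copy2(original, copy)  # copy2 also preserves timestamps
    if sha256sum(original) != sha256sum(copy):
        raise IOError(f"Backup of {original} does not match the original")
    return copy

# Hypothetical paths -- replace with your own data file and backup location.
if __name__ == "__main__":
    backup_file(Path("field_survey_2018.csv"), Path("/mnt/backup/project_x"))
```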

Choose and Use Standard Terminology to Enable Discovery

Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:

  • Identify the relevant descriptive terms used as categorical values in your community prior to the start of the project (e.g., standard terms describing soil horizons, plant taxonomy, sampling methodology or equipment, etc.)
  • Identify locations in metadata where standardized terminology should be used and sources for the terms. Terminology should reflect both data type/content and access methods.
  • Review existing thesauri, ontologies, and keyword lists before creating new terms. Potential sources include the Semantic Web for Earth and Environmental Terminology (SWEET), Planetary Ontologies, and the NASA Global Change Master Directory (GCMD)
  • Enforce use of standard terminology in your workflow (see the sketch after this list), including:
    • Use of lookup tables in data-entry forms
    • Use of field-level constraints in databases (restrict data import to match accepted domain values)
    • Use of XML validation
    • Manual review
  • Publish metadata using Open Standards, for example:
    • Z39.50
    • OGC Catalog Services for Web (CSW)
    • Web Accessible Directory (WAD)
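
As an illustration of enforcing standard terminology in a workflow, the sketch below checks the values of one column in a CSV file against an accepted vocabulary. The vocabulary, file name, and column name are hypothetical placeholders; in practice, draw the accepted terms from a community standard such as SWEET or GCMD.

```python
import csv

# Hypothetical controlled vocabulary; in practice draw the terms from a
# community standard rather than inventing them.
SOIL_HORIZONS = {"O", "A", "E", "B", "C", "R"}

def validate_column(csv_path: str, column: str, vocabulary: set) -> list:
    """Return (row number, value) pairs whose value is not in the vocabulary."""
    problems = []
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            value = row.get(column, "").strip()
            if value not in vocabulary:
                problems.append((i, value))
    return problems

# Usage with a hypothetical data file and column name:
# for rownum, value in validate_column("plots.csv", "soil_horizon", SOIL_HORIZONS):
#     print(f"Row {rownum}: '{value}' is not an accepted soil horizon term")
```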

If you must use an unconventional or unique vocabulary, it should be identified in the metadata and fully defined in the data documentation (attribute name, values, and definitions).

Create and Document a Data Backup Policy

A backup policy helps manage users’ expectations and provides specific guidance on the “who, what, when, and how” of the data backup and restore process. There are several benefits to documenting your data backup policy:

  • Helps clarify the policies, procedures, and responsibilities
  • Allows you to dictate:
    • where backups are located
    • who can access backups and how they can be contacted
    • how often data should be backed up
    • what kind of backups are performed and
    • what hardware and software are recommended for performing backups
  • Identifies any other policies or procedures that may already exist (such as contingency plans) and notes whether any of them supersede the backup policy
  • Has a well-defined schedule for performing backups
  • Identifies who is responsible for performing the backups, along with their contact information. This should include more than one person, in case the primary person responsible is unavailable
  • Identifies who is responsible for checking that the backups have been performed successfully, and how and when they will perform this check
  • Ensures data can be completely restored
  • Has training for those responsible for performing the backups and for the users who may need to access the backups
  • Is partially, if not fully automated
  • Ensures that more than one copy of the backup exists and that it is not located in the same location as the originating data
  • Ensures that a variety of media are used to back up data, as each media type has its own inherent reliability issues
  • Ensures the structure of the data being backed up mirrors the originating data
  • Notes whether or not the data will be archived

If this information is located in one place, it makes it easier for anyone needing the information to access it. In addition, if a backup policy is in place, anyone new to the project or office can be given the documentation which will help inform them and provide guidance.

Decide What Data to Preserve

The process of science generates a variety of products that are worthy of preservation. Researchers should consider all elements of the scientific process in deciding what to preserve:

  • Raw data
  • Tables and databases of raw or cleaned observation records and measurements
  • Intermediate products, such as partly summarized or coded data that are the input to the next step in an analysis
  • Documentation of the protocols used
  • Software or algorithms developed to prepare data (cleaning scripts) or perform analyses
  • Results of an analysis, which can themselves be starting points or ingredients in future analyses, e.g. distribution maps, population trends, mean measurements
  • Any data sets obtained from others that were used in data processing
  • Multimedia that documents procedures or serves as standalone data

When deciding on what data products to preserve, researchers should consider the costs of preserving data:

  • Raw data are usually worth preserving
  • Consider space requirements when deciding on whether to preserve data
  • If data can be easily or automatically re-created from the raw data, consider not preserving them; conversely, if reproduction would be costly (e.g., data that have undergone quality control and analysis), consider preserving them
  • Algorithms and software source code cost very little to preserve
  • Results of analyses may be particularly valuable for future discovery and cost very little to preserve

Researchers should consider the following goals and benefits of preservation:

  • Enabling re-analysis of the same products to determine whether the same conclusions are reached
  • Enabling re-use of the products for new analysis and discovery
  • Enabling restoration of original products in the case that working datasets are lost

Ensure Flexible Data Services for Virtual Datasets

In order for a large dataset to be effectively used by a variety of end users, the following procedures for preparing a virtual dataset are recommended:

  • Identify data service users

  • Define data access capabilities needed by community(s) of users. For example:
    • Spatial subsetting
    • Temporal subsetting
    • Parameter subsetting
    • Coordinate transformation
    • Statistical characterization
  • Define service interfaces based upon Open Standards (see the sketch after this list). For example:
    • Open Geospatial Consortium (OGC WMS, WFS, WCS)
    • W3C (SOAP)
    • REST (an architectural style built on the IETF’s Hypertext Transfer Protocol [HTTP])
  • Publish service metadata based upon Open Standards. For example:

    • Web Services Definition Language (WSDL)
    • RSS/Atom (see Service Casting reference below for an example of a model for publishing service metadata for a variety of service types)
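
As an example of a standards-based service interface, the sketch below issues a standard OGC WMS GetCapabilities request, which returns an XML document describing the layers, coordinate systems, and formats a service offers. The endpoint URL is a hypothetical placeholder; substitute the address of the service you are publishing or consuming.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical service endpoint -- substitute the URL of your own WMS.
ENDPOINT = "https://data.example.org/wms"

# GetCapabilities is the standard OGC request for discovering what a service offers.
params = {"service": "WMS", "version": "1.3.0", "request": "GetCapabilities"}
url = f"{ENDPOINT}?{urlencode(params)}"

with urlopen(url) as response:      # returns an XML capabilities document
    capabilities_xml = response.read().decode("utf-8")

print(capabilities_xml[:500])       # inspect the first part of the document
```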

Ensure Integrity and Accessibility When Making Backups of Data

For successful data replication and backup:

  • Users should ensure that backup copies have the same content as the original data file
    • Calculate a checksum (e.g., using the MD5 algorithm: http://en.wikipedia.org/wiki/MD5) for both the original and the backup copy and compare them; if they differ, back up the file again
    • Compare the files to ensure that there are no differences
  • Document all procedures (e.g., compression / decompression process) to ensure a successful recovery from a backup copy

  • To check the integrity of the backup file, periodically retrieve your backup file, open it on a separate system, and compare it to the original file (a manifest-based checksum sketch follows this list)

  • A data backup is only valuable if it is accessible. When access to a data backup is required, the owner of the backup may not be available. It is important that others know how to access the backup, otherwise the data may not be accessible for recovery. It is important to know the “who, what, when, where, and how” of the backups:
    • Have contact information available for the person responsible for the data
    • Ensure that those who need access to backups have proper access
    • Communicate what data is being backed up
    • Note how often the data is backed up and where that particular backup is located including
      • physical location (machine, office, company)
      • file system location
    • Be aware that there may be different backup procedures for different data sets:
      • Not all backups may be located in the same location
      • Depending upon the backup schedule, each iteration of the backup may be located in different locations (for example, more recent backups may be located on-site and older backups may be located off-site)
    • Have instructions and training available so that others know how to pull the backup and access the necessary data in case you are unavailable
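
One way to make periodic integrity checks routine is to record a checksum manifest when a backup is made and re-verify it later. The sketch below is a minimal Python example, assuming hypothetical backup and manifest paths; it is one possible approach, not a prescribed procedure.

```python
import hashlib
import json
from pathlib import Path

def md5sum(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(backup_dir: Path, manifest: Path) -> None:
    """Record a checksum for every file in the backup directory."""
    checksums = {str(p.relative_to(backup_dir)): md5sum(p)
                 for p in backup_dir.rglob("*") if p.is_file()}
    manifest.write_text(json.dumps(checksums, indent=2))

def verify_manifest(backup_dir: Path, manifest: Path) -> list:
    """Return the files whose current checksum no longer matches the manifest."""
    recorded = json.loads(manifest.read_text())
    return [name for name, checksum in recorded.items()
            if md5sum(backup_dir / name) != checksum]

# Hypothetical locations -- adjust to your own backup layout (keep the manifest
# outside the backup directory so it is not included in its own listing).
# write_manifest(Path("/mnt/backup/project_x"), Path("/mnt/backup/project_x.manifest.json"))
# damaged = verify_manifest(Path("/mnt/backup/project_x"), Path("/mnt/backup/project_x.manifest.json"))
```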

Ensure the Reliability of Your Storage Media

All storage media, whether hard drives, discs or data tapes, will wear out over time, rendering your data files inaccessible. To ensure ongoing access to both your active data files and your data archives, it is important to continually monitor the condition of your storage media and track its age. Older storage media and media that show signs of wear should be replaced immediately. Use the following guidelines to ensure the ongoing integrity and accessibility of your data:

  • Test Your Storage Media Regularly: As noted in the “Backup Your Data” best practice, it is important to routinely perform test retrievals or restorations of data you are storing for extended periods on hard drives, discs or tapes. It is recommended that storage media that is used infrequently be tested at least once a year to ensure the data is accessible.
  • Beware of Early Hardware Failures: A certain percentage of storage media will fail early due to manufacturing defects. In particular, hard drives, thumb drives and data tapes that have electronic or moving parts can be susceptible to early failure. When putting a new drive or tape into service, it is advisable to maintain a redundant copy of your data for 30 days until the new device “settles in.”
  • Determine the Life of Your Hard Drives: When purchasing a new drive unit, note the Mean Time Between Failure (MTBF) of the device, which should be listed on its specifications sheet (device specifications are usually packaged with the unit, or available online). The MTBF is expressed as the number of hours, on average, that a device can be used before it is expected to fail. Use the MTBF to calculate how long the device can be used before it needs to be replaced, and note that date on your calendar. For example, if the MTBF of a new hard drive is 2,500 hours and you anticipate having the unit powered on for 8 hours a day during the work week, the device should last about 62.5 weeks, or roughly 14 months, before it needs to be replaced. (A small calculation sketch follows this list.)
  • Routinely Inspect and Replace Data Discs: Contemporary CD and DVD discs are generally robust storage media that will fail more often from mishandling and improper storage than from deterioration. However, lower-quality discs can suffer from delamination (separation of the disc layers) or oxidation. It is advisable to inspect discs every year to detect early signs of wear. Immediately copy the data off of discs that appear to be warping or discolored. Data tapes are susceptible both to physical wear and to poor environmental storage conditions. In general, it is advisable to move data stored on discs and tapes to new media every 2-5 years (specific estimates on media longevity are available on the web).
  • Handle and Store Your Media With Care: All storage media types are susceptible to damage from dust and dirt exposure, temperature extremes, exposure to intense light, water penetration (more so for tapes and drives than discs), and physical shock. To help prolong its operational life, store your media in a dry environment with a comfortable and stable room temperature. Encapsulate all media in plastic during transportation. Provide cases or plastic sheaths for discs, and avoid handling them excessively.
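
The MTBF calculation above can be scripted so that a replacement date is produced automatically. The following small Python sketch assumes a duty cycle of 8 hours a day, 5 days a week; adjust the parameters to your own usage.

```python
from datetime import date, timedelta

def replacement_date(start: date, mtbf_hours: float,
                     hours_per_day: float = 8, days_per_week: float = 5) -> date:
    """Estimate when a drive reaches its rated MTBF under a given duty cycle."""
    hours_per_week = hours_per_day * days_per_week
    weeks_in_service = mtbf_hours / hours_per_week
    return start + timedelta(weeks=weeks_in_service)

# Example from the text: a 2,500-hour MTBF drive used 8 hours a day, 5 days a
# week reaches its rated MTBF about 62.5 weeks (roughly 14 months) after purchase.
print(replacement_date(date(2018, 1, 1), mtbf_hours=2500))
```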

Identify and Use Relevant Metadata Standards

Significant overlap often exists among metadata content standards. You should identify the standards that include the fields needed to describe your data. To do this, decide what information data users need in order to discover, use, and understand your data. Consider the who, what, when, where, how, why, and a description of quality. The description should provide enough information that users know what can and cannot be done with your data.

  • Who: The person and/or organization responsible for collecting and processing the data. Who should be contacted if there are questions about your data?
  • What: What parameters were measured or observed? What are the units of your measurements or results?
  • When: A description of the temporal characteristics of your data (e.g., time support, spacing, and extent).
  • Where: A description of the spatial characteristics of your data (e.g., spatial support, spacing, and extent). What is the geographic location at which the data were collected? What are the details of your field sensor deployment?
  • How: What methods were used (e.g., sensors, analytical instruments)? Did you collect physical samples or specimens? What analytical methods did you use to measure the properties of your samples/specimens? Is your result a field or laboratory result? Is your result an observation or a model simulation?
  • Why: What is the purpose of the study or the data collection? This can help others determine whether your data is fit for their particular purpose or not.
  • Quality: Describe the quality of the data, which will help others determine whether your data is fit for their purpose or not.

Considering a number of metadata content standards may help you fine-tune your metadata content needs. There may be content details or elements from multiple standards that can be added to your requirements to help users understand your data or methods. You wouldn’t know this unless you consider multiple content standards.

  • If the project or grant requirements define a particular metadata standard, incorporate it into the data management plan.
  • If the community has a recommended or most commonly used metadata standard, use it
  • Consider using a metadata standard that is interoperable with many systems, repositories, and harvesters
  • If the community’s preferred metadata standard is not widely interoperable, consider creating metadata using a simple but interoperable standard, e.g. Dublin Core, in addition to the main standard (a sketch follows this list).
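
As an illustration of a simple, interoperable standard, the sketch below builds a minimal Dublin Core record with Python's standard library. The element names come from the Dublin Core element set; the values and the identifier are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# Hypothetical record values -- replace with the details of your own dataset.
record = ET.Element("metadata")
for element, value in [
    ("title", "Example stream temperature observations, 2017-2018"),
    ("creator", "Doe, Jane"),
    ("subject", "hydrology"),
    ("description", "Hourly stream temperature from three hypothetical sites."),
    ("date", "2018-06-01"),
    ("type", "Dataset"),
    ("identifier", "doi:10.xxxx/example"),   # placeholder identifier
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```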

Useful Definitions:

  • Metadata Content Standard: A Standard that defines elements users can expect to find in metadata and the names and meaning of those elements.
  • Metadata Format Standard: A Standard that defines the structures and formats used to represent or encode elements from a content standard.

Identify Data Sensitivity

Steps for the identification of the sensitivity of data and the determination of the appropriate security or privacy level are:

  • Determine if the data has any confidentiality concerns
    • Can an unauthorized individual use the information to do limited, serious, or severe harm to individuals, assets, or an organization’s operations as a result of data disclosure?
    • Would unauthorized disclosure or dissemination of elements of the data violate laws, executive orders, or agency regulations (e.g., HIPAA or privacy laws)?
  • Determine if the data has any integrity concerns
    • What would be the impact of unauthorized modification or destruction of the data?
    • Would it reduce public confidence in the originating organization?
    • Would it create confusion or controversy in the user community?
    • Would a potentially life-threatening decision be made based on the data or analysis of the data?
  • Determine if there are any availability concerns about the data
    • Is the information time-critical? Will another individual or system be relying on the data to make a time-sensitive decision (e.g., sensing data for earthquakes, floods, etc.)?
  • Document data concerns identified and determine overall sensitivity (Low, Moderate, High)
    • Low criticality would result in a limited adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean degradation in mission capability or result in minor harm to individuals.
    • Moderate criticality would result in a serious adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean a severe degradation or loss of mission capability or result in significant harm to individuals that does not involve loss of life or serious life threatening injuries.
    • High criticality would result in a severe or catastrophic adverse effect as a result of the loss of confidentiality, integrity, or availability of the data. It might cause a severe degradation in or loss of mission capability or result in severe or catastrophic harm to individuals involving loss of life or serious life threatening injuries.
  • Develop data access and dissemination policies and procedures based on sensitivity of the data and need-to-know.
  • Develop data protection policies, procedures and mechanisms based on sensitivity of the data.

Identify Data with Long-term Value

As part of the data life cycle, research data will be contributed to a repository to support preservation and discovery. A research project may generate many different iterations of the same dataset - for example, the raw data from the instruments, as well as datasets which already include computational transformations of the data.

In order to focus resources and attention on these core datasets, the project team should define these core data assets as early in the process as possible, preferably at the conceptual stage and in the data management plan. It may be helpful to speak with your local data archivist or librarian in order to determine which datasets (or iterations of datasets) should be considered core, and which datasets should be discarded. These core datasets will be the basis for publications, and require thorough documentation and description.

  • Only the datasets which have significant long-term value should be contributed to a repository, requiring decisions about which datasets need to be kept.
  • If data cannot be recreated or it is costly to reproduce, it should be saved.
  • Four different categories of potential data to save are observational, experimental, simulation, and derived (or compiled).
  • Your funder or institution may have requirements and policies governing contribution to repositories.

Given the amount of data produced by scientific research, keeping everything is neither practical nor economically feasible.

Identify Suitable Repositories for the Data

Shaping the data management plan towards a specific desired repository will increase the likelihood that the data will be accepted into that repository and increase the discoverability of the data within the desired repository. When beginning a data management plan:

  • Look to the data management guidelines of the project/grant for a required repository
  • Ask colleagues what repositories are used in the community
  • Determine if your local institution has a repository that would be appropriate (and might be required) for depositing your data
  • Check the DataONE website for a list of potential repositories.

Plan Data Management Early in Your Project

A Data Management Plan should include the following information:

  • Types of data to be produced and their volume
    • Who will produce the data
  • Standards that will be applied
    • File formats and organization, parameter names and units, spatial and temporal resolution, metadata content, etc.
  • Methods for preserving the data and maintaining data integrity
    • What hardware / software resources are required to store the data
    • How will the data be stored and backed up
    • Describe the method for periodically checking the integrity of the data
  • Access and security policies
    • What access requirements does your sponsor have
    • Are there any privacy / confidentiality / intellectual property requirements
    • Who can access the data
      • During active data collection
      • When data are being analyzed and incorporated into publications
      • When data have been published
      • After the project ends
    • How should the data be cited and the data collectors acknowledged
  • Plans for eventual transition of the data to an archive after the project ends
    • Identify a suitable data center within your discipline
    • Establish an agreement for archival
    • Understand the data center’s requirements for submission and incorporate into data management plan

Plan for Effective Multimedia Management

Multimedia data present unique challenges for data discovery, accessibility, and metadata formatting and should be thoughtfully managed. Researchers should establish their own requirements for management of multimedia during and after a research project using the following guidelines. Multimedia data includes still images, moving images, and sound. The Library of Congress has a set of web pages discussing many of the issues to be considered when creating and working with multimedia data. Researchers should consider quality, functionality and formats for multimedia data. Transcriptions and captioning are particularly important for improving discovery and accessibility.

Storage of images solely on local hard drives or servers is not recommended. Unaltered images should be preserved at the highest resolution possible. Store original images in separate locations to limit the chance of overwriting and losing the original image.

Ensure that the policies of the multimedia repository are consistent with your general data management plan.

There are a number of options for metadata for multimedia data, with many MPEG standards (http://mpeg.chiariglione.org/), and other standards such as PBCore (http://pbcore.org).

The following web pages have sections describing considerations for quality and functionality and formats for each of still images, sound (audio) and moving images (video).

Sustainability of Digital Formats: Planning for Library of Congress Collections

Online, generic multimedia repositories and tools (e.g. YouTube, Vimeo, Flickr, Google Photos):

  • are low-cost (can be free)
  • are open to all
  • may provide community commenting and tagging
  • some provide support for explicit licenses and re-use
  • provide some options for valuable metadata such as geolocation
  • potential for large-scale dissemination
  • optimize usability and low barrier for participation
  • rely on commercial business models for sustainability
  • may have limits on file size or resolution
  • may have unclear access, backup, and reliability policies, so ensure you are aware of them before you rely upon them

Specialized multimedia repositories (e.g. MorphBank, LIFE):

  • provide domain-specific metadata fields and controlled vocabularies customized for expert users
  • are highly discoverable for those in the same domain
  • can provide assistance in curating metadata
  • optimize scientific use cases such as vouchering, image analysis
  • rely on research or institutional/federal funding
  • may require high-quality multimedia, completeness of metadata, or restrict manipulation
  • may not be open to all
  • may provide APIs for sharing or re-use for other projects
  • are recognized as high-quality, scientific repositories
  • may migrate multimedia to new formats (e.g. analog to digital)
  • may have restrictions on bandwidth usage

Some institutions or projects maintain digital asset management systems, content management systems, or other collections management software (e.g. Specify, KE Emu) that can manage multimedia along with other kinds of data; your project or institution should provide assistance with these. Such systems:

  • may be mandated by institution
  • may be more convenient, e.g. when multiple data types result from a project
  • may not be optimized for discovery, access, or re-use
  • usually not domain-specific
  • may or may not be suitable for long-term preservation

Preserve Information: Keep Raw Data Raw

In order to preserve the raw data for future use:

  • Do not make any changes / corrections to the original raw data file
  • Use a scripted language (e.g., R) or a software language that can be documented (e.g., C, Java, Python, etc.) to perform analyses or make corrections, and save that information in a separate file (see the sketch after this list)
    • The code, along with appropriate documentation will be a record of the changes
    • The code can be modified and rerun, using the raw data file as input, if needed
  • Consider making your original data file read-only, so it cannot be inadvertently altered.
  • Avoid making corrections in spreadsheet software and other graphical-user-interface-based software. They may seem convenient, but changes are made without a clear record of what was done or why. Spreadsheets provide incredible freedom and power for manipulating data, but if used inappropriately they can create tremendous problems. For this reason, special attention needs to be paid to adhering to best practices in organizing data in spreadsheets. Particularly important best practices that are also highlighted elsewhere are:
    • Data should be organized in columns, with each column representing only a single type of data (number, date, character string). An exception to this is that a header line containing column names (sometimes called variable or field names) may be placed at the top of each column.
    • Each data line should be complete; that is, each line of the data should contain a value for each column. Sometimes in spreadsheets, to promote human readability, values are provided only when they change; however, if the data are sorted, the relationships become scrambled. An exception to this rule is that if a data item is really missing (and not just omitted for human readability), a missing value code might be used. Additional best practices regarding consistent use of codes for categorical variables and informative field names also apply, but keeping the data in consistent and complete columns is the most important.
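
The following minimal Python sketch illustrates the scripted approach: the raw file is made read-only and a documented correction is written to a separate derived file, leaving the raw data untouched. The file paths, column names, and the correction itself are hypothetical placeholders.

```python
import csv
import os
import stat

RAW_FILE = "raw/stream_temps_2018.csv"            # hypothetical raw data file
CLEAN_FILE = "derived/stream_temps_2018_clean.csv"

# Make the raw file read-only so it cannot be altered by accident.
os.chmod(RAW_FILE, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
os.makedirs("derived", exist_ok=True)

with open(RAW_FILE, newline="") as src, open(CLEAN_FILE, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Documented correction: sensor 3 read 0.5 degrees C high before
        # 2018-03-01 (a hypothetical example of the kind of change that
        # belongs in a script rather than in a manual edit).
        if row["sensor_id"] == "3" and row["date"] < "2018-03-01":
            row["temperature_c"] = f"{float(row['temperature_c']) - 0.5:.2f}"
        writer.writerow(row)
```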

A key test is whether the data from a spreadsheet can be exported as a delimited text file, such as a comma-separated-value (.csv) file that can be read by other software. If columns are not consistent the resulting data may cause software such as relational databases (e.g., MySQL, Oracle, Access) or statistical software (e.g., R, SAS, SPSS) to record errors or even crash.

As a general rule, spreadsheets make a poor archival data format. Standards for spreadsheet file formats change frequently or go out of fashion. Even within a single software package (e.g., Excel) there is no guarantee that future versions of the software will read older file versions. For this reason, and as specified in other best practices, generic (e.g., text) formats such as comma-separated-value files are preferred.

Sometimes it is the formulae embedded in a spreadsheet, rather than the data values themselves that are important. In this case, the spreadsheet itself may need to be archived. The danger of the spreadsheet being rendered obsolete or uninterpretable may be reduced by exporting the spreadsheet in a variety of forms (e.g., both as .xls and as .xlsx formats). However the long-term utility of the spreadsheet may still depend on periodic opening of the archived spreadsheet and saving it into new forms.

Upgrades and new versions of software applications often perform conversions or modifications to data files produced in older versions, in many cases without notifying the user of the internal change(s).

Many contemporary software applications that advertise forward compatibility for older files actually perform significant modifications to both visible and internal file contents. While this is often not a problem, there are cases where important elements, such as numerical formulas in a spreadsheet, are changed significantly when they are converted to be compatible with a current software package. The following practices will help ensure that your data files maintain their original fidelity in the face of application updates and new releases:

  • Where practical, continue using the version of the software that was originally used to create the data file to view and manipulate the file contents (For example, if Excel 97 was used to create a spreadsheet that contains formulas and formatting, continue using Excel 97 to access those data files as long as possible).
  • When forced to use a newer version of a software package to open files created with an older version of the application, first save a copy of the original file as a safeguard against irretrievable modification or corruption.
  • Carefully inspect older files that have been opened/converted to be compatible with newer versions of an application to ensure data fidelity has been carried forward. Where possible, compare the converted files to copies of the original files to ensure there have been no data modifications during conversion.

Provide a Citation and Document Provenance for Your Dataset

For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:

  • Name the people responsible for the dataset throughout the lifetime of the dataset, including for each person:
    • Name
    • Contact information
    • Role (e.g., principal investigator, technician, data manager)

According to the International Polar Year Data and Information Service, an author is the individual(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the dataset. People responsible for the data can include: individuals, groups, compilers or editors.

  • Description of the context of the dataset with respect to a larger project or study (include links and related documentation), if applicable.
  • Revision history, including additions of new data and error corrections. Links to source data, if the data in one dataset were derived from data in another dataset.
  • List of project support (e.g., funding agencies, collaborators, material support).
  • Describe how to properly cite the dataset (a formatting sketch follows this list). The data citation should include:
    • All contributors
    • Date of dataset publication
    • Title of dataset
    • Media or URL
    • Data publisher
    • Identifier (e.g., Digital Object Identifier)
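
As an illustration only, the sketch below assembles the citation elements listed above into one common ordering. All values are hypothetical placeholders, and your repository or community may prescribe a different citation format.

```python
# Hypothetical citation elements -- substitute the real values for your dataset.
citation = {
    "contributors": "Doe, J., and R. Roe",
    "year": 2018,
    "title": "Example stream temperature observations, 2017-2018",
    "publisher": "Example University Data Repository",
    "identifier": "doi:10.xxxx/example",   # placeholder DOI
}

# One common ordering of the elements; follow your repository's preferred
# format if it specifies one.
print("{contributors} ({year}). {title}. {publisher}. {identifier}.".format(**citation))
```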

Provide Identifier for Dataset Used

In order to ensure replicable data access:

  • Choose a broadly utilized Data Identification Standard based on specific user community practices or preferences (a sketch follows this list), for example:
    • DOI
    • OIDs
    • ARKs
    • LSIDs
    • XRIs
    • URNs/URIs/URLs
    • UUIDs
  • Consistently apply the standard

  • Maintain the linkage

  • Participate in implementing infrastructure for consistent access to the resources referenced by the Identifier
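
The sketch below illustrates two ends of this practice: generating a locally unique identifier (a UUID) for internal use, and checking that a DOI resolves through the doi.org resolver. The DOI shown is a placeholder, not a real dataset identifier.

```python
import uuid
from urllib.request import Request, urlopen

# A UUID can be generated locally and is useful as an internal identifier.
dataset_uuid = uuid.uuid4()
print(f"Internal identifier: urn:uuid:{dataset_uuid}")

# DOIs resolve through https://doi.org/; a HEAD request checks that an
# identifier still resolves (the DOI below is a placeholder, not a real dataset).
doi = "10.xxxx/example"
request = Request(f"https://doi.org/{doi}", method="HEAD")
try:
    with urlopen(request) as response:
        print(f"doi:{doi} resolves to {response.url}")
except Exception as err:
    print(f"doi:{doi} did not resolve: {err}")
```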

Provide Version Information for Use and Discovery

Provide versions of data products with defined identifiers to enable discovery and use.

Items to consider when versioning data products:

  • Develop definition of what constitutes a new version of the data, for example:
    • New processing algorithms
    • Additions or removal of data points
    • Time or date range
    • Included parameters
    • Data format
    • Immutability of versions
  • Develop a standard naming convention for versions with associated descriptive information (a sketch follows this list)
  • Associate metadata with each version including the description of what differentiates this version from another version
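
As one possible convention, the sketch below pairs a versioned file name with a small side-car metadata file describing what distinguishes the version from its predecessor. The dataset name, version numbers, and change description are hypothetical placeholders.

```python
import json
from datetime import date

# Hypothetical versioning record; the fields mirror the considerations above.
version_info = {
    "dataset": "stream_temps",
    "version": "1.1.0",
    "date": str(date(2018, 6, 1)),
    "changes": "Reprocessed with updated calibration algorithm; 12 flagged points removed.",
    "previous_version": "1.0.0",
    "immutable": True,
}

# A simple naming convention pairs the version with the file name, and a
# side-car JSON file records what differentiates this version from the last.
data_file = f"{version_info['dataset']}_v{version_info['version']}.csv"
with open(f"{data_file}.version.json", "w") as f:
    json.dump(version_info, f, indent=2)
```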

Recognize Stakeholders in Data Ownership

When creating the data management plan, review all who may have a stake in the data so future users of the data can easily track who may need to give permission. Possible stakeholders include but are not limited to:

  • Funding body
  • Host institution for the project
  • Home institution of contributing researchers
  • Repository where the data are deposited

It is considered a matter of professional ethics to acknowledge the work of other scientists and provide appropriate citation and acknowledgment for subsequent distribution or publication of any work derived from stakeholder datasets. Data users are encouraged to consider consultation, collaboration, or co-authorship with original investigators.

Store Data with Appropriate Precision

Data should not be stored or reported with higher precision than that at which they were collected (e.g., if a device collects data to 2 decimal places, an Excel file should not present them to 5 decimal places). If the system stores data at higher precision, care needs to be taken when exporting to ASCII; for example, calculations in Excel are carried out at the highest precision possible for the system, which is unrelated to the precision of the original data.
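
For example, when exporting to a delimited text file, the write step can format values at the precision at which they were collected. The sketch below assumes hypothetical readings collected to 2 decimal places.

```python
import csv

# Hypothetical readings held at the system's full floating-point precision.
readings = [(1, 20.299999999999997), (2, 20.850000000000001), (3, 19.75)]

# The sensor reports to 2 decimal places, so export at that precision rather
# than at the precision of the intermediate calculations.
with open("temperatures.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "temperature_c"])
    for sample_id, value in readings:
        writer.writerow([sample_id, f"{value:.2f}"])
```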
