Plan to preserve data in the short term to minimize potential losses (e.g., via accidents), and in the long term so that project stakeholders and others can access, interpret, and use the data in the future. Decide what data to preserve, where to preserve it, and what documentation needs to accompany the data.
To avoid accidental loss of data you should:
Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:
If you must use an unconventional or unique vocabulary, it should be identified in the metadata and fully defined in the data documentation (attribute name, values, and definitions).
A backup policy helps manage users’ expectations and provides specific guidance on the “who, what, when, and how” of the data backup and restore process. There are several benefits to documenting your data backup policy:
Keeping this information in one place makes it easy for anyone who needs it to find it. In addition, if a backup policy is in place, anyone new to the project or office can be given the documentation, which will help orient them and provide guidance.
The process of science generates a variety of products that are worthy of preservation. Researchers should consider all elements of the scientific process in deciding what to preserve:
When deciding on what data products to preserve, researchers should consider the costs of preserving data:
Researchers should consider the following goals and benefits of preservation:
In order for a large dataset to be effectively used by a variety of end users, the following procedures for preparing a virtual dataset are recommended:
Identify data service users
Define service interfaces based upon Open Standards. For example:
Publish service metadata for published services based upon Open Standards. For example:
For successful data replication and backup:
Document all procedures (e.g., compression / decompression process) to ensure a successful recovery from a backup copy
To check the integrity of the backup file, periodically retrieve your backup file, open it on a separate system, and compare it to the original file
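One way to automate this integrity check is to compare cryptographic checksums of the original file and the retrieved backup copy. The sketch below uses Python's standard hashlib module; the function names are illustrative, not part of any prescribed backup workflow:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in chunks
    so that large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(original_path, backup_path):
    """Return True if the backup is byte-identical to the original."""
    return sha256_of(original_path) == sha256_of(backup_path)
```

Recording the checksum alongside the backup also lets you detect later media degradation without needing access to the original file.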
All storage media, whether hard drives, discs, or data tapes, wear out over time, eventually rendering your data files inaccessible. To ensure ongoing access to both your active data files and your data archives, it is important to continually monitor the condition of your storage media and track their age. Replace older storage media, and any media that show signs of wear, immediately. Use the following guidelines to ensure the ongoing integrity and accessibility of your data:
Significant overlap often exists among metadata content standards. You should identify the standards that include the fields needed to describe your data. To do so, decide what information data users require to discover, use, and understand your data: consider the who, what, when, where, how, and why, along with a description of quality. The description should provide enough information that users know what can and cannot be done with your data.
Considering several metadata content standards may help you fine-tune your metadata content needs. Multiple standards may contribute content details or elements worth adding to your requirements to help users understand your data or methods; you would not discover these by considering only a single content standard.
Useful Definitions:
Steps for the identification of the sensitivity of data and the determination of the appropriate security or privacy level are:
As part of the data life cycle, research data will be contributed to a repository to support preservation and discovery. A research project may generate many different iterations of the same dataset: for example, the raw data from the instruments, as well as datasets that already include computational transformations of the data.
In order to focus resources and attention on these core datasets, the project team should define these core data assets as early in the process as possible, preferably at the conceptual stage and in the data management plan. It may be helpful to speak with your local data archivist or librarian in order to determine which datasets (or iterations of datasets) should be considered core, and which datasets should be discarded. These core datasets will be the basis for publications, and require thorough documentation and description.
Given the amount of data produced by scientific research, keeping everything is neither practical nor economically feasible.
Shaping the data management plan toward a specific repository increases the likelihood that the data will be accepted into that repository and improves the data's discoverability within it. When beginning a data management plan:
A Data Management Plan should include the following information:
Multimedia data present unique challenges for data discovery, accessibility, and metadata formatting and should be thoughtfully managed. Researchers should establish their own requirements for management of multimedia during and after a research project using the following guidelines. Multimedia data includes still images, moving images, and sound. The Library of Congress has a set of web pages discussing many of the issues to be considered when creating and working with multimedia data. Researchers should consider quality, functionality and formats for multimedia data. Transcriptions and captioning are particularly important for improving discovery and accessibility.
Storage of images solely on local hard drives or servers is not recommended. Unaltered images should be preserved at the highest resolution possible. Store original images in separate locations to limit the chance of overwriting and losing the original image.
Ensure that the policies of the multimedia repository are consistent with your general data management plan.
There are a number of options for metadata for multimedia data, with many MPEG standards (http://mpeg.chiariglione.org/), and other standards such as PBCore (http://pbcore.org).
The following web pages have sections describing considerations for quality and functionality and formats for each of still images, sound (audio) and moving images (video).
Sustainability of Digital Formats Planning for Library of Congress Collections:
Online, generic multimedia repositories and tools (e.g. YouTube, Vimeo, Flickr, Google Photos):
Specialized multimedia repositories (e.g. MorphBank, LIFE):
Some institutions or projects maintain digital asset management systems, content management systems, or other collections management software (e.g., Specify, KE Emu) that can manage multimedia along with other kinds of data; such projects or institutions should provide assistance:
In order to preserve the raw data for future use:
A key test is whether the data from a spreadsheet can be exported as a delimited text file, such as a comma-separated-value (.csv) file that can be read by other software. If columns are not consistent, the resulting data may cause software such as relational databases (e.g., MySQL, Oracle, Access) or statistical software (e.g., R, SAS, SPSS) to record errors or even crash.
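This test can be run programmatically: export the spreadsheet as .csv and verify that every row has the same number of fields as the header. The following sketch uses Python's standard csv module; the function name and return format are illustrative:

```python
import csv

def check_column_consistency(csv_path, delimiter=","):
    """Report rows whose field count differs from the header's.

    Returns a list of (row_number, field_count) tuples for
    inconsistent rows; an empty list means every row matches.
    """
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        expected = len(header)
        # Data rows start at line 2 (line 1 is the header).
        for row_number, row in enumerate(reader, start=2):
            if len(row) != expected:
                problems.append((row_number, len(row)))
    return problems
```

A non-empty result flags exactly the rows that would trip up a database import or statistical package.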
As a general rule, spreadsheets make a poor archival data format. Standards for spreadsheet file formats change frequently or go out of fashion. Even within a single software package (e.g., Excel) there is no guarantee that future versions of the software will read older file versions. For this reason, and as specified in other best practices, generic (e.g., text) formats such as comma-separated-value files are preferred.
Sometimes it is the formulae embedded in a spreadsheet, rather than the data values themselves, that are important. In this case, the spreadsheet itself may need to be archived. The danger of the spreadsheet being rendered obsolete or uninterpretable may be reduced by exporting the spreadsheet in a variety of forms (e.g., both as .xls and as .xlsx formats). However, the long-term utility of the spreadsheet may still depend on periodically opening the archived spreadsheet and saving it into new forms.
Upgrades and new versions of software applications often perform conversions or modifications to data files produced in older versions, in many cases without notifying the user of the internal change(s).
Many contemporary software applications that advertise forward compatibility for older files actually perform significant modifications to both visible and internal file contents. While this is often not a problem, there are cases where important elements, such as numerical formulas in a spreadsheet, are changed significantly when they are converted to become compatible with a current software package. The following practices will help ensure that your data files maintain their original fidelity in the face of application updates and new releases:
For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:
According to the International Polar Year Data and Information Service, an author is the individual(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the dataset. People responsible for the data can include individuals, groups, compilers, or editors.
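As a concrete illustration, attribution and provenance details can be kept in a small structured record alongside the data. The field names below are assumptions chosen for readability, not a prescribed standard; map them onto whatever metadata standard your community uses:

```python
import json

# Illustrative attribution record. Every field name and value here
# is a placeholder example, not a real dataset or DOI.
attribution = {
    "authors": ["A. Researcher", "B. Collaborator"],
    "title": "Example stream-chemistry dataset",
    "year": 2024,
    "version": "1.0",
    "publisher": "Example Data Repository",
    "identifier": "doi:10.xxxx/example",  # placeholder identifier
}

# Serialize to JSON so the record can travel with the data files.
print(json.dumps(attribution, indent=2))
```

Storing such a record as a plain-text companion file keeps the attribution information readable even if the dataset's native format becomes obsolete.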
In order to ensure replicable data access:
Consistently apply the standard
Maintain the linkage
Provide versions of data products with defined identifiers to enable discovery and use.
Items to consider when versioning data products:
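One lightweight approach to versioned identifiers, sketched below, embeds a version number and release date in the data product's file name. The naming scheme is an assumption for illustration, not a community standard; repositories often assign formal identifiers (such as DOIs) per version instead:

```python
from datetime import date

def versioned_name(base, major, minor, ext="csv", release_date=None):
    """Build a version-stamped file name,
    e.g. 'streamflow_v1.2_2024-05-01.csv' (hypothetical dataset name)."""
    release_date = release_date or date.today()
    return f"{base}_v{major}.{minor}_{release_date.isoformat()}.{ext}"
```

Keeping the version in the name (rather than overwriting files in place) means every published analysis can point at exactly the iteration of the data it used.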
When creating the data management plan, review all who may have a stake in the data so future users of the data can easily track who may need to give permission. Possible stakeholders include but are not limited to:
- Funding body
- Host institution for the project
- Home institution of contributing researchers
- Repository where the data are deposited
It is considered a matter of professional ethics to acknowledge the work of other scientists and provide appropriate citation and acknowledgment for subsequent distribution or publication of any work derived from stakeholder datasets. Data users are encouraged to consider consultation, collaboration, or co-authorship with original investigators.
Data should not be recorded with higher precision than they were collected at (e.g., if a device collects data to 2 decimal places, an Excel file should not present them to 5 decimal places). If the system stores data at higher precision, take care when exporting to ASCII: calculations in Excel, for example, are performed at the highest precision the system supports, which is unrelated to the precision of the original data.
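A simple guard when exporting is to format each value to the instrument's known precision rather than the system's. A minimal Python sketch (the function name is illustrative):

```python
def format_to_precision(value, decimal_places):
    """Round and format a measurement to the instrument's precision."""
    return f"{value:.{decimal_places}f}"

# A value computed at full floating-point precision...
computed = 12.3456789
# ...exported at the 2-decimal-place precision of the instrument:
exported = format_to_precision(computed, 2)  # → "12.35"
```

Applying this at export time keeps the ASCII file honest about measurement precision while leaving the full-precision values available internally for computation.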