
Data Management and Project Planning

A guide to data management and planning resources and practices

Description

Document data by describing the why, who, what, when, where, and how of the data. Metadata, or data about data, are key to data sharing and reuse, and many tools such as standards and software are available to help describe data.

See more below:

Assign Descriptive File Names

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

When choosing a file name, check for any database management limitations on file name length and use of special characters. In general, lower-case names are less software- and platform-dependent. Avoid spaces and special characters in file names, directory paths, and field names; automated processing, URLs, and other systems often use spaces and special characters for parsing text strings. Instead, use underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names, and avoid $ % ^ & # and similar symbols.
If versioning is desired, a date string within the file name is recommended to indicate the version.

Avoid using file names such as mydata.dat or 1998.dat.
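The naming conventions above can be sketched in code; a minimal example that assembles a portable file name from its parts (all project, site, and data-type names here are illustrative, not a fixed standard):

```python
from datetime import date

def make_file_name(project, site, data_type, version_date, ext):
    """Build a descriptive, portable file name.

    Lower-case, underscore-separated parts; spaces replaced with dashes;
    a date string indicates the version. All inputs are illustrative.
    """
    parts = [project, site, data_type, version_date.strftime("%Y%m%d")]
    name = "_".join(p.lower().replace(" ", "-") for p in parts)
    return f"{name}.{ext}"

print(make_file_name("sevilleta", "deep well", "soil temp", date(1998, 8, 12), "csv"))
# sevilleta_deep-well_soil-temp_19980812.csv
```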

Choose and Use Standard Terminology to Enable Discovery

Terms and phrases that are used to represent categorical data values or for creating content in metadata records should reflect appropriate and accepted vocabularies in your community or institution. Methods used to identify and select the proper terminology include:

  • Identify the relevant descriptive terms used as categorical values in your community prior to start of the project (e.g., standard terms describing soil horizons, plant taxonomy, sampling methodology or equipment, etc.)
  • Identify locations in metadata where standardized terminology should be used and sources for the terms. Terminology should reflect both data type/content and access methods.
  • Review existing thesauri, ontologies, and keyword lists before creating new terms. Potential sources include the Semantic Web for Earth and Environmental Terminology (SWEET), Planetary Ontologies, and the NASA Global Change Master Directory (GCMD)
  • Enforce use of standard terminology in your workflow, including:
    • Use of lookup tables in data-entry forms
    • Use of field-level constraints in databases (restrict data import to match accepted domain values)
    • Use of XML validation
    • Manual review
  • Publish metadata using Open Standards, for example:
    • z39.50
    • OGC Catalog Services for Web (CSW)
    • Web Accessible Directory (WAD)

If you must use an unconventional or unique vocabulary, it should be identified in the metadata and fully defined in the data documentation (attribute name, values, and definitions).
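Enforcement of standard terminology (lookup tables, field-level constraints) can also be approximated in a script; a minimal sketch using a hypothetical land-cover vocabulary:

```python
# Hypothetical controlled vocabulary for a land-cover attribute.
ACCEPTED_COVER_TYPES = {"forest", "grassland", "wetland", "urban", "other"}

def validate_terms(values, vocabulary):
    """Return the values that are not in the controlled vocabulary."""
    return {v for v in values if v not in vocabulary}

# "Forest" fails on case; "shrubland" is not an accepted term.
bad = validate_terms(["forest", "Forest", "wetland", "shrubland"], ACCEPTED_COVER_TYPES)
```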

Confirm a Match Between Data and Their Description in Metadata

To ensure that metadata correctly describe what is actually in a data file, visual inspection or analysis should be done by someone not otherwise familiar with the data and their format. This will confirm that the metadata are sufficient to describe the data. For example, statistical software can be used to summarize data contents to make sure that data types, ranges, and, for categorical data, the values found are as described in the documentation/metadata.
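Such a check can be partly automated; a stdlib-only sketch that summarizes a made-up in-memory data file for comparison against its metadata:

```python
import csv
import io

# A small in-memory sample standing in for a real data file (values are invented).
sample = """site,temp_c,cover
A,12.5,forest
B,-9999,grassland
C,15.1,forest
"""

rows = list(csv.DictReader(io.StringIO(sample)))
temps = [float(r["temp_c"]) for r in rows if r["temp_c"] != "-9999"]  # skip missing-value code
summary = {
    "n_rows": len(rows),
    "temp_min": min(temps),
    "temp_max": max(temps),
    "cover_values": sorted({r["cover"] for r in rows}),
}
# Compare summary against the ranges and categorical values stated in the metadata.
```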

Create a Data Dictionary

A data dictionary provides a detailed description for each element or variable in your dataset and data model. Data dictionaries are used to document important and useful information such as a descriptive name, the data type, allowed values, units, and text description. A data dictionary provides a concise guide to understanding and using the data.
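A data dictionary can be kept as a simple table alongside the data; a sketch with invented variable definitions, written out as CSV:

```python
import csv
import io

# A minimal data dictionary, one row per variable (all definitions are illustrative).
data_dictionary = [
    {"variable": "temp_c", "type": "float", "units": "deg C",
     "allowed_values": "-40 to 60; -9999 = missing",
     "description": "air temperature at 2 m"},
    {"variable": "cover", "type": "text", "units": "",
     "allowed_values": "forest|grassland|wetland",
     "description": "dominant land cover class"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=data_dictionary[0].keys())
writer.writeheader()
writer.writerows(data_dictionary)
dictionary_csv = buf.getvalue()  # store this file next to the data it describes
```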

Define the Data Model

A data model documents and organizes data, how it is stored and accessed, and the relationships among different types of data. The model may be abstract or concrete.

Use these guidelines to create a data model:

  1. Identify the different data components: consider raw and processed data, as well as associated metadata (these are called entities)
  2. Identify the relationships between the different data components (these are called associations)
  3. Identify anticipated uses of the data (these are called requirements), with recognition that data may be most valuable in the future for unanticipated uses
  4. Identify the strengths and constraints of the technology (hardware and software) that you plan to use during your project (this is called a technology assessment phase)
  5. Build a draft model of the entities and their relations, attempting to keep the model independent from any specific uses or technology constraints.
  6. Incorporate intended usage and technology constraints as needed to derive the simplest, most general model possible
  7. Test the model with different scenarios, including best- and worst-case (worst-case includes problems such as invalid raw data, user mistakes, failing algorithms, etc.)
  8. Repeat these steps to optimize the model

Define the Parameters

The parameters reported in the data set need names that clearly describe their contents. Ideally, the names should be standardized across files, data sets, and projects so that others can readily use the information.

The documentation should contain a full description of the parameter, including the parameter name, how it was measured, the units, and the abbreviation used in the data file.

A missing value code should also be defined. Use the same notation for every missing value in the data set. Use an extreme value (e.g., -9999) and do not use character codes in a numeric field. Supply a flag or tag in a separate field to briefly explain the reason for the missing data.

Within the data file use commonly accepted abbreviations for parameter names, for example, Temp for temperature, Precip for precipitation, Lat and Long for latitude and longitude. See the references in the Bibliography for additional examples. Some systems still have length limitations for column names (e.g., 13 characters in ArcGIS); lower-case column names are generally more transferable between systems; and spaces and special characters should not be used in attribute names. Only numbers, letters, and underscores ("_") transfer easily between systems.

Also, be sure to use consistent capitalization (not temp, Temp, and TEMP in the same file).
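A small helper can apply the missing-value convention consistently when reading data (the -9999 code follows the recommendation above; everything else is illustrative):

```python
MISSING = -9999.0  # extreme-value code for missing data, as recommended above

def read_value(raw):
    """Convert a raw numeric field to float, mapping the missing-value code to None."""
    value = float(raw)
    return None if value == MISSING else value
```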

Describe Format for Spatial Location

Spatial coordinates should be reported in decimal degrees format to at least 4 (preferably 5 or 6) places past the decimal point. A precision of +/- 0.00001 degrees corresponds to about 1.11 meters at the equator. This does not include uncertainty introduced by a GPS instrument.

Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80° 30′ 00″ W longitude is -80.5000.

Make sure that all location information in a file uses the same coordinate system, including coordinate type, datum, and spheroid. Document all three of these characteristics (e.g., Lat/Long decimal degrees, NAD83 (North American Datum of 1983), WGS84 (World Geodetic System of 1984)). Mixing coordinate systems [e.g., NAD83 and NAD27 (North American Datum of 1927)] will cause errors in any geographic analysis of the data.

If locating field sites is more convenient using the Universal Transverse Mercator (UTM) coordinate system, be sure to record the datum and UTM zone (e.g., NAD83 and Zone 15N), and the easting and northing coordinate pair in meters, to ensure that UTM coordinates can be converted to latitude and longitude.

To assure the quality of the geospatial data, plot the locations on a map and visually check the location.
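The conversion from degrees/minutes/seconds to signed decimal degrees can be sketched as follows (no datum handling is attempted here):

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    South latitude and west longitude are returned as negative values,
    rounded to 5 decimal places (about 1.11 m at the equator).
    """
    dd = degrees + minutes / 60 + seconds / 3600
    if hemisphere.upper() in ("S", "W"):
        dd = -dd
    return round(dd, 5)

print(dms_to_decimal(80, 30, 0, "W"))  # -80.5
```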

Describe Formats for Date and Time

For date, always include four digit year and use numbers for months. For example, the date format yyyy-mm-dd would appear as 2011-03-15 (March 15, 2011).

If Julian day (day of year) is used, make sure the year field is also supplied. For example, the format ddd.yyyy would appear as 122.2011, where ddd is the Julian day.

If the date is not completely known (e.g., the day is not known), separate the columns into the parts that do exist (e.g., separate columns for year and month). Don't introduce a day just because the database date format requires it.

For time, use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report in both local time and Coordinated Universal Time (UTC). Include local time zone in a separate field. As appropriate, both the begin time and end time should be reported in both local and UTC time. Because UTC and local time may be on different days, we suggest that dates be given for each time reported.

Be consistent in date and time formats within one data set.
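The date/time recommendations above map directly onto the stdlib datetime module; a sketch using a fixed, hypothetical UTC-5 local zone:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical local zone: a fixed UTC-5 offset standing in for the site's time zone.
local_tz = timezone(timedelta(hours=-5), "EST")

local = datetime(2011, 3, 15, 13, 30, tzinfo=local_tz)  # 24-hour local time
utc = local.astimezone(timezone.utc)                    # same instant in UTC

local_str = local.strftime("%Y-%m-%d %H:%M")  # 2011-03-15 13:30
utc_str = utc.strftime("%Y-%m-%d %H:%M")      # 2011-03-15 18:30
```

Reporting both strings, with the zone in a separate field, keeps the local and UTC records unambiguous even when they fall on different days.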

Describe Measurement Techniques

Data measurement descriptions should:

  • Describe data collection methods or protocols (can include diagrams, images, schematics, etc.)
    • How the data were collected
    • Measurement frequency and regularity
  • Describe instrumentation
    • Include manufacturer, model number, dates in use
    • Maintenance/repair history
    • Malfunction history
    • Calibration methods, scale, detection limits, and history
  • Document measurement uncertainty, including accuracy, precision, and reproducibility. Provide values in the context of the measurements, e.g., standard error, standard deviation, confidence limits.

Describe Method to Create Derived Data Products

When describing the process for creating derived data products, the following information should be included in the data documentation or the companion metadata file:

  • Description of primary input data and derived data
  • Why processing is required
  • Data processing steps and assumptions
    • Assumptions about primary input data
    • Additional input data requirements
    • Processing algorithm (e.g., volts to mol fraction, averaging)
    • Assumptions and limitations of algorithm
    • Describe how algorithm is applied (e.g., manually, using R, IDL)
  • How outcome of processing is evaluated
    • How problems are identified and rectified
    • Tools used to assess outcome
    • Conditions under which reprocessing is required
  • How uncertainty in processing is assessed
    • Provide a numeric estimate of uncertainty
  • How processing technique changes over time, if applicable

Describe the Contents of Data Files

A description of the contents of the data file should contain the following:

  • Define the parameters and the units on the parameter
  • Explain the formats for dates, time, geographic coordinates, and other parameters
  • Define any coded values
  • Describe quality flags or qualifying values
  • Define missing values

Describe the Overall Organization of your Dataset

Data sets or collections are often composed of multiple files that are related. Files may have come from (or still be stored in) a relational database, and the relationships among the data tables or other entities are important if the data are to be reused. These relationships should be documented for a repository.

Describe the overall organization of your data set or collection. Often, a data set or collection contains a large number of files, perhaps organized into a number of directories or database tables. By describing and documenting this organization, files and data can be easily located and used.

At a minimum, the organization and relationships between the directories and files, or database tables and other supporting materials, need to be fully described. Use a description of the data set or collection (e.g, an abstract) to describe what tables contain, where the supporting material, metadata, or other documentation are located, and/or descriptions of directory contents. Consider describing the logical relationships between data entities using an entity relationship diagram (ERD).

Associated specimens: if specimens (e.g., taxonomic vouchers, DNA samples) were collected with the data, include the name of the repository in which these specimens reside.

Describe the Research Project

The research project description should contain the following information:

  • Who: project personnel (principal investigator, researchers, technicians, others)
  • Where: location and description of study site or sites
  • When: range of dates for the project
  • Why: rationale for the project (abstract)
  • How: description of project methods
Other useful information might include the project title, the overarching project (if any), institution(s) involved, and source of funding.

Describe the Sensor Network

If your project uses a sensor network, you should describe and document that network and the instruments it uses. This information is essential to understanding and interpreting the data you use, and should be included as a part of the metadata generated for your project’s data.

  • Describe the basic set-up of the sensor network installation, including such details as mount, power source, enclosures, wiring protection, etc.
  • Describe instrumentation, cameras and samplers (See “Describe measurement techniques” Best Practice in DataONEpedia)
  • Describe data loggers used by the network. Include the following:
    • Manufacturer, model, serial number, dates in use
    • Maintenance/repair history
    • Malfunction history
    • Deployment history
    • Replacement history
  • Ensure localization and time synchronization across data logger arrays
  • Archive copies of any custom scripts, software, or programs used. Scripts and programs should be accompanied by documentation that includes any information pertinent to their use (metadata).
  • As part of metadata, create a human-readable document that describes sampling frequency and data processing performed by the data logger

Describe the Spatial Extent and Resolution of Your Dataset

The spatial extent of your data set or collection as a whole should be described. The minimum acceptable description would be a bounding box describing the northernmost, southernmost, westernmost, and easternmost limits of the data.

  • If the entire collection is from a single location, use the same values for northerly/southerly limits and easterly/westerly values.
  • Be sure to specify in the metadata what units you choose to describe your spatial extent.
  • Use the following guidelines for quality control:
    • If the collection spans the north pole, the northerly limit should be 90.0 degrees
    • If the collection spans the south pole, the southerly limit should be -90.0 degrees
    • If the collection crosses the date line, the westerly limit should be greater than the easterly limit
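The quality-control guidelines above can be expressed as simple checks; a sketch for a bounding box in decimal degrees:

```python
def check_bounding_box(north, south, east, west):
    """Return a list of problems found in a decimal-degree bounding box.

    An empty list means the box passed. A box crossing the date line
    legitimately has a westerly limit greater than the easterly limit,
    so no east/west ordering is enforced here.
    """
    problems = []
    if not -90 <= south <= north <= 90:
        problems.append("latitude limits out of order or out of range")
    for name, lon in (("east", east), ("west", west)):
        if not -180 <= lon <= 180:
            problems.append(f"{name} longitude out of range")
    return problems
```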

If your data collection or dataset as a whole contains data acquired over a range of spatial locations during each collection period, it is important to document the spatial resolution of your dataset. Many metadata standards have standard terminology for describing data spacing or resolution (e.g. every half degree, 250 m resolution, etc.), but it may be necessary to describe complex data acquisition schemes textually.

Describe the Temporal Extent and Resolution of Your Dataset

The temporal extent over which the data within your dataset or collection were acquired should be described. Normally this is done by providing:

  • the earliest date of data acquisition
  • the date that the last data in the collection was acquired

Year, month, day, and time should be included in the description. If data collection is still ongoing, the end date can be omitted, though some statement about this should be placed in the dataset abstract. The status of the data set should indicate that data collection is still ongoing if the metadata standard being used supports this type of documentation.

Describe the temporal resolution of your dataset collection. The temporal resolution of your dataset is the frequency with which data is collected or acquired. While many metadata standards provide standard nomenclature for describing simple temporal resolutions (e.g., daily or monthly), more complex temporal collection patterns may need to be described textually.

Describe the Units of Measurement for Each Observation

The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units (The International System of Units) but recognize that each discipline has its own commonly used units of measure. The critical aspect here is that the units be defined so that others understand what is reported.

Do not use abbreviations when describing the units. For example, the units for respiration are moles of carbon dioxide per square meter per year.

Document and Store Data Using Stable File Formats

File formats are important for understanding how data can be used and possibly integrated. The following issues need to be documented:

  • Does the file format of the data adhere to one or more standards?
  • Is that file standard an open (i.e., non-proprietary) or closed (i.e., proprietary) format?
  • Is a particular software package required to read and work with the data file? If so, the software package, version, and operating system platform should be cited in the metadata
  • Do multiple files comprise the data file structure? If so, that should be specified in the metadata

When choosing a file format, data collectors should select a consistent format that can be read well into the future and is independent of changes in applications.

  • Appropriate file types include:
    • Non-proprietary: Open, documented standard
    • Common usage by research community: Standard representation (ASCII, Unicode)
    • Unencrypted
    • Uncompressed
  • ASCII formatted files will be readable into the future
    • Use ASCII (comma-separated) for tabular data
  • For geospatial (raster) data the following provide a stable format:
    • GeoTIFF/TIFF
    • ASCII Grid
    • Binary image files
    • NetCDF
    • HDF or HDF-EOS
  • For vector data use the following file formats (these are mostly proprietary data formats; be sure to document the software package, version, vendor, and native platform):
    • ArcView software – store all components of an ArcView shapefile (*.shp, *.shx, *.sbx, *.sbn, *.prj, and *.dbf files)
    • ENVI – *.evf (ENVI vector file)
    • ESRI Arc/Info export file (*.e00)

Document Steps Used in Data Processing

Different types of new data may be created in the course of a project, for instance visualizations, plots, statistical outputs, a new dataset created by integrating multiple datasets, etc. Whenever possible, document your workflow (the process used to clean, analyze and visualize data) noting what data products are created at each step. Depending on the nature of the project, this might be as a computer script, or it may be notes in a text file documenting the process you used (i.e. process metadata). If workflows are preserved along with data products, they can be executed and enable the data product to be reproduced.

Document Taxonomic Information

Identification of any species represented in the data set should be as complete as possible.

  • Use a standard taxonomy whenever possible
  • Full taxonomic tree to most specific level available
  • Source of taxonomy should accompany taxonomic tree (if available)
  • References used for taxonomic identification should be provided, if appropriate (e.g. technical document, journal article, book, database, person, etc.)

Examples of standardized identification systems:

  • Integrated Taxonomic Information System (http://www.itis.gov/)
  • Species 2000 (http://www.sp2000.org/)
  • USDA Plants (http://plants.usda.gov/index.html)
  • Global Biodiversity Information Facility (http://www.gbif.org/informatics/name-services/using-names-data/)

Document Your Data Organization Strategy

The following are strategies for effective data organization:

  • Sparse matrix: optimal data models for storing data avoid sparse matrices; if many data points within a matrix are empty, a data table with one column for parameters and one column for values may be more appropriate.
  • Repetitive information in a wide matrix: repeated categorical information is best handled in separate tables to reduce redundancy in the data table. In database design this is called normalization of data.
  • Column name is a value or repeating group: if the column name contains variable information (e.g., date or species name), the parameter/value organization of data is recommended for storage. Although a wide matrix is needed for statistical analysis and graphing, it cannot easily be queried or subset in that format.
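The wide-to-long (parameter/value) reorganization described above can be sketched as follows (plot IDs, species names, and counts are invented):

```python
# A wide matrix with one column per species (an illustrative layout).
wide = [
    {"plot": "P1", "quercus_alba": 4, "acer_rubrum": 0},
    {"plot": "P2", "quercus_alba": 1, "acer_rubrum": 7},
]

# Reshape into a long parameter/value table that is easier to query and extend.
long_rows = [
    {"plot": row["plot"], "species": sp, "count": n}
    for row in wide
    for sp, n in row.items()
    if sp != "plot" and n != 0  # drop empty cells to avoid a sparse table
]
```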


Identify and Use Relevant Metadata Standards

Many times significant overlap exists among metadata content standards. You should identify those standards that include the fields needed to describe your data. In order to describe your data, you need to decide what information is required for data users to discover, use, and understand your data. The who, what, when, where, how, why, and a description of quality should be considered. The description should provide enough information so that users know what can and cannot be done with your data.

  • Who: The person and/or organization responsible for collecting and processing the data. Who should be contacted if there are questions about your data?
  • What: What parameters were measured or observed? What are the units of your measurements or results?
  • When: A description of the temporal characteristics of your data (e.g., time support, spacing, and extent).
  • Where: A description of the spatial characteristics of your data (e.g., spatial support, spacing, and extent). What is the geographic location at which the data were collected? What are the details of your field sensor deployment?
  • How: What methods were used (e.g., sensors, analytical instruments, etc.). Did you collect physical samples or specimens? What analytical methods did you use to measure the properties of your samples/specimens? Is your result a field or laboratory result? Is your result an observation or a model simulation?
  • Why: What is the purpose of the study or the data collection? This can help others determine whether your data is fit for their particular purpose or not.
  • Quality: Describe the quality of the data, which will help others determine whether your data is fit for their purpose or not.

Considering a number of metadata content standards may help you fine-tune your metadata content needs. There may be content details or elements from multiple standards that can be added to your requirements to help users understand your data or methods. You wouldn’t know this unless you consider multiple content standards.

  • If the project or grant requirements define a particular metadata standard, incorporate it into the data management plan.
  • If the community has a recommended or commonly used metadata standard, use it
  • Consider using a metadata standard that is interoperable with many systems, repositories, and harvesters
  • If the community’s preferred metadata standard is not widely interoperable, consider creating metadata using a simple but interoperable standard, e.g. Dublin Core, in addition to the main standard.

Useful Definitions:

  • Metadata Content Standard: A Standard that defines elements users can expect to find in metadata and the names and meaning of those elements.
  • Metadata Format Standard: A Standard that defines the structures and formats used to represent or encode elements from a content standard.

Maintain Consistent Data Typing

Choose the right data type and precision for data in each column. As examples: (1) use date fields for dates; and (2) use numerical fields with decimal places precision. Comments and explanations should not be included in a column that is meant to include numeric values only. Comments should be included in a separate column that is designed for text. This allows users to take advantage of specialized search and computing functionality and improves data quality. If a particular spreadsheet or software system does not support data typing, it is still recommended that one keep the data type consistent within a column and not mix numbers, dates and text.
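Mixed types within a column can be detected with a one-line check; a sketch with an invented example of a comment leaking into a numeric column:

```python
def column_types(rows, column):
    """Report the distinct Python types that appear in one column."""
    return {type(r[column]).__name__ for r in rows}

rows = [
    {"temp": 12.5},
    {"temp": 13.0},
    {"temp": "sensor offline"},  # a comment mixed into a numeric column
]
# column_types(rows, "temp") reveals the inconsistency: {"float", "str"}
```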

Provide a Citation and Document Provenance for Your Dataset

For appropriate attribution and provenance of a dataset, the following information should be included in the data documentation or the companion metadata file:

  • Name the people responsible for the dataset throughout the lifetime of the dataset, including for each person:
    • Name
    • Contact information
    • Role (e.g., principal investigator, technician, data manager)

According to the International Polar Year Data and Information Service, an author is the individual(s) whose intellectual work, such as a particular field experiment or algorithm, led to the creation of the dataset. People responsible for the data can include: individuals, groups, compilers or editors.

  • Description of the context of the dataset with respect to a larger project or study (include links and related documentation), if applicable.
  • Revision history, including additions of new data and error corrections.
  • Links to source data, if the data in one dataset were derived from data in another dataset.
  • List of project support (e.g., funding agencies, collaborators, material support).
  • Describe how to properly cite the dataset. The data citation should include:
    • All contributors
    • date of dataset publication
    • Title of dataset
    • media or URL
    • Data publisher
    • Identifier (Digital Object Identifier)
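The citation elements above can be assembled mechanically; a sketch in which every value is a hypothetical placeholder, not a real dataset:

```python
# All values below are hypothetical placeholders.
citation = {
    "contributors": "Smith, J., and R. Jones",
    "year": "2011",
    "title": "Soil temperature at example site",
    "publisher": "Example Data Center",
    "identifier": "doi:10.0000/example",
}

def format_citation(c):
    """Assemble the citation elements listed above into one string."""
    return (f"{c['contributors']} ({c['year']}). {c['title']}. "
            f"{c['publisher']}. {c['identifier']}")
```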

Provide Capabilities for Tagging and Annotation of Your Data by the Community

People have different perspectives on what data means to them, and how it can be used and interpreted in different contexts. Data users ranging from community participants to researchers in different domains can provide unique and valuable insights into data through the use of annotation and tagging. The community-generated notes and tags should be discoverable through the data search engine to enhance discovery and use.

When providing capabilities for community tagging and annotations, you should consider the following:

  • Differentiate between the metadata developed by the creator and additional tags or annotations to the data or metadata
  • Allow for community tags and annotations to be indexed as part of the terms or text that is indexed in a search
  • Provide easy-to-understand examples of the kinds of tagging or annotation that will promote the discovery of your data
  • Consider whether or not a review process for community tagging is needed
  • Consider whether controlled vocabularies will be used for tags
  • Provide clear guidelines for the addition of tags and construction of annotations
  • Make tags accessible via an application programming interface (API)

Provide Identifier for Dataset Used

In order to ensure replicable data access:

  • Choose a broadly utilized data identification standard based on your user community's practices or preferences, such as:
    • DOI (https://www.doi.org/) – the Digital Object Identifier system, governed by the not-for-profit DOI Foundation, which is the registration authority for the ISO standard (ISO 26324) for the DOI system
    • OIDs
    • ARKs
    • LSIDs
    • XRIs
    • URNs/URIs/URLs
    • UUIDs
  • Consistently apply the standard
  • Maintain the linkage
  • Participate in implementing infrastructure for consistent access to the resources referenced by the identifier

Separate Data Values From Annotations

A separate column should be used for data qualifiers, descriptions, and flags; otherwise, problems may develop during analyses. Potential entries in the descriptor column include:

  • Potential sources of error
  • Missing value justification (e.g., sensor offline, human error, data rejected as outside of range, data not recorded)
  • Flags for values outside of the expected range, questionable values, etc.
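Keeping values and qualifiers in separate fields makes the data trivially filterable; a sketch with hypothetical flag codes:

```python
# Value and qualifier kept in separate fields; the flag codes are hypothetical.
readings = [
    {"temp_c": 12.5, "flag": ""},
    {"temp_c": -9999.0, "flag": "M1"},  # M1 = sensor offline (missing value)
    {"temp_c": 41.2, "flag": "Q1"},     # Q1 = above expected range (questionable)
]

usable = [r["temp_c"] for r in readings if r["flag"] == ""]  # unflagged values only
```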

Sharing Data: Legal and Policy Considerations

All research requires the sharing of information and data. The general philosophy is that data are freely and openly shared. However, funding organizations and institutions may require that their investigators cite the impact of their work, including shared data. By creating a usage rights statement and including it in data documentation, users of your data will be clear what the conditions of use are, and how to acknowledge the data source.

Include a statement describing the “usage rights” management, or reference a service that provides the information. Rights information encompasses Intellectual Property Rights (IPR), copyright, cost, or various Property Rights. For data, rights might include requirements for use, requirements for attribution, or other requirements the owner would like to impose. If there are no requirements for re-use, this should be stated.

Usage rights statements should include what are appropriate data uses, how to contact the data creators, and acknowledge the data source. Researchers should be aware of legal and policy considerations that affect the use and reuse of their data. It is important to provide the most comprehensive access possible with the fewest barriers or restrictions.

There are three primary areas that need to be addressed when producing sharable data:

  1. Privacy and confidentiality: Adhere to your institution’s policy
  2. Copyright and intellectual property (IP): Data is not copyrightable. Ensure that you have the appropriate permissions when using data that has multiple owners or copyright layers. Keep in mind that information documenting the context of data collection may be under copyright.
  3. Licensing: Data can be licensed. The manner in which you license your data can determine its ability to be consumed by other scholars. For example the Creative Commons Zero License provides for very broad access.

If your data falls under any of the categories below there are additional considerations regarding sharing:

  • Rare, threatened or endangered species
  • Cultural items returned to their country of origin
  • Native American and Native Hawaiian human remains and objects
  • Any research involving human subjects

If you use data from other sources, you should review your rights to use the data and be sure you have the appropriate licenses and permissions.

Use Appropriate Field Delimiters

Delimit the columns within a data table using commas or tabs (listed in order of preference). Semicolons are used in many systems as line-end delimiters and may cause problems if data are imported into those systems (e.g., SAS, PHP scripts). Avoid delimiters that also occur in the data fields; if this cannot be avoided, enclose data fields that contain a delimiter in single or double quotes.

An example of a consistently delimited data file with a header row:

Date,Avg_Temperature,Precipitation
01Jan2010,32.3,0.0
02Jan2010,34.1,0.5
03Jan2010,31.4,2.5
04Jan2010,33.2,0.0
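When a field itself contains the delimiter, the csv module applies the quoting described above automatically; a sketch:

```python
import csv
import io

# Write a row whose site name contains the comma delimiter; csv quotes it for us.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Date", "Site", "Precipitation"])
writer.writerow(["01Jan2010", "Baltimore, MD", 0.5])

# Reading the text back recovers the field intact, quotes removed.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
```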

Use Consistent Codes

Be consistent in the use of codes to indicate categorical variables, for example species names, sites, or land cover types. Codes should always be the same within one data set. Pay particular attention to spelling and case; most frequent problems are with abbreviations for species names and sites.

Consistent codes can be achieved most easily by defining standard categorical variables (codes) and using drop-down lists (e.g., in Excel or a database). Frequently a code is needed for 'none of the above', 'unknown', or 'other' to avoid imprecise code assignment.

©2018 Morgan State University | 1700 East Cold Spring Lane Baltimore, Maryland 21251 | 443-885-3333