LibGuides: Data Management and Project Planning: Plan

About

In the Plan phase of the life cycle you will map out the processes and resources for the entire data life cycle. Begin with outlining the project goals including desired outputs, outcomes, and impacts and work backwards to build a data management plan, supporting data policies, and sustainability plans.

The following points will guide development of a well-planned data project:

Backup Data Policy

A backup policy helps manage users’ expectations and provides specific guidance on the “who, what, when, and how” of the data backup and restore process. There are several benefits to documenting your data backup policy:

Helps clarify the policies, procedures, and responsibilities
Allows you to dictate:
- where backups are located
- who can access backups and how they can be contacted
- how often data should be backed up
- what kinds of backups are performed, and
- what hardware and software are recommended for performing backups
Identifies any other policies or procedures that may already exist (such as contingency plans) or which ones may supersede the policy
Has a well-defined schedule for performing backups
Identifies who is responsible for performing the backups and their contact information. This should include more than one person, in case the primary person responsible is unavailable
Identifies who is responsible for checking that the backups have been performed successfully, and how and when they will check
Ensures data can be completely restored
Has training for those responsible for performing the backups and for the users who may need to access the backups
Is partially, if not fully automated
Ensures that more than one copy of the backup exists and that it is not located in the same location as the originating data
Ensures that a variety of media are used to back up data, as each media type has its own inherent reliability issues
Ensures the structure of the data being backed up mirrors the originating data
Notes whether or not the data will be archived

Keeping this information in one place makes it easier for anyone who needs it to find it. In addition, once a backup policy is in place, anyone new to the project or office can be given the documentation for information and guidance.
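As an illustration of checking that backups "have been performed successfully" and that the backup structure mirrors the originating data, here is a minimal Python sketch. It assumes both the source and the backup are reachable as local directories; the function names `file_checksum` and `verify_backup` are hypothetical and not part of any particular backup tool.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> list[str]:
    """Compare every file under source_dir with its counterpart in backup_dir.

    Returns a list of problems. An empty list means every source file
    exists at the same relative path in the backup with identical content.
    """
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif file_checksum(source_file) != file_checksum(backup_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

A check like this could be run on the schedule the policy defines, with the resulting list sent to the person responsible for verifying backups. It confirms only that the copies match, so a periodic test restore is still needed to confirm that data can be completely restored.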

Define Expected Data Types and Outcomes

In the planning process, researchers should carefully consider what data will be produced in the course of their project.

Consider the following:

What types of data will be collected? For example: spatial, temporal, instrument-generated, models, simulations, images, video, etc.
How many data files of each type are likely to be generated during the project? What size will they be?
For each type of data file, what are the variables that are expected to be included?
What software programs will be used to generate the data?
How will the files be organized in a directory structure on a file system or in some other system?
Will metadata information be stored separately from the data during the project?
What is the relationship between the different types of data?
Which of the data products are of primary importance and should be preserved for the long-term, and which are intermediate working versions not of long-term interest?

When preparing a data management plan, defining the types of data that will be generated helps in planning for short-term organization, the analyses to be conducted, and long-term data storage.
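One way to record the answers to these questions is a small inventory of expected data types, from which a rough storage estimate follows. The sketch below is only an illustration: the class `ExpectedDataType`, its fields, and the function `storage_estimate_mb` are hypothetical names, and any counts and sizes entered would be the researcher's own estimates.

```python
from dataclasses import dataclass


@dataclass
class ExpectedDataType:
    name: str           # e.g. "images" or "sensor readings"
    file_count: int     # expected number of files over the whole project
    avg_size_mb: float  # expected average file size in megabytes
    preserve: bool      # True if the product should be kept long-term


def storage_estimate_mb(inventory, preserved_only=False):
    """Sum file_count * avg_size_mb over the inventory.

    With preserved_only=True, count only products marked for long-term
    preservation, which approximates the archive size.
    """
    return sum(
        item.file_count * item.avg_size_mb
        for item in inventory
        if item.preserve or not preserved_only
    )
```

Calling `storage_estimate_mb(inventory)` gives a working-storage figure for the active project, while `storage_estimate_mb(inventory, preserved_only=True)` gives a figure for long-term storage.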

Define Roles and Responsibilities for Data Management

In addition to the primary researcher(s), others involved in the research process might take part in aspects of data management. When the roles and responsibilities of the parties involved are clearly defined, data are more likely to be available for use by the primary researchers and anyone re-using the data. Roles and responsibilities should be clearly defined, rather than assumed; this is especially important for collaborative projects that involve many researchers, institutions, and/or groups.

Examples of roles in data management:

data collector
metadata generator
data analyzer
project director
data model and/or database designer
computing staff responsible for backup and/or storage
staff responsible for running instruments
administrative support staff responsible for grant submission
staff with specialized skills as defined in the plan (GIS, relational database design/implementation, computer programming of sensors/input forms, etc.)
external data center or archive

Steps for assigning data management responsibilities:

For each task identified in your data management plan, identify the skills needed to perform the task
Match skills needed to available staff and identify gaps
Develop training/hiring plan
Develop staffing/training budget and incorporate into project budget
Assign responsible parties and monitor results
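The first two steps, identifying the skills each task needs and matching them to available staff, can be sketched as a simple set comparison. This is a minimal illustration under assumed inputs: the function name `match_tasks_to_staff` and the dictionary layout are hypothetical.

```python
def match_tasks_to_staff(task_skills, staff_skills):
    """Match each task to qualified staff and report missing skills.

    task_skills:  dict mapping task name -> set of skills the task needs
    staff_skills: dict mapping person    -> set of skills the person has

    For each task, "qualified" lists people who hold every needed skill,
    and "missing_skills" lists needed skills that nobody on staff holds
    (a gap to address in the training/hiring plan).
    """
    covered = set().union(*staff_skills.values())
    plan = {}
    for task, needed in task_skills.items():
        qualified = sorted(
            person for person, skills in staff_skills.items() if needed <= skills
        )
        plan[task] = {
            "qualified": qualified,
            "missing_skills": sorted(needed - covered),
        }
    return plan
```

A task with an empty "qualified" list but no missing skills can still be covered by a team whose members jointly hold the skills; a task with missing skills points to a training or hiring need.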

Define Data Model

A data model documents and organizes data, how it is stored and accessed, and the relationships among different types of data. The model may be abstract or concrete.

Use these guidelines to create a data model:

Identify the different data components; consider raw and processed data, as well as associated metadata (these are called entities)
Identify the relationships between the different data components (these are called associations)
Identify anticipated uses of the data (these are called requirements), with recognition that data may be most valuable in the future for unanticipated uses
Identify the strengths and constraints of the technology (hardware and software) that you plan to use during your project (this is called a technology assessment phase)
Build a draft model of the entities and their relations, attempting to keep the model independent from any specific uses or technology constraints.
Incorporate intended usage and technology constraints as needed to derive the simplest, most general model possible
Test the model with different scenarios, including best- and worst-case (worst-case includes problems such as invalid raw data, user mistakes, failing algorithms, etc.)
Repeat these steps to optimize the model
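A draft model of entities and associations can be written down in a few lines before any storage technology is chosen. The sketch below is one possible representation, not a prescribed one: the class names `Entity`, `Association`, and `DataModel`, and the three entity kinds, are assumptions made for illustration.

```python
from dataclasses import dataclass, field

ENTITY_KINDS = {"raw", "processed", "metadata"}  # assumed categories


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # one of ENTITY_KINDS


@dataclass(frozen=True)
class Association:
    source: str    # name of the entity the relation starts from
    target: str    # name of the entity the relation points to
    relation: str  # e.g. "derived_from" or "describes"


@dataclass
class DataModel:
    entities: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_entity(self, name, kind):
        if kind not in ENTITY_KINDS:
            raise ValueError(f"unknown entity kind: {kind}")
        self.entities[name] = Entity(name, kind)

    def associate(self, source, target, relation):
        # Reject associations that refer to entities not yet in the model.
        for name in (source, target):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name}")
        self.associations.append(Association(source, target, relation))

    def unassociated(self):
        """Return names of entities that take part in no association."""
        linked = {a.source for a in self.associations}
        linked |= {a.target for a in self.associations}
        return sorted(set(self.entities) - linked)
```

Listing unassociated entities is a quick test scenario: a metadata entity that describes nothing, or a processed product with no recorded source, usually signals a missing relationship in the draft.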

Identify Data Sensitivity

To identify the sensitivity of data and determine the appropriate security or privacy level:

Determine if the data has any confidentiality concerns
- Can an unauthorized individual use the information to do limited, serious, or severe harm to individuals, assets or an organization’s operations as a result of data disclosure?
- Would unauthorized disclosure or dissemination of elements of the data violate laws, executive orders, or agency regulations (e.g., HIPAA or privacy laws)?
Does the data have any integrity concerns?
- What would be the impact of unauthorized modification or destruction of the data?
- Would it reduce public confidence in the originating organization?
- Would it create confusion or controversy in the user community?
- Would a potentially life-threatening decision be made based on the data or analysis of the data?
Are there any availability concerns about the data?
- Is the information time-critical? Will another individual or system be relying on the data to make a time-sensitive decision (e.g., sensing data for earthquakes, floods, etc.)?
Document data concerns identified and determine overall sensitivity (Low, Moderate, High)
- Low criticality would result in a limited adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean degradation in mission capability or result in minor harm to individuals.
- Moderate criticality would result in a serious adverse effect to an organization as a result of the loss of confidentiality, integrity, or availability of the data. It might mean a severe degradation or loss of mission capability or result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.
- High criticality would result in a severe or catastrophic adverse effect as a result of the loss of confidentiality, integrity, or availability of the data. It might cause a severe degradation in or loss of mission capability or result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.
Develop data access and dissemination policies and procedures based on sensitivity of the data and need-to-know.
Develop data protection policies, procedures and mechanisms based on sensitivity of the data.
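The guide does not say how to combine the confidentiality, integrity, and availability ratings into one overall sensitivity. A common convention, assumed here rather than taken from this guide, is the "high-water mark": the overall level is the highest of the three. A minimal sketch, with the hypothetical names `Impact` and `overall_sensitivity`:

```python
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


def overall_sensitivity(confidentiality, integrity, availability):
    """Return the highest of the three ratings (high-water mark rule).

    This rule is an assumption of the sketch; an organization's own
    policy may define the overall level differently.
    """
    return max(confidentiality, integrity, availability)
```

For example, data rated Low for confidentiality, High for integrity, and Moderate for availability would be treated as High overall under this rule.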

Identify Suitable Repositories

Shaping the data management plan towards a specific desired repository will increase the likelihood that the data will be accepted into that repository and increase the discoverability of the data within the desired repository. When beginning a data management plan:

Look to the data management guidelines of the project/grant for a required repository
Ask colleagues what repositories are used in the community
Determine if your local institution has a repository that would be appropriate (and might be required) for depositing your data
Check the DataONE website for a list of potential repositories.

Plan for Effective Multimedia Management

Multimedia data present unique challenges for data discovery, accessibility, and metadata formatting and should be thoughtfully managed. Researchers should establish their own requirements for management of multimedia during and after a research project using the following guidelines. Multimedia data includes still images, moving images, and sound. The Library of Congress has a set of web pages discussing many of the issues to be considered when creating and working with multimedia data. Researchers should consider quality, functionality and formats for multimedia data. Transcriptions and captioning are particularly important for improving discovery and accessibility.

Storage of images solely on local hard drives or servers is not recommended. Unaltered images should be preserved at the highest resolution possible. Store original images in separate locations to limit the chance of overwriting and losing the original image.

Ensure that the policies of the multimedia repository are consistent with your general data management plan.

There are a number of options for metadata for multimedia data, with many MPEG standards (http://mpeg.chiariglione.org/), and other standards such as PBCore (http://pbcore.org).

Provide Budget Information for the Data Management Plan

As a best practice, one must first acknowledge that the process of managing data will incur costs. Researchers should plan to address these costs and the allocation of resources in the early planning phases of the project. This best practice focuses on data management costs during the life cycle of the project, and does not aim to address costs of data beyond the end of the project.

Budgeting and costing for your project depends on institutional resources, services, and policies. We recommend that you check with your sponsored project office, your office of research, tech transfer resources, and other appropriate entities at your institution to understand the resources available to you.

There are a variety of approaches to budgeting for data management costs. All approaches should address the following costs in each phase:

short-term costs
long-term costs
internal/external costs
equipment/services (e.g., compute cycles, storage, software, and hardware) costs
overhead costs
time costs
human resource costs

Methods for Managing Costs

In-sourced costs: items that are managed directly within the research group.
Out-sourced costs: items that are contracted or managed outside of the research group.

Phases of the Data Life Cycle (see Primer on Data Management on the DataONE website for a description of the life cycle)

Collect: Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community storage resources to ensure an appropriate data storage environment and to understand the associated costs during this phase or throughout the life of the project.
Assure: Likely in-sourced costs. This phase is primarily focused on quality assurance/control, and costs will primarily be incurred around time and personnel.
Describe: Likely in-sourced costs. This phase includes initial and ongoing documentation as well as continuous development of metadata. Documentation captures the entire structure of the project, all configurations/parameters, as well as all processes during the course of the entire project. See the Documentation and Metadata best practices for more detail on what should be addressed.
Deposit: Likely both in-sourced and out-sourced costs.
Preserve: Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community repository environments that are equipped to provide preservation services. This phase will be tied closely to the costs of the collection phase.
Discover: Likely in-sourced costs. Coordinate with librarians, IT service providers, or repository providers to identify and access data sources.
Integrate: Likely in-sourced costs. Coordinate with IT service providers or other service groups to merge and prepare data sources for the analysis phase.
Analyze: Likely in-sourced costs. Coordinate with central IT services or other workspace providers to connect data sources with appropriate analysis and visualization software.
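A budget organized by life cycle phase and by in-sourced versus out-sourced costs can be tallied with a short script. The sketch below is illustrative only: `CostItem` and `summarize_costs` are hypothetical names, and all amounts would come from your own institution's rates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CostItem:
    phase: str        # life cycle phase, e.g. "Collect" or "Preserve"
    description: str  # what the cost covers
    amount: float     # estimated cost in the project's currency
    outsourced: bool  # True if contracted or managed outside the group


def summarize_costs(items):
    """Total the in-sourced and out-sourced costs for each phase."""
    totals = defaultdict(lambda: {"in_sourced": 0.0, "out_sourced": 0.0})
    for item in items:
        key = "out_sourced" if item.outsourced else "in_sourced"
        totals[item.phase][key] += item.amount
    return dict(totals)
```

The per-phase totals can then be carried into the project budget alongside the staffing and training costs.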

Revisit Data Management Plan Throughout the Project Life Cycle

The plan will be created at the conceptual stage of the project. It should be considered a living document and a road map for the project, and should be closely followed. Any changes to the data management plan should be made deliberately, and the plan should be updated throughout the data life cycle.

Data management planning provides crucial guidance to all stages of the data life cycle. It provides continuity for operations within the research group. The data management plan will define roles for all project participants and workflows for data collection, quality assurance, description, and deposit for preservation and access. The data management plan is a tool to communicate requirements and restrictions to all members of the project team, including researchers, archivists, librarians, IT staff and repository managers. The plan governs the active research phase of the project life cycle and makes provisions for the hand-off to a repository for preservation and data delivery.

Funding agencies and institutions require data management plans for project funding and approval.