LibGuides: Data Management and Project Planning: Integrate

About

Data from multiple sources are combined into a form that can be readily analyzed. For example, you could combine citizen science project data with other sources of data to enable new analyses and investigations. Successful data integration depends on documentation of the integration process, clearly citing and making accessible the data you are using, and employing good data management practices throughout the Data Life Cycle.

Document Steps Used in Data Processing

Several types of new data may be created in the course of a project, such as visualizations, plots, statistical outputs, or a new dataset created by integrating multiple datasets. Whenever possible, document your workflow (the process used to clean, analyze, and visualize data), noting which data products are created at each step. Depending on the nature of the project, this documentation might be a computer script, or it might be notes in a text file describing the process you used (i.e., process metadata). If workflows are preserved along with the data products, they can be re-executed to reproduce those products.
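As an illustration of capturing a workflow in executable form, the following minimal Python sketch records each processing step and the data product it creates. The helper `run_step`, the step names, and the sample readings are hypothetical and exist only for this example; a real project would substitute its own cleaning and analysis functions.

```python
import json
from datetime import datetime, timezone

workflow_log = []  # one entry per processing step, in execution order


def run_step(name, func, inputs, output_name):
    """Run one step and record its inputs and the product it creates."""
    result = func(*inputs.values())
    workflow_log.append({
        "step": name,
        "inputs": list(inputs.keys()),
        "output": output_name,
        "finished_utc": datetime.now(timezone.utc).isoformat(),
    })
    return result


# Hypothetical raw readings; None marks a missing value.
raw = [12.1, None, 11.8, 12.4]

# Step 1: clean (drop missing values).
cleaned = run_step(
    "clean",
    lambda xs: [x for x in xs if x is not None],
    {"raw_readings": raw},
    "cleaned_readings",
)

# Step 2: summarize (compute the mean of the cleaned values).
mean = run_step(
    "summarize",
    lambda xs: sum(xs) / len(xs),
    {"cleaned_readings": cleaned},
    "mean_reading",
)

# Process metadata that can be saved alongside the data products.
process_metadata = json.dumps(workflow_log, indent=2)
```

Saving `process_metadata` to a text file next to the outputs gives a reader both the script and a record of which product came from which step.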

Document the Integration of Multiple Datasets

Document the steps used to integrate disparate datasets.

- Ideally, adopt mechanisms that systematically capture the integration process in an executable form, such as a script or workflow, so that it can be reproduced.
- If you are not using a scientific workflow system, record the process, scripts, or queries used to integrate the data in the documentation (metadata) that will accompany the data.
- Provide a conceptual model that describes the relationships among datasets from different sources.
- Use unique identifiers in the data records to maintain data integrity by reducing duplication.
- Identify the foreign key fields in the data records that support the relationships between the data sources.
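The role of unique identifiers and foreign keys can be sketched with Python's built-in `sqlite3` module. The table names, column names, and records below are hypothetical; the point is only that a primary key blocks duplicate records, a foreign key blocks records that point to a nonexistent source, and the join query itself documents how the datasets were integrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE site (
    site_id   TEXT PRIMARY KEY,          -- unique identifier for each site
    site_name TEXT NOT NULL
);
CREATE TABLE observation (
    obs_id      INTEGER PRIMARY KEY,     -- unique identifier for each record
    site_id     TEXT NOT NULL REFERENCES site(site_id),  -- foreign key
    species     TEXT NOT NULL,
    individuals INTEGER NOT NULL
);
""")

# Hypothetical records from two sources.
conn.executemany("INSERT INTO site VALUES (?, ?)",
                 [("S1", "North Pond"), ("S2", "River Bend")])
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)",
                 [(1, "S1", "mallard", 4), (2, "S2", "heron", 1)])

# A duplicate site identifier is rejected by the primary key.
try:
    conn.execute("INSERT INTO site VALUES ('S1', 'Duplicate Pond')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# An observation pointing at an unknown site is rejected by the foreign key.
try:
    conn.execute("INSERT INTO observation VALUES (3, 'S9', 'egret', 2)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The integration step: join the two sources on the shared key.
integrated = conn.execute("""
    SELECT o.obs_id, s.site_name, o.species, o.individuals
    FROM observation AS o
    JOIN site AS s ON s.site_id = o.site_id
    ORDER BY o.obs_id
""").fetchall()
```

Keeping the `CREATE TABLE` statements and the join query with the data serves as both the conceptual model and the record of the integration process.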
When you use datasets and data elements from within those datasets as a source for new datasets, it is important to identify and document those data within the documentation of the new/derived dataset. This is known as dataset provenance; provenance describes the origin or source of something. Just as you would cite papers that are sources for your research paper, it is critical to identify the sources of the data used within your own datasets. This will allow for:
- tracing the chain of use of datasets and data elements
- credit and attribution to accrue to the creators of the original datasets
- tracing any impact on your new datasets and their interpretation if errors or new information about the original datasets or data elements come to light
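One possible way to record provenance for a derived dataset is a small machine-readable file stored with the data. The sketch below is an assumption about what such a record might contain, not a standard format: the field names, dataset titles, and identifiers (`example-id-001`, `example-id-002`) are hypothetical placeholders. A checksum of each source file is included so that a later change to a source can be detected.

```python
import hashlib
import json


def describe_source(title, creator, identifier, content):
    """Describe one source dataset, with a checksum of its exact contents."""
    return {
        "title": title,
        "creator": creator,
        "identifier": identifier,
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Hypothetical source files, shown inline as bytes for the example.
counts_csv = b"site_id,species,individuals\nS1,mallard,4\n"
sites_csv = b"site_id,site_name\nS1,North Pond\n"

sources = [
    describe_source("Volunteer bird counts (hypothetical)",
                    "Example Citizen Science Project",
                    "example-id-001", counts_csv),
    describe_source("Site descriptions (hypothetical)",
                    "Example Agency",
                    "example-id-002", sites_csv),
]

provenance = {
    "derived_dataset": "bird_counts_by_site (hypothetical)",
    "derived_from": sources,
    # Which data elements were taken from each source.
    "elements_used": {
        "example-id-001": ["site_id", "species", "individuals"],
        "example-id-002": ["site_id", "site_name"],
    },
    "method": "Joined the two sources on site_id",
}

provenance_json = json.dumps(provenance, indent=2)
```

If an error is later reported in one source, the identifiers and the list of elements used show which derived datasets need to be rechecked.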

Provide Budget Information for Your Data Management Plan

As a best practice, one must first acknowledge that the process of managing data will incur costs. Researchers should plan to address these costs and the allocation of resources in the early planning phases of the project. This best practice focuses on data management costs during the life cycle of the project, and does not aim to address costs of data beyond the end of the project.

Budgeting and costing for your project depend on institutional resources, services, and policies. We recommend that you check with your sponsored project office, your office of research, tech transfer resources, and other appropriate entities at your institution to understand the resources available to you.

There are a variety of approaches to budgeting for data management costs. All approaches should address the following costs in each phase:

- short-term costs
- long-term costs
- internal/external costs
- equipment/services costs (e.g., compute cycles, storage, software, and hardware)
- overhead costs
- time costs
- human resource costs

Methods for Managing Costs

In-sourced costs: items that are managed directly within the research group.
Out-sourced costs: items that are contracted or managed outside of the research group.

Phases of the Data Life Cycle (see Primer on Data Management on the DataONE website for a description of the life cycle)

Collect: Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community storage resources to ensure an appropriate data storage environment and to identify the associated costs during this phase or throughout the life of the project.
Assure: Likely in-sourced costs. This phase is primarily focused on quality assurance/control, and costs will primarily be incurred around time and personnel.
Describe: Likely in-sourced costs. This phase includes initial and ongoing documentation as well as continuous development of metadata. Documentation captures the entire structure of the project, all configurations/parameters, as well as all processes during the course of the entire project. See the Documentation and Metadata best practices for more detail on what should be addressed.
Deposit: Likely both in-sourced and out-sourced costs.
Preserve: Likely both in-sourced and out-sourced costs. Coordinate with central IT services or community repository environments that are equipped to provide preservation services. This phase will be tied closely to the costs of the collection phase.
Discover: Likely in-sourced costs. Coordinate with librarians, IT service providers, or repository providers to identify and access data sources.
Integrate: Likely in-sourced costs. Coordinate with IT service providers or other service groups to merge and prepare data sources for the analysis phase.
Analyze: Likely in-sourced costs. Coordinate with central IT services or other workspace providers to connect data sources with appropriate analysis and visualization software.
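Because every approach should address each cost category in each phase, a draft budget can be checked for gaps. The following sketch is one hypothetical way to do so; the structure of the `draft` budget and the function `missing_categories` are assumptions made for illustration, and the amounts are zero-valued placeholders rather than real figures.

```python
COST_CATEGORIES = [
    "short-term", "long-term", "internal/external",
    "equipment/services", "overhead", "time", "human resources",
]
PHASES = [
    "Collect", "Assure", "Describe", "Deposit",
    "Preserve", "Discover", "Integrate", "Analyze",
]


def missing_categories(budget):
    """Return {phase: [cost categories not yet addressed]} for a draft budget."""
    gaps = {}
    for phase in PHASES:
        addressed = budget.get(phase, {})
        missing = [c for c in COST_CATEGORIES if c not in addressed]
        if missing:
            gaps[phase] = missing
    return gaps


# Hypothetical draft: Collect is fully addressed, Integrate only partly,
# and the remaining phases have not been started.
draft = {
    "Collect": {category: 0.0 for category in COST_CATEGORIES},
    "Integrate": {"time": 0.0, "human resources": 0.0},
}

gaps = missing_categories(draft)
```

Here `gaps` lists, for each phase, the categories that still need an estimate before the budget is complete.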

Understand the Geospatial Parameters of Multiple Data Sources

Understand the input geospatial data parameters, including scale, map projection, geographic datum, and resolution, when integrating data from multiple sources. Take care to ensure that the geospatial parameters of the source datasets can be legitimately combined. If working with raster data, consider the data type of the raster cell values and whether the raster data represent discrete or continuous values. If working with vector data, consider feature representation (e.g., points, polygons, lines). It may be necessary to re-project your source data into one common projection appropriate to your intended analysis. Combining geospatial data that have incompatible geospatial parameters can degrade the quality of the data product or reduce its utility. Spatial analysis of a dataset created by combining data with considerably different scales or map projections may produce erroneous results.
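A simple pre-integration check can compare the declared parameters of two sources before they are combined. The sketch below compares metadata values only and does not perform any re-projection, which would require a GIS library. The class `GeoParams`, the function `compare_params`, the dataset values, and the `max_scale_ratio` threshold of 2.0 are all assumptions chosen for illustration; choose a threshold suited to your own analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoParams:
    """Declared geospatial parameters of one source dataset."""
    name: str
    projection: str
    datum: str
    scale_denominator: int   # e.g. 24000 for a 1:24,000 source
    resolution_m: float      # cell size or nominal resolution in meters


def compare_params(a, b, max_scale_ratio=2.0):
    """Return a list of human-readable incompatibilities between a and b."""
    issues = []
    if a.projection != b.projection:
        issues.append(
            f"projection differs: {a.projection} vs {b.projection}")
    if a.datum != b.datum:
        issues.append(f"datum differs: {a.datum} vs {b.datum}")
    ratio = (max(a.scale_denominator, b.scale_denominator)
             / min(a.scale_denominator, b.scale_denominator))
    if ratio > max_scale_ratio:  # assumed threshold, not a standard
        issues.append(
            f"scale differs by a factor of {ratio:.1f}: "
            f"1:{a.scale_denominator} vs 1:{b.scale_denominator}")
    if a.resolution_m != b.resolution_m:
        issues.append(
            f"resolution differs: {a.resolution_m} m vs {b.resolution_m} m")
    return issues


# Hypothetical source datasets.
elevation = GeoParams("elevation", "UTM zone 12N", "NAD83", 24000, 30.0)
land_cover = GeoParams("land_cover", "Albers Equal Area", "WGS84", 100000, 30.0)

issues = compare_params(elevation, land_cover)
```

An empty list means the declared parameters match; any reported issue should be resolved, for example by re-projecting to a common projection, and the resolution should be noted in the output dataset's metadata.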

Document the geospatial parameters of any output dataset derived from combining multiple data products. Include this information in the final data product’s metadata as part of the product’s provenance or origin.