
Data Management and Project Planning

A guide to data management and planning resources and practices

About

Several frameworks exist for organizing and planning a data project. These frameworks help researchers understand what decisions they must make at each phase of a project and which tools are available for implementation. Most of these frameworks were developed in industry or discipline-specific venues, but they generally share similar components:

  • Formulating a research question, study objective or business problem
  • Collecting, exploring and modifying data
  • Analyzing data
  • Interpreting and communicating results

Some very popular frameworks include CRISP-DM, SEMMA, and the DataONE Data Lifecycle. The DataONE lifecycle is the most comprehensive and has a dedicated page on this guide.

For more information about these frameworks, see the sections below:

  • CRISP-DM (Cross-Industry Standard Process for Data Mining)
  • SEMMA (Sample, Explore, Modify, Model, Assess)
  • TDSP (Team Data Science Process)

CRISP-DM

Source: Wirth and Hipp 2000. https://www.semanticscholar.org/paper/CRISP-DM%3A-Towards-a-Standard-Process-Model-for-Data-Wirth-Hipp/48b9293cfd4297f855867ca278f7069abc6a9c24

 

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It's a widely used methodology for data mining projects, designed to guide data analysts through the process of creating, deploying, and maintaining data mining applications. 

The CRISP-DM framework is structured into six phases:

Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan. CRISP-DM is oriented toward business applications and data mining, but it can be adapted as a generic management framework for projects outside of these venues. For example, the Business Understanding component can be recast as a study question or research objective, and "data mining" can be interpreted more generally as analysis.

Data Understanding: This phase starts with data collection and continues with exploratory activities to gain an understanding of the data at the table and variable level (summary statistics), to identify data quality issues, to discover initial patterns in the data, or to detect interesting subsets that suggest hypotheses about unseen patterns and information.
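As a sketch of this phase, the pandas library (assumed here, applied to a small hypothetical table) can produce the table-level summary statistics, missing-value counts, and variable distributions described above:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset standing in for a project's raw data.
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 38],
    "income": [48000, 54000, 61000, 58000, np.nan, 52000],
    "segment": ["A", "B", "A", "B", "A", "A"],
})

# Table-level summary statistics (count, mean, spread, quartiles).
summary = df.describe()

# Data-quality check: how many missing values per variable?
missing = df.isna().sum()

# Initial pattern discovery: distribution of a categorical variable.
segment_counts = df["segment"].value_counts()

print(summary)
print(missing)
print(segment_counts)
```

The same calls scale to real project tables; the point is that each exploratory question in this phase maps to a one-line summary operation.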

Data Preparation: The data preparation phase covers all activities needed to construct the final dataset from the initial raw data and prepare it for analysis. This may include table, record, and attribute selection as well as data cleaning and construction of new attributes.
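A minimal illustration of data preparation with pandas (hypothetical columns and values): cleaning fills missing entries, and a new attribute is constructed from existing ones.

```python
import pandas as pd
import numpy as np

# Hypothetical raw table with quality issues (missing values).
raw = pd.DataFrame({
    "height_cm": [170, 165, np.nan, 180],
    "weight_kg": [70, 60, 75, np.nan],
})

# Data cleaning: impute missing values with each column's median.
clean = raw.fillna(raw.median(numeric_only=True))

# Construct a new attribute (derived feature) from existing ones.
clean["bmi"] = clean["weight_kg"] / (clean["height_cm"] / 100) ** 2

print(clean)
```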

Modeling (or Analysis): In this phase, various modeling techniques are selected and applied according to the research objective, and model or inferential parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type, and some techniques have specific requirements that vary with the form and type of the data.
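One way to sketch this phase, assuming scikit-learn and synthetic data, is to fit two candidate techniques to the same classification problem, mirroring the point that several techniques usually apply to one problem type:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a prepared analysis dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Two candidate techniques for the same classification problem.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)  # calibrate the model's parameters to the data
    print(name, model.score(X, y))
```

Which candidate to keep is a question for the Evaluation phase, not this one; here the goal is only to produce calibrated models.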

Evaluation: After modeling, this phase evaluates the model or models to ensure they meet the project objectives identified in the research question or business problem. This can include assessing model metrics such as accuracy, precision, recall, and specificity in machine learning applications.
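These metrics can all be computed directly from a confusion matrix; a plain-Python sketch with hypothetical true labels and predictions:

```python
# Hypothetical binary labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that were found
specificity = tn / (tn + fp)         # share of actual negatives that were found

print(accuracy, precision, recall, specificity)
```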

Deployment: The final phase involves deploying the data mining solution to the business. This could be as simple as generating a report or as complex as implementing a repeatable data mining or machine learning process across the organization. 

CRISP-DM is popular because it is generic and not tailored to any specific industry or type of data. It is also flexible and adaptable to the needs of particular data mining projects.

The phases of CRISP-DM should not be interpreted literally as discrete, linearly ordered components. Certainly, exploratory analysis should precede formal analysis, but there is significant overlap between, for example, the data understanding and data preparation phases.

SEMMA

Source: Keith Holdaway. https://www.researchgate.net/figure/SEMMA-process-to-generate-knowledge-from-raw-data_fig1_254536111

 

SEMMA stands for Sample, Explore, Modify, Model, and Assess, and is a data mining methodology developed by the SAS Institute. It provides a structured approach to guide the data mining process.

Sample: Select a representative sample of data from a larger dataset.  Sampling techniques might include random sampling, stratified sampling, or more complex methods depending on the data's nature and the analysis goals. The key is to create a manageable dataset that still accurately reflects the larger data pool.
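A brief sketch of random versus stratified sampling with pandas (hypothetical population data with an imbalanced label): stratified sampling preserves each group's proportion in the sample.

```python
import pandas as pd

# Hypothetical population with an imbalanced class label.
population = pd.DataFrame({
    "id": range(100),
    "label": ["rare"] * 20 + ["common"] * 80,
})

# Simple random sample of 10% of the rows.
random_sample = population.sample(frac=0.10, random_state=0)

# Stratified sample: 10% drawn from each label group, so the
# 20/80 proportion of the population is preserved in the sample.
stratified = population.groupby("label", group_keys=False).sample(
    frac=0.10, random_state=0
)

print(stratified["label"].value_counts())
```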

Explore: Examine the data to discover patterns, trends, anomalies, and relationships that could be useful in the modeling stage.  This includes visualizing data through charts and graphs, conducting descriptive statistics, and using exploratory data analysis (EDA) techniques. 

Modify: Prepare the data for modeling. This involves transforming and cleaning the data to improve model accuracy and effectiveness.  Typical activities include handling missing values, creating or transforming features, normalizing or scaling data, and selecting or constructing variables that are likely to be good predictors.

Model: Apply statistical or machine learning techniques to create models that predict or classify based on inputs, including choosing the appropriate modeling techniques (like regression, decision trees, neural networks), applying these models to the data, and modifying parameters to optimize performance. Model selection often depends on the problem type, data characteristics, and desired outcome.

Assess:  Evaluate the performance of the models to determine their accuracy, effectiveness, and suitability for deployment.  This includes using statistical tests and validation techniques like cross-validation or confusion matrices to assess the model's predictive power and generalizability, and ensure that the model performs well not just on the training data but also on new, unseen data.
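The Assess step can be illustrated with k-fold cross-validation, assuming scikit-learn and synthetic data: each fold scores the model on data it never trained on, which probes generalizability rather than training-set fit.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a modeling dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

model = DecisionTreeClassifier(max_depth=4, random_state=1)

# Five held-out folds; each score is accuracy on unseen data.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```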

Team Data Science Process

Source : Microsoft https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview

The TDSP is an agile, iterative data science methodology developed by Microsoft for efficiently developing and implementing predictive analytics solutions and AI applications. It enhances team collaboration and learning by recommending how team roles can best work together, and it incorporates best practices and frameworks from Microsoft and other industry leaders to help teams effectively implement data science initiatives.

See more about the process on the Microsoft documentation page: https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview

©2018 Morgan State University | 1700 East Cold Spring Lane Baltimore, Maryland 21251 | 443-885-3333