
Data Exploration

A guide to methods and resources about exploratory data analysis.

Data exploration, or Exploratory Data Analysis (EDA), is a fundamental component of every data project. In EDA you use a variety of techniques to understand what your dataset contains: what data types it holds, how the data are distributed, what data quality issues exist, and how the data should be modified to meet the needs of your project. This is especially useful when your dataset is large or complex and its patterns, relationships, and other characteristics are not immediately obvious.

EDA is one of the first steps after collecting the data for your project. However, its techniques should not be thought of as exclusive to this stage, because you will likely use both EDA and pre-processing techniques throughout a project. After pre-processing the data, for example, you will often perform a follow-up EDA to produce the statistics included in reports or published articles.

Table Characteristics

  • Visual inspection of data

It is always a good idea to look at your data before doing any kind of analysis. You can simply open the dataset in MS Excel or statistical software such as SPSS, or use a programmatic approach in R or Python to view selected rows and get an idea of what the dataset contains.
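
As a minimal sketch, assuming the data live in a hypothetical file named survey.csv, a first look in Python with pandas might be:

    import pandas as pd

    # "survey.csv" is a hypothetical file name used for illustration.
    df = pd.read_csv("survey.csv")

    # Look at the first rows and a random sample to get a feel for the data.
    print(df.head(10))    # first 10 rows
    print(df.sample(5))   # 5 randomly selected rows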

  • Data types

You will want to know how your data is encoded. Some analyses require specific formats such as numeric or categorical, and it is not always clear what kind of data you have just by looking at it. For example, you might have a column of numbers that is encoded as string or text, or a field of numbers that is actually categorical (e.g. "1" and "0" for "yes" and "no"). Knowing what data types you have will help you understand what kinds of analyses can be performed, and whether the data need to be recoded into another type.
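
A brief sketch of checking and converting types with pandas, again assuming a hypothetical survey.csv with hypothetical "age" and "smoker" columns:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Inspect how each column is currently encoded.
    print(df.dtypes)

    # A column of numbers stored as text can be converted to numeric...
    df["age"] = pd.to_numeric(df["age"], errors="coerce")

    # ...and a numeric yes/no flag can be recoded as categorical.
    df["smoker"] = df["smoker"].astype("category")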

  • Dimensionality

Dimensionality is the size of the dataset; this is especially relevant for tabular data with columns and rows. In general, you will want a dataset large enough to support the generalizations that you are trying to derive from it. There is no one-size-fits-all rule for how large a dataset should be; it will depend on the domain of your research and on decisions informed by a literature review of related research.
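
In pandas the dimensions are available directly, assuming the same hypothetical file:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # .shape returns (number of rows, number of columns).
    n_rows, n_cols = df.shape
    print(f"{n_rows} observations across {n_cols} variables")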

  • Correlation Analysis

Correlation analysis is used to evaluate the strength and direction of the linear relationship between two quantitative variables. This analysis is fundamental in data exploration and inferential statistics because it helps you understand how variables are related to each other. Variables that are strongly correlated can negatively influence certain analyses and result in multicollinearity or overfitting. In EDA, correlation analysis is useful for identifying strongly correlated variables and for planning mitigation strategies such as dimensionality reduction or regularization. See the illustration below in "Visualizations."
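
As a sketch, a pairwise correlation table and a scan for strongly related pairs might look like this in pandas; the 0.8 cut-off is an illustrative assumption, not a fixed rule:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Pairwise Pearson correlations between the numeric columns.
    corr = df.corr(numeric_only=True)
    print(corr)

    # Flag pairs of distinct variables with a strong linear relationship (|r| > 0.8).
    strong = corr.abs().stack()
    print(strong[(strong > 0.8) & (strong < 1.0)])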

Individual Variable and Observation Characteristics

Summary Statistics

Summary statistics provide a concise and informative overview of a dataset's principal characteristics. These statistics include measures of central tendency, dispersion, and shape, and are useful for understanding the distribution, variability, and overall profile of the data. These measures facilitate quick identification of patterns, outliers, and anomalies, and simplify data interpretation and comparison across variables, datasets, or groups. Summary statistics are foundational in hypothesis testing and statistical modeling, and provide some guidance in planning analyses.
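
In pandas, most of the measures described below are available in one call; a minimal sketch with the same hypothetical file:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Count, mean, standard deviation, min, quartiles, and max for numeric columns.
    print(df.describe())

    # Summaries for text columns instead: count, unique values, top value, and its frequency.
    print(df.describe(include="object"))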

VARIABILITY

  • Cardinality: The total number of values in a field. This is the number of observations in a dataset minus the number of null values (see below).
  • Uniqueness: The number of unique values in a field. If the number of unique values is much lower than the cardinality, this can have implications for the type of analysis you are doing; for some machine learning approaches, a low number of unique values is essential to implementing the algorithm effectively. Both measures are easy to compute programmatically (see the sketch after this list).
  • Randomness: Many analytic approaches assume that data are randomly distributed and not clustered or stratified. Some techniques for evaluating this include the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the chi-square test. Many software packages and programmatic approaches also have built-in features for randomizing data before analysis.
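
A short sketch of checking cardinality and uniqueness per field in pandas, with the usual hypothetical file:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Cardinality: number of non-null values in each field.
    print(df.count())

    # Uniqueness: number of distinct values in each field.
    print(df.nunique())

    # A field whose unique count is far below its cardinality is a candidate
    # for treatment as a categorical variable.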

CENTRAL TENDENCY

  • Mean: The average of all data points. It is calculated by summing all the values and dividing by the number of values.
  • Median: The middle value when the data points are arranged in ascending or descending order. If there is an even number of observations, the median is the average of the two middle numbers.
  • Mode: The most frequently occurring value(s) in the dataset. A dataset may have one mode (unimodal), more than one mode (bimodal or multimodal), or no mode at all.

DISPERSION

  • Range: The difference between the highest and lowest values in the dataset.
  • Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1) in the dataset. It shows the spread of the middle 50% of the data.
  • Variance: The average of the squared differences from the Mean. It gives a measure of how the data spreads around the mean.
  • Standard Deviation (SD): The square root of the variance. It provides a measure of the dispersion of the data points around the mean in the same units as the data.

SHAPE

  • Skewness: A measure of the asymmetry of the probability distribution. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values, while negative skewness indicates a tail extending toward more negative values.
  • Kurtosis: A measure of the "tailed-ness" of the probability distribution. High kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly sized deviations.

Evaluating Assumptions About Distribution

Many statistical tests and machine learning approaches assume that the data follow a specific distribution (for example, a normal distribution) or that relationships between variables take a particular form (linear, polynomial, etc.). Using visualizations in the EDA process can reveal these distributions and inform decisions about what types of analysis can be done.
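
A brief sketch of checking distribution shape in Python, assuming a hypothetical numeric "income" column in survey.csv; the Shapiro-Wilk test mentioned above is used here as one formal check against normality:

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    df = pd.read_csv("survey.csv")    # hypothetical file
    income = df["income"].dropna()    # hypothetical numeric column

    # Skewness and kurtosis summarize the shape of the distribution.
    print(f"skewness: {income.skew():.2f}, kurtosis: {income.kurtosis():.2f}")

    # The Shapiro-Wilk test compares the sample against a normal distribution.
    stat, p_value = stats.shapiro(income)
    print(f"Shapiro-Wilk p-value: {p_value:.3f}")

    # A histogram makes departures from normality easy to see.
    income.plot(kind="hist", bins=30, title="Distribution of income")
    plt.show()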

Data Quality

Evaluating the overall quality of the dataset is essential before attempting any analysis. Missing values, data entry errors, outliers, and the like will negatively influence the quality of your analysis. When using data prepared by someone else this can become challenging, because you may not know under what circumstances the data were created, how many people contributed to them, or how careful the data preparation was. For example, distinguishing a data entry error from a real observation can be very challenging if it falls within the typical range of values in the dataset.

  • Null values

Null values are a data quality issue: they are missing, errant, or non-existent values, and can be represented in data as blank spaces, NaN, or similar placeholders. Note that "0" is not a null but simply represents the quantity of zero. However, dummy values like "99999" or "0" are sometimes used when an observation is missing. This is a bad practice, and understanding your data is essential to identifying it.

Null values can skew the results of your analysis, for example by distorting summary statistics or confounding techniques like machine learning. The influence of null values must be mitigated before performing an analysis. The mitigation strategy depends on several factors:

  • Are all nulls limited to one column?
  • Are they distributed randomly throughout the dataset?
  • What is the total number of nulls?  Is it less than 5 percent?

In general, if the total number of nulls is less than five percent, then a technique like imputation can be used. If a single column has a significant number of nulls, then you might consider removing the column from the analysis.
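
A minimal sketch of counting nulls and applying two common mitigations in pandas; the file, the columns, and the choice of median imputation are assumptions for illustration:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Nulls per column, as counts and as a share of all observations.
    print(df.isna().sum())
    print(df.isna().mean().round(3))

    # If only a small share of values is missing, a simple imputation
    # such as filling with the column median is one option.
    df["income"] = df["income"].fillna(df["income"].median())

    # A column dominated by nulls might instead be dropped.
    df = df.drop(columns=["optional_comment"])   # hypothetical column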

  • Outliers

Many analyses assume that the data variables are distributed normally, i.e. follow a bell curve. Outliers can skew the distribution and give misleading or confounded results and outcome metrics. Some outliers might be data entry errors, while others may be real, though anomalous, observations.

Some approaches to outlier detection include the z-score, the interquartile range, and inspection with visualizations like box plots or violin charts. Mitigation strategies include removal, altering the scale with techniques like logarithmic transformation, replacement with imputation, limiting the range of values used in the analysis (Winsorization), or binning into categories.
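
A sketch of the z-score and IQR approaches in pandas, again assuming a hypothetical "income" column; the thresholds of 3 standard deviations and 1.5 x IQR are common conventions, not fixed rules:

    import pandas as pd

    df = pd.read_csv("survey.csv")    # hypothetical file
    income = df["income"].dropna()    # hypothetical column

    # Z-score method: flag values more than 3 standard deviations from the mean.
    z_scores = (income - income.mean()) / income.std()
    print(income[z_scores.abs() > 3])

    # IQR method: flag values beyond 1.5 * IQR outside the quartiles.
    q1, q3 = income.quantile(0.25), income.quantile(0.75)
    iqr = q3 - q1
    print(income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)])

    # Winsorization caps values at the chosen bounds instead of removing them.
    capped = income.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)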

  • Duplication

Duplication of observations will similarly skew the outcomes and evaluation of the analysis. Distinguishing between duplicate entries and separate observations that happen to have the same value(s) (coincidental or natural duplicates) can be very difficult, if not impossible, when the dataset is second-hand. Fields like a unique ID or timestamps may provide some hints about whether an observation is a duplicate or not. Patterns in duplication, such as every nth row or clustering of duplicates, may indicate a problem with data collection or processing and should be considered for mitigation.

Mitigation strategies include removal and sensitivity studies, in which the analysis is performed with and without the duplicates to understand how the outcomes change. When collecting data first-hand, documenting natural duplicates or assigning a unique ID will prevent later problems in interpretation.
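
A short sketch of finding duplicates and running a simple sensitivity check in pandas; the "respondent_id" and "income" columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Rows that are exact copies of an earlier row.
    print(df[df.duplicated()])

    # Duplicates judged on an identifier column alone.
    print(df[df.duplicated(subset=["respondent_id"])])

    # Sensitivity check: the same summary statistic with and without duplicates.
    print(df["income"].mean())
    print(df.drop_duplicates()["income"].mean())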

  • Errant White Spaces

If there are errant white spaces in individual values, most analytic and programmatic approaches will interpret the value as a separate value and increase the number of unique values. Mitigation strategies include removing the white spaces programmatically or manually editing the dataset in software like MS Excel.
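
For example, a sketch of stripping leading and trailing spaces from a hypothetical "gender" column in pandas:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # " Male" and "Male " would otherwise count as distinct values.
    df["gender"] = df["gender"].str.strip()
    print(df["gender"].value_counts())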

  • Encoding Issues

Character encoding issues in a dataset can lead to unusual characters or strings appearing, such as "&amp;", which is the HTML encoding for "&". This will change the uniqueness of the field. Similar problems can occur when a programmatic approach anticipates one encoding, such as UTF-8, and receives text in another, such as Latin-1.
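
A sketch of two simple mitigations in Python: reading the file with an explicit encoding and decoding HTML entities; the file name, the encoding, and the "employer" column are assumptions:

    import html
    import pandas as pd

    # Read with an explicit encoding rather than relying on the default.
    df = pd.read_csv("survey.csv", encoding="utf-8")

    # Decode HTML entities such as "&amp;" back to the characters they represent,
    # skipping missing values.
    df["employer"] = df["employer"].map(html.unescape, na_action="ignore")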

  • Confused/Inconsistent Data Types

Values like "1", "M", and "Male" may all mean the same thing but will most likely be interpreted as separate values in an analysis.
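
A sketch of collapsing these onto a single label in pandas, using a hypothetical "sex" column:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Map the different encodings of the same category onto one label.
    df["sex"] = df["sex"].replace({"1": "Male", "M": "Male"})
    print(df["sex"].value_counts())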

  • Scale and Unit 

Many techniques require variables to be scaled or normalized when different units of measurement are being used. For example, a dataset might contain "Age" in years and "Income" in dollars, so these would have to be rescaled in the pre-processing phase.
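
A minimal sketch of z-score standardization in pandas, which puts hypothetical "age" (years) and "income" (dollars) columns on a comparable, unitless scale:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Subtract the mean and divide by the standard deviation for each column.
    for col in ["age", "income"]:
        df[col + "_scaled"] = (df[col] - df[col].mean()) / df[col].std()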

  • Cross Field Validation Errors

Values in two different fields might be expected to be consistent with one another; "Year of Birth" and "Age," for example, should be naturally comparable. If there is some inconsistency, it may be because one field was created at a different time than the other, or for some other reason, and this would require a mitigation strategy.
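
A sketch of one such check in pandas; the collection year of 2024 and the column names are assumptions for illustration:

    import pandas as pd

    df = pd.read_csv("survey.csv")  # hypothetical file

    # "Age" should roughly equal the collection year minus "Year of Birth";
    # rows where the two disagree by more than a year need investigation.
    implied_age = 2024 - df["year_of_birth"]
    print(df[(df["age"] - implied_age).abs() > 1][["year_of_birth", "age"]])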

Visualizations

Data visualization is one of the principal components of a data project, usually coming after the analysis phase once you have drawn conclusions about your project. However, data visualization can be a very useful part of data exploration as well, especially for understanding how data are distributed. Shown below are some of the more useful visualization techniques for an EDA.


Scatterplot with Trend Line

Scatterplots are useful for evaluating the relationship between dependent and independent variables, or, in the context of machine learning, between predictor and target variables.
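
A sketch of a scatterplot with a least-squares trend line in matplotlib, assuming hypothetical "age" and "income" columns in survey.csv:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file and columns; drop rows missing either value.
    df = pd.read_csv("survey.csv").dropna(subset=["age", "income"])
    x, y = df["age"], df["income"]

    # Scatter the raw points, then overlay a fitted straight line.
    plt.scatter(x, y, alpha=0.5)
    slope, intercept = np.polyfit(x, y, 1)
    plt.plot(x, slope * x + intercept, color="red")
    plt.xlabel("Age")
    plt.ylabel("Income")
    plt.show()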

Bar Charts

Bar charts are a useful way to visualize the distributions of categorical variables.

Box Plot

Box plots are an effective method of visualizing the median, the distribution of a variable, and outliers in the dataset, providing a summary of several important statistics in EDA.

Violin Chart

Violin plots combine aspects of box plots and density plots. Like box plots, violin plots show key characteristics of a variable, such as the median and interquartile ranges. Additionally, they include a kernel density estimation that visually represents the distribution's density, indicating how data points are distributed across different values. This feature allows violin plots to show the overall distribution's shape and density, providing deeper insights into the data compared to box plots alone.

Histogram

Histograms are useful in EDAs for displaying the distribution of numerical data by grouping it into bins and showing the frequency of data points in each bin. This helps to identify the shape of the distribution, detect outliers, and understand central tendencies. This visualization is important for guiding data transformations and informing further statistical analyses.
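
A minimal sketch of a histogram in pandas/matplotlib, assuming the hypothetical "income" column; the choice of 30 bins is an assumption:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("survey.csv")  # hypothetical file

    # Bin "income" into 30 intervals and plot the count of observations in each bin.
    df["income"].plot(kind="hist", bins=30, edgecolor="black", title="Income distribution")
    plt.xlabel("Income")
    plt.show()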

Pie Chart

Pie charts are useful in EDA for visually representing the proportional distribution of categorical data. By dividing a circle into slices that each represent a category's share of the whole, pie charts make it easy to compare the relative sizes of categories at a glance. They are effective for any data where understanding the proportionate contribution of various categories is important.

Line Graph

Line charts are useful in EDAs for visualizing changes and trends over time. By connecting data points with a line, they illustrate how a variable evolves across different time intervals, allowing for easy identification of upward or downward trends, cyclical patterns, and anomalies. This makes line charts especially useful for time series data in fields such as finance, economics, and healthcare. They provide a clear, straightforward way to track the progress of metrics over time and assess the impact of specific events or interventions on these metrics.

Correlation Matrix

Correlation matrices are useful in EDAs for examining the relationships between multiple variables simultaneously. They provide an overview of correlation coefficients indicating the strength and direction of the linear relationships between pairs of variables. Variables that are strongly related can be quickly identified and considered for dimension reduction.
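
A sketch of rendering the correlation matrix as a heatmap with pandas and matplotlib, using the same hypothetical file:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("survey.csv")  # hypothetical file
    corr = df.corr(numeric_only=True)

    # Draw the matrix as a colored grid, with variable names on both axes.
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(range(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax)
    plt.show()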
