
Data Exploration

A guide to methods and resources about exploratory data analysis.

About

This page gives you a concise overview of some programmatic approaches to exploratory data analysis using the Python programming language. These scripts can generally be used in any programming environment, but have been optimized for Google Colab notebooks.

Contents:

Libraries
Reading Files
Table Characteristics
Summary Statistics (univariate)
Data Quality Assessment
Visualizations

Libraries

#Import Libraries code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
! pip install seaborn
import seaborn as sb

"numpy" and "pandas" are foundational libraries for numeric computation and data science. They have been imported here with aliases "as np" and "as pd" to facilitate easy use later in the script. matplotlib is a versatile library for data visualization and has been imported with the alias "plt." Seaborn is a more advanced visualization library that is not included in the basic python programming environment and must be installed before using. In Google Colab notes books and exclamation point "!" is added before the pip package manager - this is a practice common in online programming environments, but is typically not needed in desktop based environments.

Reading Files

data = pd.read_csv('filename.csv')

The above function will "read" the file that you uploaded into the Colab programming environment. Replace the dataset in dataset.csv with the file name exactly as it appears in the Colab file manager tab. If your dataset is not a csv format try one of the functions below as needed:

  • data = pd.read_excel('dataset.xlsx')
  • data = pd.read_sql_query('SELECT * FROM table_name', connection)
  • data = pd.read_sas('filename.sas7bdat')
  • data = pd.read_parquet('filename.parquet')
  • data = pd.read_feather('filename.feather')
  • data = pd.read_orc('filename.orc')
  • data = pd.read_fwf('filename.txt')

The examples below anticipate Excel, CSV, and TSV file types.
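
For TSV (tab-separated) files, pd.read_csv() can be reused with the tab character declared as the separator; a minimal sketch assuming a hypothetical filename.tsv:

data = pd.read_csv('filename.tsv', sep='\t')  # '\t' tells pandas the columns are separated by tabs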

Table Characteristics

data.head()

This will display the first 5 rows of your dataset by default. Note that in Python the row numbering starts at 0. This function will give you a snippet view of the dataset so you can see what is in it, and is also a simple way to verify that the data have been loaded and read correctly. You can modify the number of rows displayed by adding a number within the parentheses: data.head(10) will display the first 10 rows.

Example from the "mtcars" dataset

    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   
3  16.0          8         304.0       150.0    3433          12.0   
4  17.0          8         302.0       140.0    3449          10.5   

data.tail()

This works in the same way as data.head(), but displays the last five rows of the dataset.

data.iloc[10:20]

Similarly, the .iloc[] indexer will display a selected range of rows, in this example rows 10 through 19 (the end of the slice is exclusive). This is very useful for diagnostic purposes if you are getting an error message for a specific line.

len(data)

The len() function will display the number of observations (rows) in the dataset

len(data.columns)

A modification of the foregoing, this will display the number of columns in a dataset.
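
Both counts can also be retrieved at once from the DataFrame's shape attribute:

print(data.shape)  # returns a tuple: (number of rows, number of columns)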

data.info()

This will display the variable names, the number of non-null values, and the data types. The data types shown are pandas dtypes: numeric types are denoted as float64 and int64, while string types are denoted as object.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB

Correlation Matrix

#Correlation Matrix
plt.figure(figsize=(10, 8))
# numeric_only=True limits the correlation to numeric columns (text columns would otherwise cause an error)
sb.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.show()

The correlation matrix is made with the matplotlib library and a seaborn function. matplotlib.pyplot was imported above with the alias "plt", and seaborn with the alias "sb".
The first parameter is "figsize", with the dimensions specified as a tuple: the first value is the width and the second is the height. The unit is nominally inches, so 10 x 8 produces a figure slightly wider than it is tall.

The seaborn function for heatmaps is called next; this creates a much nicer-looking matrix than native matplotlib, and one that is easier to create.

The first argument passed to the Seaborn heatmap is the result of data.corr(), followed by the annotation argument "annot", which is set to True so that the correlation coefficients are displayed in each cell.

The next argument is "cmap" or color map, and this regulates what kind of color range will be displayed.  There are several options for this argument:

  • coolwarm: A blue to red diverging colormap, good for center-zero data.
  • bwr: Blue-white-red diverging colormap.
  • seismic: A darker blue-white-red diverging colormap.
  • Spectral: A red-orange-yellow-green-blue diverging colormap.

The next argument, fmt=".2f", sets the precision of the displayed coefficients to two decimal places.

The final argument, linewidths, draws a line around each matrix cell and increases legibility.

Summary Statistics

Summary Statistics are descriptive summaries of the range, distribution and central tendency of a dataset, and include things like:

  • Cardinality
  • Mean
  • Median
  • Standard Deviation
  • Skew
  • Kurtosis

Cardinality/ Number of Unique values

print(data['column_name'].nunique())

This function will generate the total number of unique values in a field.
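
Closely related, the value_counts() method lists each unique value together with how often it occurs, which is useful for spotting unexpected categories:

print(data['column_name'].value_counts())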

Central Tendency - Mean

data['column_name'].mean()

This function will find the arithmetic average of a numeric-type field

Central Tendency - Median

data['column_name'].median()

Similarly this function will produce the median value in a numeric-type variable

Standard Deviation

data['column_name'].std()

This function will generate the standard deviation of a numeric-type  variable

Variance

data['column_name'].var()

This function will generate a statistical measure that describes the spread of values in the numeric variable

Skewness

data['column_name'].skew()

This function produces a statistic that represents the asymmetry of a distribution

Kurtosis

data['column_name'].kurt()

Kurtosis is a measure of the peakedness of a distribution

Maximum value

data['column_name1'].max()

This function will indicate the highest value in a field

Minimum value

data['column_name1'].min()

This function will indicate the lowest value in a field

Range

data['column_name1'].max() - data['column_name1'].min()

By subtracting the minimum value from the maximum value the total range of values can be determined.
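
Many of these statistics can also be generated in a single step with the describe() method, which reports the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column:

data.describe()
data.describe(include='all')   # include non-numeric columns as well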

Data Quality Assessment

Some common data quality issues that must be identified in an EDA include:

  • Null Values
  • Outliers
  • Duplicate observations
  • Errant White Spaces
  • Encoding Issues
  • Differences in Scale and Unit between observations
  • Cross Field Validation Errors

Null Values

A brief code example using the Titanic dataset with the .isna() and .sum() functions:

import seaborn as sns
import pandas as pd
data = sns.load_dataset('titanic')
print(data.isna().sum())

This will generate an output like the one below, showing the number of null values for each variable:

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64  

For most purposes this will be adequate to make mitigating pre-processing decisions. However, if you want to see a complete output for every value, you can use the .isna() function without adding .sum(). You will also need to adjust the output options, because the default settings show only the first and last few rows. This can be done with the pd.set_option('display.max_rows', None) function:

import seaborn as sns
import pandas as pd
data = sns.load_dataset('titanic')
pd.set_option('display.max_rows', None)
print(data.isna())

The output will look like the one below, where a null value is represented with True and a valid value is represented with False. This is because the .isna() function asks, for each value in the dataset, whether it is null.

 who  adult_male   deck  embark_town  alive  alone  
0    False       False   True        False  False  False  
1    False       False  False        False  False  False  
2    False       False   True        False  False  False  
3    False       False  False        False  False  False  
4    False       False   True        False  False  False  
5    False       False   True        False  False  False  
6    False       False  False        False  False  False  
7    False       False   True        False  False  False

Outliers

Some common techniques to identify outliers include the Z-score and the inter-quartile range (IQR); visual approaches like box plots or violin charts; and more advanced machine learning approaches like isolation forests and density-based clustering. Here we consider z-scores and the IQR, with a quick visual check sketched after the two examples:

  • Z-score

import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
data = sns.load_dataset('titanic').select_dtypes(include=[np.number]).dropna()  
zScores = np.abs(stats.zscore(data))
print(data[(zScores > 3).any(axis=1)])

  • Inter-quartile Range

import seaborn as sns
import numpy as np
import pandas as pd
data = sns.load_dataset('titanic').select_dtypes(include=[np.number]).dropna()
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
pd.set_option('display.max_rows', None)
print(data[((data < (Q1 - 1.5 * IQR)) | (Q3 + 1.5 * IQR < data)).any(axis=1)]) 

In the above example the first and third quartiles (0.25 and 0.75) are used with the conventional 1.5 multiplier; these thresholds can be changed as needed.
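
As a visual check, a box plot draws points beyond the whiskers (1.5 × IQR by default) as individual outliers; a minimal sketch using the 'fare' column of the same Titanic dataset:

import seaborn as sns
import matplotlib.pyplot as plt
data = sns.load_dataset('titanic')
sns.boxplot(x=data['fare'])   # points beyond the whiskers are potential outliers
plt.show()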

Duplication

Using the pandas library, inter-observation duplicates can be identified using the following:

print(data.duplicated())

This function will return a boolean value for each row indicating whether it is a duplicate (True) or not (False).

print(data[data.duplicated()])

This function will return rows that are duplicates

print(data[data.duplicated(subset=['fieldName1', 'fieldName2'])])

This function will return rows that are duplicates with respect to specific variables.
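
If only a count of duplicate rows is needed, the boolean result of .duplicated() can be summed:

print(data.duplicated().sum())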

Errant White Spaces

This issue is especially important when using categorical or text data. For example, "Morgan State" and "M organ State" are two different values in all programming environments and most software applications. These can be detected using a cardinality measure, which may show more unique values than expected, or by ordering values alphabetically.

Print alphabetically:

print(sorted(data['fieldName']))
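
Leading and trailing spaces can also be flagged directly by comparing each value with its stripped version; a minimal sketch, again assuming a hypothetical text column named fieldName:

ws_mask = data['fieldName'].astype(str) != data['fieldName'].astype(str).str.strip()
print(data[ws_mask])   # rows whose value changes when whitespace is stripped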

Encoding Issues

Non-Standard Characters: Data files might contain characters that do not render correctly due to incorrect or inconsistent character encoding across different systems (e.g., UTF-8 vs. ISO-8859-1). This will result in some characters looking strange, e.g., "é" may appear as "Ã©". Encoding issues often manifest early in the EDA process, but can confound analyses if not detected and corrected early on. To verify that a file reads correctly with a given encoding:

import pandas as pd
data = pd.read_csv('file.csv', encoding='utf-8')  # specify the encoding explicitly; replace 'utf-8' as needed (e.g., 'latin-1')
print(data.head())
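
If the file's encoding is unknown, a detection library such as chardet can guess it from the raw bytes before reading; a sketch (chardet may need to be installed first, as with seaborn above):

! pip install chardet
import chardet

with open('file.csv', 'rb') as f:           # read raw bytes, not decoded text
    guess = chardet.detect(f.read(100000))  # sample the beginning of the file

print(guess)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}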


Data Type Mismatches

Numeric vs. Categorical: Numeric codes used for categorical data (like 1 for Male, 2 for Female) might be mistakenly treated as numeric data, leading to inappropriate statistical analyses.

Data type mismatches can be detected using data.info(): a field that mixes categorical (text) and numeric values will be described as an object. A sketch for locating the values that fail to parse as numbers is shown below.
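
One way to locate the offending values in a column that should be numeric is to coerce it with pd.to_numeric(); entries that cannot be parsed become NaN and can then be inspected. A minimal sketch assuming a hypothetical column named fieldName:

coerced = pd.to_numeric(data['fieldName'], errors='coerce')
# values that were present but could not be converted to a number
print(data.loc[coerced.isna() & data['fieldName'].notna(), 'fieldName'])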

Date Formats

Different representations of dates (e.g., DD/MM/YYYY vs. MM/DD/YYYY) can lead to incorrect parsing and interpretation of data. There is a wide range of possible date formats, but the following solution can summarize several different types and give some indication about date format variations:

import pandas as pd
from dateutil import parser

def detect_format(date_str):
    try:
        parsed_date = parser.parse(date_str, fuzzy=True)

        for fmt in ["%B %d, %Y", "%Y %B %d", "%m/%d/%y", "%B %d", "%m %B", "%m/%Y"]:
            try:
                if parsed_date.strftime(fmt) == date_str:
                    return fmt
            except ValueError:
                continue
        return "Unknown or mixed format"
    except ValueError:
        return "Unparsable format"

# 'data' is assumed here to be a list or pandas Series of date strings
formats = [detect_format(date) for date in data]

summary = pd.Series(formats).value_counts()

print(summary)

 

Text Encoding in Numerical Data

  • Commas in Numbers: Commas used as thousands separators, or as decimal marks in some countries, can be misinterpreted as decimal points or delimiters, leading to incorrect parsing, especially in comma-separated values datasets.

import pandas as pd
import locale
from locale import atof, setlocale, LC_NUMERIC

setlocale(LC_NUMERIC, 'en_US.UTF-8')

def convert_numeric(value):
    try:
        return atof(value)
    except ValueError:
        return pd.NA  # Return a missing value indicator if conversion fails

converters = {'Amount': convert_numeric}

df = pd.read_csv('path_to_your_file.csv', converters=converters)

print(df.head())
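
For the common case of commas used only as thousands separators, pandas can also strip them while reading via the thousands parameter, which avoids the locale setup above; a minimal sketch with the same hypothetical file and column:

df = pd.read_csv('path_to_your_file.csv', thousands=',')
print(df.dtypes)   # the 'Amount' column should now parse as a numeric type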

Scale and Unit 

Identifying scale and unit inconsistencies in datasets presents a range of complex challenges, including deciphering undocumented data, which leads to ambiguous interpretations about what exactly is being encoded and what scale is being used. A data dictionary should mitigate this issue; where no dictionary exists and the field names are not descriptive enough, the dataset may simply be unusable. Additionally, datasets may be encoded using different systems, such as imperial and metric, which will confound certain analyses like regression.

Many scale and unit issues can be detected with visualizations, personal knowledge of how variables should 'behave', and the context in which data were collected, and can be more easily understood with comparative visualizations.

If your data can be grouped (e.g., by geographical area, time, or category), compare summary statistics or visualizations for each group. Significant differences might indicate scale or unit discrepancies.

groups = data.groupby('category')
print(groups['data_column'].describe())
data.boxplot(column='fieldName', by='category')
plt.show()

If you know what units are expected, write assertions or checks to confirm these assumptions. For example, if temperatures are supposed to be in Celsius, values like 200 might be erroneous or indicative of scale/unit issues (e.g., Fahrenheit vs. Celsius).

assert data['temperature'].max() < 60, "Temperature values suggest incorrect units"

Anomalies, which can be due to scale or unit errors, can be detected using certain statistical tests or algorithms. These could be statistical methods like Z-scores or more advanced machine learning approaches.

from scipy.stats import zscore
data['z_scores'] = zscore(data['fieldName'])
# Assuming any z-score greater than 3 is an anomaly
print(data[data['z_scores'].abs() > 3])

Visualizations

Example of Scatter Plot Code using the Iris Dataset

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the Iris dataset
iris = sns.load_dataset('iris')

# Display the first few rows of the dataset to understand its structure
print(iris.head())

# Create a scatter plot
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species', style='species', palette='bright')

# Adding title and labels
plt.title('Iris Dataset: Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')

# Show the plot
plt.show()


 

Example of a Scatterplot with Trendline Code using the Tips dataset

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')

# Display the first few rows of the dataset to understand its structure
print(tips.head())

# Create a scatter plot with a regression line
sns.lmplot(x='total_bill', y='tip', data=tips, aspect=1.5, ci=None, scatter_kws={'alpha':0.5})

# Adding title and labels
plt.title('Relationship Between Total Bill and Tip')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')

# Show the plot
plt.show()

Example of a Multivariate Barchart Code using the Titanic Dataset:

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Load the 'titanic' dataset from seaborn
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset to understand its structure
print(titanic.head())

# Drop rows with missing values to avoid errors during plotting
titanic = titanic.dropna(subset=['age', 'embarked'])

# Create a multivariate bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='class', y='survived', hue='sex', data=titanic, errorbar=None, estimator=sum)

# Adding title and labels
plt.title('Survival Counts by Passenger Class and Sex')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Survivors')
plt.show()

Example of Barchart Code using Flights Dataset

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'flights' dataset from seaborn
flights = sns.load_dataset('flights')

# Aggregate data to get the total number of passengers per year
yearly_passengers = flights.groupby('year')['passengers'].sum().reset_index()

# Create a simple bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='year', y='passengers', data=yearly_passengers, color='b')  # Using a consistent color for simplicity

# Adding title and labels
plt.title('Total Number of Passengers per Year')
plt.xlabel('Year')
plt.ylabel('Total Passengers')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show the plot
plt.show()

Example of Histogram Code using Diamonds Dataset

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'diamonds' dataset from seaborn
diamonds = sns.load_dataset('diamonds')

# Display the first few rows of the dataset to understand its structure
print(diamonds.head())

# Create a simple histogram of diamond prices
plt.figure(figsize=(10, 6))
sns.histplot(diamonds['price'], kde=False, color='blue', binwidth=500)

# Adding title and labels
plt.title('Distribution of Diamond Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')

# Show the plot
plt.show()

 

Example of Pie Chart Code using Titanic Dataset

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'titanic' dataset from seaborn
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset to understand its structure
print(titanic.head())

# Prepare data for the pie chart: count the number of passengers in each class
class_counts = titanic['class'].value_counts()

# Create a pie chart
plt.figure(figsize=(8, 8))  # Square figure to ensure the pie chart is circular
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'coral', 'gold'])

# Adding title
plt.title('Distribution of Passengers by Class')

# Show the plot
plt.show()

 

Example of Violin Plot Code Using Iris Data

# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'iris' dataset from seaborn
iris = sns.load_dataset('iris')

# Display the first few rows of the dataset to understand its structure
print(iris.head())

# Create a violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='species', y='sepal_length', data=iris)

# Adding title and labels
plt.title('Distribution of Sepal Lengths Across Iris Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')

# Show the plot
plt.show()

 

 
