This page gives a concise overview of programmatic approaches to exploratory data analysis (EDA) using the Python programming language. These scripts can generally be used in any programming environment, but they have been optimized for Google Colab notebooks.
Contents:
Libraries
Reading Files
Table Characteristics
Summary Statistics (univariate)
Data Quality Assessment
Visualizations
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
! pip install seaborn
import seaborn as sb
"numpy" and "pandas" are foundational libraries for numeric computation and data science. They have been imported here with aliases "as np" and "as pd" to facilitate easy use later in the script. matplotlib is a versatile library for data visualization and has been imported with the alias "plt." Seaborn is a more advanced visualization library that is not included in the basic python programming environment and must be installed before using. In Google Colab notes books and exclamation point "!" is added before the pip package manager - this is a practice common in online programming environments, but is typically not needed in desktop based environments.
data = pd.read_csv('filename.csv')
The above function will "read" the file that you uploaded into the Colab programming environment. Replace filename.csv with the file name exactly as it appears in the Colab file manager tab. If your dataset is not in CSV format, try one of the functions below as needed:
The examples below cover the Excel, CSV, and TSV data types.
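A minimal sketch of the corresponding read functions; the file names are placeholders for your own files, and reading Excel files may require the openpyxl package:
data = pd.read_csv('filename.csv')            # comma-separated values
data = pd.read_csv('filename.tsv', sep='\t')  # tab-separated values
data = pd.read_excel('filename.xlsx')         # Excel workbooks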
data.head()
This will display the first 5 rows of your dataset. Note that in Python the first row is numbered "0". This function gives you a snippet view of the dataset so you can see what is in it, and it is also a simple way to verify that the data have been loaded and read correctly. You can modify the number of rows displayed by adding a number within the parentheses: data.head(10) will display the first 10 rows.
Example from the "mpg" (auto-mpg) dataset
mpg cylinders displacement horsepower weight acceleration \
0 18.0 8 307.0 130.0 3504 12.0
1 15.0 8 350.0 165.0 3693 11.5
2 18.0 8 318.0 150.0 3436 11.0
3 16.0 8 304.0 150.0 3433 12.0
4 17.0 8 302.0 140.0 3449 10.5
data.tail()
This works in the same way as data.head(), but displays the last 5 rows of the dataset.
data.iloc[10:20]
Similarly, the .iloc[] indexer will display a selected range of rows, in this example rows 10 through 19 (the end of the range is exclusive). This is very useful for diagnostic purposes if you are getting an error message for a specific line.
len(data)
The len() function will display the number of observations (rows) in the dataset.
len(data.columns)
A modification of the foregoing, this will display the number of columns in a dataset.
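As an aside, a DataFrame's shape attribute returns both counts at once as a (rows, columns) tuple:
print(data.shape)  # e.g., (398, 9): rows, then columns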
data.info()
This will display the variable names, the number of non-null values, and the data types. The data types shown are those used by the pandas library: numeric types are denoted as float64 and int64, while string types are denoted as object.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 398 non-null float64
1 cylinders 398 non-null int64
2 displacement 398 non-null float64
3 horsepower 392 non-null float64
4 weight 398 non-null int64
5 acceleration 398 non-null float64
6 model year 398 non-null int64
7 origin 398 non-null int64
8 car name 398 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 28.1+ KB
Correlation Matrix
#Correlation Matrix
plt.figure(figsize=(10, 8))
sb.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.show()
The correlation matrix is made with the matplotlib library and an inserted seaborn function; matplotlib.pyplot is imported under the alias plt in the libraries section above, and seaborn under sb.
The first parameter is figsize, with the dimensions specified in a tuple: the first value is the width and the second is the height, nominally in inches.
The seaborn heatmap function is called next; this creates a much nicer-looking matrix than native matplotlib, and one that is easier to produce.
The first argument to the heatmap is data.corr(numeric_only=True), which computes the pairwise correlation coefficients of the numeric columns (newer versions of pandas require numeric_only=True when the dataset also contains text fields). This is followed by the annotation argument annot=True, meaning the correlation coefficients will be displayed in each cell.
The next argument is cmap, or color map, which regulates what kind of color range will be displayed; "coolwarm" is used here, and other built-in options include "viridis" and "RdBu_r".
Then fmt=".2f" sets the precision of the displayed coefficients to two decimal places.
The final argument, linewidths, draws a line around each matrix cell to increase legibility.
Summary statistics are descriptive summaries of the range, distribution, and central tendency of a dataset, and include the measures below (a combined summary via describe() is shown after the list):
Cardinality / Number of Unique Values
print(data['column_name'].nunique())
This function will generate the total number of unique values in a field.
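To also see what the unique values are and how often each occurs, the value_counts() method produces a frequency table (column_name is a placeholder for your own field):
print(data['column_name'].value_counts())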
Central Tendency - Mean
data['column_name'].mean()
This function will find the arithmetic average of a numeric-type field.
Central Tendency - Median
data['column_name'].median()
Similarly, this function will produce the median value of a numeric-type variable.
Standard Deviation
data['column_name'].std()
This function will generate the standard deviation of a numeric-type variable.
Variance
data['column_name'].var()
This function will generate the variance, a statistical measure that describes the spread of values in a numeric variable.
Skewness
data['column_name'].skew()
This function produces a statistic that represents the asymmetry of a distribution.
Kurtosis
data['column_name'].kurt()
Kurtosis is a measure of the "peakedness" of a distribution; more precisely, it reflects the weight of the distribution's tails.
Maximum value
data['column_name1'].max()
This function will indicate the highest value in a field.
Minimum value
data['column_name1'].min()
This function will indicate the lowest value in a field.
Range
data['column_name1'].max() - data['column_name1'].min()
By subtracting the minimum value from the maximum value, the total range of values can be determined.
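Many of these statistics can be generated in a single call with the pandas describe() method; a minimal sketch, where column_name is again a placeholder:
print(data.describe())                 # count, mean, std, min, quartiles, and max for each numeric column
print(data['column_name'].describe())  # the same summary for a single field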
Some common data quality issues that must be identified in an EDA include:
Null Values
A brief example using the Titanic dataset with the .isna() and .sum() functions:
import seaborn as sns
import pandas as pd
data = sns.load_dataset('titanic')
print(data.isna().sum())
This will generate an output like the one below, showing the number of null values for each variable:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64
For most purposes this will be adequate for making mitigating pre-processing decisions. However, if you want to see a complete output of each value, you can use the .isna() function without adding .sum(). You will also need to adjust the output options, because the default settings will show only the first and last few rows. This can be done with pd.set_option('display.max_rows', None):
import seaborn as sns
import pandas as pd
data = sns.load_dataset('titanic')
pd.set_option('display.max_rows', None)
print(data.isna())
The output will look like the excerpt below, where a null value is represented with True and a valid value is represented with False. This is because the .isna() function asks, for each entry in the dataset, whether that value is null.
who adult_male deck embark_town alive alone
0 False False True False False False
1 False False False False False False
2 False False True False False False
3 False False False False False False
4 False False True False False False
5 False False True False False False
6 False False False False False False
7 False False True False False False
Outliers
Some common techniques to identify outliers include statistical measures like the Z-score and the inter-quartile range (IQR); visual approaches like box plots or violin charts; and more advanced machine learning approaches like isolation forests and density-based clustering. Here we consider Z-scores and the IQR, beginning with rows whose absolute Z-score exceeds 3 in any numeric column:
import seaborn as sns
import numpy as np
import pandas as pd
from scipy import stats
# Keep only the numeric columns and drop rows with missing values
data = sns.load_dataset('titanic').select_dtypes(include=[np.number]).dropna()
# Compute absolute Z-scores and flag rows where any column exceeds 3
zScores = np.abs(stats.zscore(data))
print(data[(zScores > 3).any(axis=1)])
And the equivalent check using the inter-quartile range (IQR):
import seaborn as sns
import numpy as np
import pandas as pd
data = sns.load_dataset('titanic').select_dtypes(include=[np.number]).dropna()
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
pd.set_option('display.max_rows', None)
# Flag rows falling more than 1.5 * IQR below Q1 or above Q3 in any column
print(data[((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)])
In the above example we use the 25th and 75th percentiles (quantile(0.25) and quantile(0.75)) with the conventional 1.5 * IQR fences, but these values can be changed as needed.
Duplication
Using the pandas library, inter-observation duplicates can be identified using the following:
print(data.duplicated())
This function will return each row with a boolean value for whether it is a duplicate (True) or not (False).
print(data[data.duplicated()])
This function will return rows that are duplicates
print(data[data.duplicated(subset=['fieldName1', 'fieldName2'])])
This function will return rows that are duplicated only in specific variables. Once identified, duplicates can be removed as sketched below.
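A minimal sketch of removal with the pandas drop_duplicates() method, which keeps the first occurrence of each duplicated row; the field names are placeholders:
data = data.drop_duplicates()                                     # drop fully duplicated rows
data = data.drop_duplicates(subset=['fieldName1', 'fieldName2'])  # drop rows duplicated in specific fields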
Errant White Spaces
This issue is especially important when using categorical or text data. For example, "Morgan State" and "M organ State" are two different values in all programming environments and most software applications. Errant white spaces can be detected using a cardinality measure, which may show more unique values than expected, or by ordering values alphabetically.
Print alphabetically:
print(sorted(data['fieldName']))
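Leading and trailing white space can also be detected and removed directly with the pandas string method .str.strip(); a minimal sketch, where fieldName is a placeholder for a text column:
# Flag rows whose values change after stripping leading/trailing white space
mask = data['fieldName'].notna() & (data['fieldName'] != data['fieldName'].str.strip())
print(data[mask])
# Remove the errant white space
data['fieldName'] = data['fieldName'].str.strip()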
Encoding Issues
Non-Standard Characters: Data files might contain characters that do not render correctly due to incorrect or inconsistent character encoding across different systems (e.g., UTF-8 vs. ISO-8859-1). This will result in some characters looking strange; for example, "&" may appear as "&amp;". Encoding issues often manifest early in the EDA process, but can confound analyses if not detected and corrected early on. To verify an encoding, pass it explicitly when reading the file:
import pandas as pd
data = pd.read_csv('file.csv', encoding='ISO-8859-1')  # substitute the encoding you want to test
print(data.head())
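If the encoding is unknown, the third-party chardet library (installed with pip install chardet) can guess it from the file's raw bytes; a minimal sketch, with file.csv as a placeholder:
import chardet
with open('file.csv', 'rb') as f:
    raw = f.read(100000)  # a sample of the file is usually sufficient
result = chardet.detect(raw)
print(result)  # e.g., {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}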
Data Type Mismatches
Numeric vs. Categorical: Numeric codes used for categorical data (like 1 for Male, 2 for Female) might be mistakenly treated as numeric data, leading to inappropriate statistical analyses.
Data type mismatches can often be detected using data.info(): a field that mixes categorical and numeric data, or that stores numbers as text, will be described as an object.
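Once identified, numeric codes that actually represent categories can be converted explicitly; a minimal sketch, where sex_code is a hypothetical column holding codes such as 1 and 2:
# Map the codes to readable labels, then treat the field as categorical
data['sex_label'] = data['sex_code'].map({1: 'Male', 2: 'Female'})
data['sex_code'] = data['sex_code'].astype('category')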
Date Formats
Different representations of dates (e.g., DD/MM/YYYY vs. MM/DD/YYYY) can lead to incorrect parsing and interpretation of data. There is a wide range of possible date formats, but the following solution can summarize several common types and give some indication of date format variations:
import pandas as pd
from dateutil import parser

def detect_format(date_str):
    try:
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Compare the parsed date against a set of candidate formats
        for fmt in ["%B %d, %Y", "%Y %B %d", "%m/%d/%y", "%B %d", "%m %B", "%m/%Y"]:
            try:
                if parsed_date.strftime(fmt) == date_str:
                    return fmt
            except ValueError:
                continue
        return "Unknown or mixed format"
    except ValueError:
        return "Unparsable format"

# 'date_column' is a placeholder; replace it with the name of your date field
formats = [detect_format(date) for date in data['date_column']]
summary = pd.Series(formats).value_counts()
print(summary)
Text Encoding in Numerical Data
Numbers stored as text, often with locale-specific formatting such as thousands separators (e.g., "1,234.56"), will not be read as numeric values. The locale module's atof() function can convert such strings as the file is read:
import pandas as pd
from locale import atof, setlocale, LC_NUMERIC

# Set the numeric locale; 'en_US.UTF-8' must be available on your system
setlocale(LC_NUMERIC, 'en_US.UTF-8')

def convert_numeric(value):
    try:
        return atof(value)  # parse a locale-formatted string such as "1,234.56"
    except ValueError:
        return pd.NA  # return a missing-value indicator if conversion fails

# 'Amount' is a placeholder for the column to convert
converters = {'Amount': convert_numeric}
df = pd.read_csv('path_to_your_file.csv', converters=converters)
print(df.head())
Scale and Unit
Identifying scale and unit inconsistencies in datasets presents a range of complex challenges, including the decipherment of undocumented data, leading to ambiguous interpretations of what exactly is being encoded and what scale is being used. A data dictionary should mitigate this issue; where no dictionary exists and the field names are not descriptive enough, the dataset may simply be unusable. Additionally, datasets may be encoded using different systems, such as imperial and metric, which will confound certain analyses like regression.
Many scale and unit issues can be detected with visualizations, personal knowledge of how variables should 'behave', and the context in which the data were collected, and they can be more easily understood with comparative visualizations.
If your data can be grouped (e.g., by geographical area, time, or category), compare summary statistics or visualizations for each group. Significant differences might indicate scale or unit discrepancies.
groups = data.groupby('category')
print(groups['data_column'].describe())
data.boxplot(column='fieldName', by='category')
plt.show()
If you know what units are expected, write assertions or checks to confirm these assumptions. For example, if temperatures are supposed to be in Celsius, values like 200 might be erroneous or indicative of scale/unit issues (e.g., Fahrenheit vs. Celsius).
assert data['temperature'].max() < 60, "Temperature values suggest incorrect units"
Anomalies that are due to scale or unit errors can be detected using certain statistical tests or algorithms, ranging from statistical methods like Z-scores to more advanced machine learning approaches.
from scipy.stats import zscore
data['z_scores'] = zscore(data['fieldName'])
# Assuming any z-score greater than 3 is an anomaly
print(data[data['z_scores'].abs() > 3])
Example of Scatter Plot Code using the Iris Dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the Iris dataset
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset to understand its structure
print(iris.head())
# Create a scatter plot
sns.scatterplot(data=iris, x='sepal_length', y='sepal_width', hue='species', style='species', palette='bright')
# Adding title and labels
plt.title('Iris Dataset: Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
# Show the plot
plt.show()
Example of a Scatterplot with Trendline Code using the Tips dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the 'tips' dataset from seaborn
tips = sns.load_dataset('tips')
# Display the first few rows of the dataset to understand its structure
print(tips.head())
# Create a scatter plot with a regression line
sns.lmplot(x='total_bill', y='tip', data=tips, aspect=1.5, ci=None, scatter_kws={'alpha':0.5})
# Adding title and labels
plt.title('Relationship Between Total Bill and Tip')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
# Show the plot
plt.show()
Example of a Multivariate Barchart Code using the Titanic Dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Load the 'titanic' dataset from seaborn
titanic = sns.load_dataset('titanic')
# Display the first few rows of the dataset to understand its structure
print(titanic.head())
# Drop rows with missing values to avoid errors during plotting
titanic = titanic.dropna(subset=['age', 'embarked'])
# Create a multivariate bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='class', y='survived', hue='sex', data=titanic, errorbar=None, estimator=sum)
# Adding title and labels
plt.title('Survival Counts by Passenger Class and Sex')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Survivors')
plt.show()
Example of Barchart Code using Flights Dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the 'flights' dataset from seaborn
flights = sns.load_dataset('flights')
# Aggregate data to get the total number of passengers per year
yearly_passengers = flights.groupby('year')['passengers'].sum().reset_index()
# Create a simple bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='year', y='passengers', data=yearly_passengers, color='b') # Using a consistent color for simplicity
# Adding title and labels
plt.title('Total Number of Passengers per Year')
plt.xlabel('Year')
plt.ylabel('Total Passengers')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Show the plot
plt.show()
Example of Histogram Code using Diamonds Dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the 'diamonds' dataset from seaborn
diamonds = sns.load_dataset('diamonds')
# Display the first few rows of the dataset to understand its structure
print(diamonds.head())
# Create a simple histogram of diamond prices
plt.figure(figsize=(10, 6))
sns.histplot(diamonds['price'], kde=False, color='blue', binwidth=500)
# Adding title and labels
plt.title('Distribution of Diamond Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
# Show the plot
plt.show()
Example of Pie Chart Code using Titanic Dataset
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the 'titanic' dataset from seaborn
titanic = sns.load_dataset('titanic')
# Display the first few rows of the dataset to understand its structure
print(titanic.head())
# Prepare data for the pie chart: count the number of passengers in each class
class_counts = titanic['class'].value_counts()
# Create a pie chart
plt.figure(figsize=(8, 8)) # Square figure to ensure the pie chart is circular
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=140, colors=['skyblue', 'coral', 'gold'])
# Adding title
plt.title('Distribution of Passengers by Class')
# Show the plot
plt.show()
Example of Violin Plot Code Using Iris Data
# Import necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the 'iris' dataset from seaborn
iris = sns.load_dataset('iris')
# Display the first few rows of the dataset to understand its structure
print(iris.head())
# Create a violin plot
plt.figure(figsize=(10, 6))
sns.violinplot(x='species', y='sepal_length', data=iris)
# Adding title and labels
plt.title('Distribution of Sepal Lengths Across Iris Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
# Show the plot
plt.show()