
Data Exploration

A guide to methods and resources about exploratory data analysis.

About

This page provides a concise overview of some programmatic approaches to exploratory data analysis using the R programming language. These scripts have been written for use in R and RStudio.

Contents:

Libraries
Reading Files
Table Characteristics
Summary Statistics (univariate)
Data Quality Assessment
Visualizations

Libraries

install.packages("tidyverse")
install.packages("data.table")
install.packages("plotly")
install.packages("shiny")
install.packages("corrplot")

Reading Files

setwd("C:/Users/name/Desktop/file") 

The setwd() function sets the working directory. The path should point to the folder containing your data file; the file name itself is not part of the path.

data <- read.csv(file = "file.csv", header = TRUE, sep = ",", as.is = FALSE)

The read.csv() function points to the file you will use for your project; the file must be located within the directory specified in the line above. The function takes several arguments. file= gives the actual name of the file; header= indicates whether the first line contains column names (TRUE) or not (FALSE); and sep= indicates which character separates the values. The example above reads a CSV (comma-separated values) file, so we use the "," comma symbol as the separator. Setting as.is = FALSE converts character columns to factors.

Other data formats can be read with similar functions:

data <- read.csv2(file = "file.csv", header = TRUE, sep = ";", dec = ",", as.is = FALSE)
read.csv2(): Used for reading CSV files where the separator is a semicolon (;), which is common in countries where the comma is used as the decimal mark (hence dec = ",").

data <- read.table(file = "file.txt", header = TRUE, sep = "\t", as.is = FALSE)
A more general function for reading data from a file in table format. It allows for various delimiters and handling of different data types.

data <- read.delim("file.txt", header = TRUE)
data <- read.delim2("file.txt", header = TRUE)
read.delim() and read.delim2() are variants of read.table() for reading tab-delimited files. read.delim2() is used for files where the decimal mark is a comma.

library(readxl)
data <- read_xlsx("file.xlsx", sheet = 1)

From the readxl package, read_xlsx() reads .xlsx files; read_excel() handles both .xls and .xlsx.

data <- read.fwf(file = "file.fwf", widths = c(5, 5, 4), header = TRUE)
This function is used for reading files with fixed-width formatted data.

lines <- readLines(con = "file.txt")
For reading lines from a text file, particularly useful when you need to process the text line by line.

data <- readRDS(file = "data.rds")
This function is used for reading in R objects saved in the RDS format.

load("data.RData")
To read in saved R data files (.RData or .rda), which may contain several R objects.

library(jsonlite)
data <- fromJSON("file.json")

From the jsonlite package, this function is used for reading JSON files.

Table Characteristics

head(data)

This will display the first six rows of your dataset. It gives a snippet view so you can see what the data contain, and it is also a simple way to verify that the data have been loaded and read correctly. You can change the number of rows displayed with a second argument: head(data, 10) will display the first 10 rows.

Example from the "mtcars" dataset:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  0    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  1    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

tail(data)

This works in the same way as head(data), but displays the last six rows.

data[10:20, ]

This R syntax displays a selected range of rows, in this example rows 10 to 20. Note the comma after the range: for a data frame, data[10:20] without it selects columns rather than rows. This is very useful for diagnostic purposes if you are getting an error message for a specific line.

nrow(data)

The nrow() function will display the number of observations (rows) in the dataset.

ncol(data)

A modification of the foregoing, this will display the number of columns in a dataset.
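
Both counts can be retrieved at once with the base dim() function, which returns the number of rows followed by the number of columns:

dim(data)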

str(data)

This displays the number of observations and variables, each variable's name and data type, and its first few values. Numeric columns are denoted num, while string columns are denoted chr (or Factor if converted).

str(mtcars)

'data.frame':  32 obs. of  11 variables:
 $ mpg         : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl         : num  6 6 4 6 8 ...
 $ disp        : num  160 160 108 258 360 ...
 $ hp          : num  110 110 93 110 175 ...
 $ drat        : num  3.9 3.9 3.85 3.08 3.15 ...
 $ wt          : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec        : num  16.5 17 18.6 19.4 17 ...
 $ vs          : num  0 0 1 1 0 ...
 $ am          : num  1 1 1 0 0 ...
 $ gear        : num  4 4 4 3 3 ...
 $ carb        : num  4 4 1 1 2 ...

 

# Correlation Matrix
library(corrplot)
num_data <- data[sapply(data, is.numeric)]  # cor() requires numeric columns
corrplot(cor(num_data, use = "complete.obs"), method = "circle")

The correlation matrix is drawn with the corrplot package. cor(num_data, use = "complete.obs") computes the correlation matrix from the numeric columns, ignoring rows with missing values, and method = "circle" specifies the glyph used in the plot. This creates a visually appealing matrix that is easier to interpret than the native plot functions in R.

Summary Statistics (univariate)

Summary Statistics are descriptive summaries of the range, distribution, and central tendency of a dataset, and include things like:

Cardinality / Number of Unique Values

length(unique(data$column_name))
This function will generate the total number of unique values in a field.
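
To see the distinct values themselves along with how often each occurs, the base table() function can be combined with sort():

sort(table(data$column_name), decreasing = TRUE)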

Central Tendency - Mean

mean(data$column_name, na.rm = TRUE)
This function will find the arithmetic average of a numeric-type field.

Central Tendency - Median

median(data$column_name, na.rm = TRUE)
Similarly, this function will produce the median value in a numeric-type variable.

Standard Deviation

sd(data$column_name, na.rm = TRUE)
This function will generate the standard deviation of a numeric-type variable.

Variance

var(data$column_name, na.rm = TRUE)
This function will generate a statistical measure that describes the spread of values in the numeric variable.

Skewness

library(e1071)
skewness(data$column_name, na.rm = TRUE)

This function produces a statistic that represents the asymmetry of a distribution.

Kurtosis

kurtosis(data$column_name, na.rm = TRUE)
Kurtosis measures the heaviness of a distribution's tails relative to a normal distribution (often described as 'peakedness').

Maximum Value

max(data$column_name, na.rm = TRUE)
This function will indicate the highest value in a field.

Minimum Value

min(data$column_name, na.rm = TRUE)
This function will indicate the lowest value in a field.

Range

max(data$column_name, na.rm = TRUE) - min(data$column_name, na.rm = TRUE)
By subtracting the minimum value from the maximum value, the total range of values can be determined.
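
Many of these statistics can be produced in one step. The base summary() function reports the minimum, quartiles, median, mean, and maximum of every numeric column (and counts of NA values), while range() returns the minimum and maximum together:

summary(data)
diff(range(data$column_name, na.rm = TRUE))  # equivalent to max - min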

Data Quality Assessment

Some common data quality issues that must be identified in an EDA include:

  • Null Values
  • Outliers
  • Duplicate observations
  • Errant White Spaces
  • Encoding Issues
  • Differences in Scale and Unit between observations

Null Values

A brief example using the Titanic dataset with the is.na() and colSums() functions:

data("titanic_train", package = "titanic")
na_counts <- colSums(is.na(titanic_train))
print(na_counts)

This will generate output like the following, showing the number of null values for each variable:

PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0 

For most purposes this will be adequate for making pre-processing decisions. However, if you want to see a complete output of every value, you can use the is.na() function without colSums(). You will also need to adjust the output options, because the default settings truncate long output. This can be done with options(max.print = 99999):

library(titanic)
data <- titanic_train
options(max.print = 99999)
print(is.na(data))

The output will look like the sample below, where a null value is represented as TRUE and a valid value as FALSE; is.na() asks, cell by cell, whether each value is missing.

 [42,]       FALSE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [43,]       FALSE    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [44,]       FALSE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [45,]       FALSE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [46,]       FALSE    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [47,]       FALSE    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [48,]       FALSE    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [49,]       FALSE    FALSE  FALSE FALSE FALSE  TRUE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [50,]       FALSE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE    FALSE
 [51,]       FALSE    FALSE  FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE    FALSE
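
To express the per-column counts as percentages of all observations, a minimal sketch using the base colMeans() function:

round(colMeans(is.na(titanic_train)) * 100, 1)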

Outliers

Common techniques for identifying outliers include statistical rules such as z-scores and the inter-quartile range (IQR), visual approaches such as box plots and violin plots, and more advanced machine-learning approaches such as isolation forests and density-based clustering. Here we consider z-scores and the IQR:

  • Z-score

library(titanic)
library(dplyr)

# Keep only the numeric columns and drop rows with missing values
data <- titanic_train %>%
  select_if(is.numeric) %>%
  na.omit()

# Absolute z-score of every value, computed column by column
zScores <- apply(data, 2, function(x) abs(scale(x)))

# Keep rows where any column lies more than 3 standard deviations from its mean
outliers <- data[rowSums(zScores > 3) > 0, ]
print(outliers)

The output will look like this:

   PassengerId Survived Pclass Age SibSp Parch     Fare
14          14        0      3  39     1     5  31.2750
17          17        0      3   2     4     1  29.1250
26          26        1      3  38     1     5  31.3875
28          28        0      1  19     3     2 263.0000
51          51        0      3   7     4     1  39.6875
60          60        0      3  11     5     2  46.9000
69          69        1      3  17     4     2   7.9250

  • Inter-quartile Range

library(titanic)
library(dplyr)

# Keep only the numeric columns and drop rows with missing values
data <- titanic_train %>%
  select_if(is.numeric) %>%
  na.omit()

# First and third quartiles, and the inter-quartile range, for each column
Q1 <- apply(data, 2, quantile, probs = 0.25)
Q3 <- apply(data, 2, quantile, probs = 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Keep rows where any column falls outside its bounds
outliers <- data[apply(data, 1, function(row) any(row < lower_bound | row > upper_bound)), ]
options(max.print = 99999)
print(outliers)

Example output:

   PassengerId Survived Pclass Age SibSp Parch     Fare
8            8        0      3   2     3     1  21.0750
14          14        0      3  39     1     5  31.2750
17          17        0      3   2     4     1  29.1250
25          25        0      3   8     3     1  21.0750
26          26        1      3  38     1     5  31.3875
28          28        0      1  19     3     2 263.0000

Duplication
Using R, inter-observation duplicates can be identified with the base duplicated() function, either across all fields or on a subset of fields:

print(duplicated(data))
print(data[duplicated(data), ])
print(data[duplicated(data[, c("fieldName1", "fieldName2")]), ])
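
To remove duplicates rather than only flag them, dplyr's distinct() keeps the first occurrence of each row; a minimal sketch, reusing the hypothetical field names above:

library(dplyr)
deduped <- distinct(data)  # drop rows that duplicate an earlier row in full
deduped <- distinct(data, fieldName1, fieldName2, .keep_all = TRUE)  # de-duplicate on selected fields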

Errant White Spaces
This issue is especially important when using categorical or text data. For example, 'Morgan State' and 'M organ State' are two different values in any programming environment and in most software applications. Errant white spaces can be detected with a cardinality measure, which may show more distinct values than expected, or by sorting values alphabetically.

Print alphabetically:

print(sort(data$fieldName))
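
Once found, leading and trailing spaces can be stripped with the base trimws() function, and internal runs of white space collapsed with gsub(); a minimal sketch, assuming a character field named fieldName:

data$fieldName <- trimws(data$fieldName)             # remove leading/trailing spaces
data$fieldName <- gsub("\\s+", " ", data$fieldName)  # collapse internal runs of white space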

Encoding Issues
Non-Standard Characters: Data files might contain characters that do not render correctly due to incorrect or inconsistent character encoding across different systems (e.g., UTF-8 vs. ISO-8859-1). This makes some characters look strange; for example, an apostrophe may appear as "â€™". Encoding issues often manifest early in the EDA process but can confound analyses if not detected and corrected early on. To read a file with an explicit encoding:

library(readr)
data <- read_csv("file.csv", locale = locale(encoding = "encoding"))
print(head(data))
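
If the encoding is unknown, readr's guess_encoding() will inspect the raw bytes and list likely candidates:

library(readr)
print(guess_encoding("file.csv"))  # candidate encodings with confidence scores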

Data Type Mismatches

Numeric vs. Categorical: Numeric codes used for categorical data (like 1 for Male, 2 for Female) might be mistakenly treated as numeric data, leading to inappropriate statistical analyses.

Data type mismatches can be detected with str(data): a field that mixes categorical and numeric entries will typically be read as character (or factor) rather than numeric.
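
Numeric codes that actually encode categories should be converted to factors before analysis. A minimal sketch, assuming a hypothetical column Sex coded 1/2:

# Hypothetical coding: 1 = Male, 2 = Female
data$Sex <- factor(data$Sex, levels = c(1, 2), labels = c("Male", "Female"))
str(data$Sex)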

Date Formats
Different representations of dates (e.g., DD/MM/YYYY vs. MM/DD/YYYY) can lead to incorrect parsing and interpretation of data. There is a wide range of possible date formats, but the following R code can classify several common types and give some indication of format variations within a field:

detect_format <- function(date_str) {
  # Candidate strptime formats; %B (month names) is locale-dependent
  formats <- c("%B %d, %Y", "%Y %B %d", "%m/%d/%y", "%m/%Y", "%B %d", "%d %B")
  for (fmt in formats) {
    # as.Date() returns NA, rather than an error, when the format does not match
    if (!is.na(as.Date(date_str, format = fmt))) {
      return(fmt)
    }
  }
  return("Unknown or mixed format")
}

dates <- c("January 1, 2020", "2020 January 2", "12/31/20", "January 31", "5 January", "05/2020")
formats_found <- sapply(dates, detect_format)
print(table(formats_found))
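
As an alternative, lubridate's parse_date_time() can parse a vector containing a mix of formats in a single call; a minimal sketch:

library(lubridate)
dates <- c("January 1, 2020", "2020 January 2", "12/31/20")
print(parse_date_time(dates, orders = c("mdy", "ymd")))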

Text Encoding in Numerical Data

Commas in Numbers: Commas used as thousand separators or as decimals in some countries can be misinterpreted as decimal points or delimiters, leading to incorrect parsing, especially in comma-separated values datasets.

library(readr)

# Locale where "." is the decimal mark and "," is the thousands separator
locale_set <- locale(decimal_mark = ".", grouping_mark = ",")

# parse_number() strips grouping marks and returns NA (with a warning) when conversion fails
amounts <- c("1,234.56", "12,345", "not a number")
print(parse_number(amounts, locale = locale_set))

# The same locale applies when reading a whole file
data <- read_csv("path_to_your_file.csv", col_types = cols(Amount = col_double()), locale = locale_set)
print(head(data))

Scale and Unit
Identifying scale and unit inconsistencies presents a range of complex challenges, including deciphering undocumented data, which leads to ambiguous interpretations of what exactly is being encoded and at what scale. Many scale and unit issues can be detected with comparative visualizations, combined with domain knowledge of how variables should 'behave' and of the context in which the data were collected.

library(ggplot2)
library(dplyr)

# Compare summary statistics across groups; large differences can flag unit or scale issues
summary_stats <- data %>%
  group_by(category) %>%
  summarise(mean = mean(data_column, na.rm = TRUE),
            sd   = sd(data_column, na.rm = TRUE),
            min  = min(data_column, na.rm = TRUE),
            max  = max(data_column, na.rm = TRUE))
print(summary_stats)

# Comparative boxplots make scale differences between groups visible
ggplot(data, aes(x = category, y = data_column)) +
  geom_boxplot() +
  labs(title = "Boxplot by Category", x = "Category", y = "Data Column")

# A sanity check based on domain knowledge, e.g. Celsius temperatures above 60
if (any(data$temperature > 60, na.rm = TRUE)) {
  stop("Temperature values suggest incorrect units")
}

# Flag extreme values with z-scores
data <- mutate(data, z_scores = as.numeric(scale(data_column)))
print(filter(data, abs(z_scores) > 3))
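
Once an inconsistency is confirmed, the offending observations can be converted to a common unit; a minimal sketch, assuming hypothetical weight and unit columns:

library(dplyr)
# Hypothetical: convert rows recorded in pounds to kilograms
data <- mutate(data, weight_kg = ifelse(unit == "lb", weight * 0.453592, weight))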

 

Visualizations

Example of Scatter Plot Code using the Iris Dataset

# Load necessary library
library(ggplot2)

# The iris dataset ships with R's built-in datasets package
data(iris)

# Display the first few rows of the dataset to understand its structure
print(head(iris))

# Create a scatter plot using ggplot2
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species, shape = Species)) +
  geom_point() +
  labs(title = 'Iris Dataset: Sepal Length vs Sepal Width',
       x = 'Sepal Length (cm)',
       y = 'Sepal Width (cm)') +
  theme_minimal()  # Adds a minimalistic theme to the plot

Example of a Scatterplot with Trendline Code using the Tips dataset

# Load necessary libraries
library(ggplot2)
library(reshape2)  # Ensure this package is installed for the 'tips' dataset

# Load the 'tips' dataset from the reshape2 package
data("tips", package = "reshape2")

# Display the first few rows of the dataset to understand its structure
print(head(tips))

# Create a scatter plot with a regression line using ggplot2
p <- ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point(alpha = 0.5) +  # Adding transparency to points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Add linear model without confidence interval
  labs(title = 'Relationship Between Total Bill and Tip',
       x = 'Total Bill ($)',
       y = 'Tip ($)') +
  theme_minimal()  # Adds a minimalistic theme to the plot

# Print the plot
print(p)

Example of a Multivariate Barchart Code using the Titanic Dataset:

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(titanic)  # This package contains the Titanic dataset

# Load the 'titanic' dataset
data(titanic_train, package = "titanic")
titanic <- titanic_train

# Display the first few rows of the dataset to understand its structure
print(head(titanic))

# Drop rows with missing values to avoid errors during plotting, specifically in 'Age' and 'Embarked' columns
titanic <- titanic %>% 
  filter(!is.na(Age) & !is.na(Embarked))

# Prepare the data by summarizing survival counts
titanic_summary <- titanic %>%
  group_by(Pclass, Sex) %>%
  summarise(Survival_Count = sum(Survived), .groups = "drop")

# Create a multivariate bar chart using ggplot2 (Pclass treated as a factor so classes are discrete)
p <- ggplot(titanic_summary, aes(x = factor(Pclass), y = Survival_Count, fill = Sex)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(title = 'Survival Counts by Passenger Class and Sex',
       x = 'Passenger Class',
       y = 'Number of Survivors') +
  theme_minimal()  # Clean and minimal theme

# Print the plot
print(p)

Example of Barchart Code using Mtcars Dataset

# Load necessary library
library(ggplot2)
library(dplyr)

# Use the built-in 'mtcars' dataset
data("mtcars")

# Prepare the data by summarizing the average miles per gallon by the number of cylinders
cyl_mpg <- mtcars %>%
  group_by(cyl) %>%
  summarise(Average_MPG = mean(mpg), .groups = "drop")

# Create a simple bar chart using ggplot2
p <- ggplot(cyl_mpg, aes(x = factor(cyl), y = Average_MPG, fill = factor(cyl))) +
  geom_col(show.legend = FALSE) +  # Using geom_col for already summarized data
  labs(title = 'Average MPG by Number of Cylinders',
       x = 'Number of Cylinders',
       y = 'Average Miles Per Gallon (MPG)') +
  theme_minimal() +  # Using a clean minimalistic theme
  scale_fill_brewer(palette = "Pastel1")  

# Print the plot
print(p)

Example of Histogram Code using Diamonds Dataset

# Load necessary library
library(ggplot2)

# Use the 'diamonds' dataset included with ggplot2
data("diamonds")

# Create a histogram of diamond prices using ggplot2
p <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, fill = "steelblue", color = "white") +  # 500-dollar bins
  labs(title = 'Distribution of Diamond Prices',
       x = 'Price (USD)',
       y = 'Count') +
  theme_minimal()  # Clean and minimal theme

# Print the plot
print(p)

Example of Pie Chart Code using Titanic Dataset

# Load necessary libraries
library(ggplot2)
library(dplyr)

# The Titanic dataset in R is formatted as a multi-dimensional table
data("Titanic")
titanic <- as.data.frame(Titanic)  # Convert the table to a data frame for easier use with ggplot2

# Summarize the data to count passengers in each class
titanic_summary <- titanic %>%
  group_by(Class) %>%
  summarise(PassengerCount = sum(Freq), .groups = 'drop')  # Summing up the 'Freq' column to get total counts per class

# Create a pie chart using ggplot2 by converting a bar chart to polar coordinates
p <- ggplot(titanic_summary, aes(x = "", y = PassengerCount, fill = Class)) +
  geom_bar(stat = "identity", width = 1) +  # Create a bar chart with width = 1 to fit a pie chart
  coord_polar(theta = "y") +  # Convert the bar chart to pie chart using polar coordinates
  labs(title = "Distribution of Passengers by Class", x = NULL, y = NULL) +  
  theme_void() +  # Remove background, gridlines, and text
  scale_fill_brewer(palette = "Pastel1")  # Color setup

# Print the pie chart
print(p)

Example of Violin Plot Code Using Iris Data

# Load necessary library
library(ggplot2)

# Use the built-in 'iris' dataset
data("iris")

# Create an enhanced violin plot using ggplot2
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_violin(trim = FALSE) +  # Create violin plot with full range of data
  stat_summary(fun = median, geom = "point", color = "white", size = 3, shape = 23) +  # Add median as point
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1), geom = "errorbar", width = 0.2, color = "darkred") +  # Add mean and standard deviation
  labs(title = 'Distribution of Sepal Lengths Across Iris Species',
       x = 'Species',
       y = 'Sepal Length (cm)') +
  theme_minimal() +  # Apply a minimal theme
  scale_fill_brewer(palette = "Pastel1")  # Color setup for the violins

# Print the enhanced plot
print(p)
