LibGuides: Data Exploration: Other Resources

External Links

This section contains some links to external sources covering methods and tools that can be used for exploratory data analysis.

What is Exploratory Data Analysis? [via GeeksforGeeks.org]

A thorough overview of EDA, covering its purpose and approaches along with the relevant R and Python libraries. A complete step-by-step guide leads researchers through the process.

Exploratory Data Analysis [via the Environmental Protection Agency]

This page gives a good overview of the tools and methods used in EDA, focusing especially on data visualization.

How to Conduct an Effective Exploratory Data Analysis (EDA) [via Medium - requires login] by Richard Warepam (2 Nov. 2023)

This page gives a comprehensive overview of EDA in Python together with some initial pre-processing techniques.

What is exploratory data analysis (EDA)? [via IBM]

A succinct yet comprehensive overview of EDA concepts and approaches.

Articles

Here are a few articles focusing on various aspects and applications of exploratory data analysis:

Dey, S. K., Rahman, Md. M., Siddiqi, U. R., & Howlader, A. (2020). Analyzing the epidemiological outbreak of COVID‐19: A visual exploratory data analysis approach. Journal of Medical Virology, 92(6), 632–638. https://doi.org/10.1002/jmv.25743

Abstract:

There is an obvious concern globally regarding the fact about the emerging coronavirus 2019 novel coronavirus (2019‐nCoV) as a worldwide public health threat. As the outbreak of COVID‐19 causes by the severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) progresses within China and beyond, rapidly available epidemiological data are needed to guide strategies for situational awareness and intervention. The recent outbreak of pneumonia in Wuhan, China, caused by the SARS‐CoV‐2 emphasizes the importance of analyzing the epidemiological data of this novel virus and predicting their risks of infecting people all around the globe. In this study, we present an effort to compile and analyze epidemiological outbreak information on COVID‐19 based on the several open datasets on 2019‐nCoV provided by the Johns Hopkins University, World Health Organization, Chinese Center for Disease Control and Prevention, National Health Commission, and DXY. An exploratory data analysis with visualizations has been made to understand the number of different cases reported (confirmed, death, and recovered) in different provinces of China and outside of China. Overall, at the outset of an outbreak like this, it is highly important to readily provide information to begin the evaluation necessary to understand the risks and begin containment activities.

Hsieh, F., Chou, E. P., & Chen, T.-L. (2021). Mimicking Complexity of Structured Data Matrix’s Information Content: Categorical Exploratory Data Analysis. Entropy, 23(5), 594. https://doi.org/10.3390/e23050594

Abstract:
We develop Categorical Exploratory Data Analysis (CEDA) with mimicking to explore and exhibit the complexity of information content that is contained within any data matrix: categorical, discrete, or continuous. Such complexity is shown through visible and explainable serial multiscale structural dependency with heterogeneity. CEDA is developed upon all features’ categorical nature via histogram and it is guided by all features’ associative patterns (order-2 dependence) in a mutual conditional entropy matrix. Higher-order structural dependency of k(≥ 3) features is exhibited through block patterns within heatmaps that are constructed by permuting contingency-kD-lattices of counts. By growing k, the resultant heatmap series contains global and large scales of structural dependency that constitute the data matrix’s information content. When involving continuous features, the principal component analysis (PCA) extracts fine-scale information content from each block in the final heatmap. Our mimicking protocol coherently simulates this heatmap series by preserving global-to-fine scales structural dependency. Upon every step of mimicking process, each accepted simulated heatmap is subject to constraints with respect to all of the reliable observed categorical patterns. For reliability and robustness in sciences, CEDA with mimicking enhances data visualization by revealing deterministic and stochastic structures within each scale-specific structural dependency. For inferences in Machine Learning (ML) and Statistics, it clarifies, upon which scales, which covariate feature-groups have major-vs.-minor predictive powers on response features. For the social justice of Artificial Intelligence (AI) products, it checks whether a data matrix incompletely prescribes the targeted system.

Goloborodko, A. A., Levitsky, L. I., Ivanov, M. V., & Gorshkov, M. V. (2013). Pyteomics—a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics. Journal of the American Society for Mass Spectrometry, 24(2), 301–304. https://doi.org/10.1007/s13361-012-0516-6

Abstract:
Pyteomics is a cross-platform, open-source Python library providing a rich set of tools for MS-based proteomics. It provides modules for reading LC-MS/MS data, search engine output, protein sequence databases, theoretical prediction of retention times, electrochemical properties of polypeptides, mass and m/z calculations, and sequence parsing. Pyteomics is available under Apache license; release versions are available at the Python Package Index http://pypi.python.org/pyteomics, the source code repository at http://hg.theorchromo.ru/pyteomics, documentation at http://packages.python.org/pyteomics. Pyteomics.biolccc documentation is available at http://packages.python.org/pyteomics.biolccc/. Questions on installation and usage can be addressed to pyteomics mailing list: pyteomics@googlegroups.com

EDA Videos

The following videos provide overviews of EDA methods:

Python Series [YouTube - Data Science with Onur]

A series of videos on different aspects of EDA using Python in various application domains.

Learn Exploratory Data Analysis (EDA) in Python [YouTube - Mark Keith]

A comprehensive introduction to EDA using Python. Topics covered include univariate analysis, correlations, p-values, heteroskedasticity, t-tests, and more.