This is the beginning of a series on Exploratory Data Analysis.
Defining EDA
EDA is an acronym for Exploratory Data Analysis. As the name suggests, it's all about exploring your data using simple diagrams and summary statistics. EDA features visualizations like scatter plots, box plots, correlation matrices, and summary values such as mean, median, interquartile range, and much much more! The goal of EDA is to give you a visual overview of your dataset before you jump into predictive modeling. As such, it's a process that typically takes place at the beginning stages of a machine learning project.
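To make the summary values mentioned above a little more concrete, here is a minimal sketch that computes a mean, a median, and an interquartile range. The prices are made up for illustration, and it assumes the pandas library is installed.

```python
import pandas as pd

# Hypothetical dog food prices in dollars (made-up values for illustration).
prices = pd.Series([4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 21.0], name="price")

mean_price = prices.mean()
median_price = prices.median()

# Interquartile range: the spread of the middle 50% of the values.
iqr = prices.quantile(0.75) - prices.quantile(0.25)

print(f"mean={mean_price:.2f} median={median_price:.2f} IQR={iqr:.2f}")
```

Notice that the single expensive item pulls the mean above the median. Spotting that kind of gap is exactly what these summaries are for.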
Some other definitions, because I care
Hold on! What's a dataset? What's predictive modeling?
I'm glad you asked.
A dataset is just a collection of all the data you are going to use in a machine learning project, usually arranged in a tabular structure (more on that later).
Predictive modeling is using statistics to predict the outcome of an unknown event. This could be a future event, like the price of a certain type of dog food at a store in a certain area next month, or it could even be an unknown past event (for example, predicting who most likely committed a crime).
Motivation for EDA
You'll want to do EDA at the beginning of the project because your discoveries during this process will guide what models you build with the data. However, you can return to EDA at other points during your project cycle, especially when your model isn't giving you the results you expect. If you're getting strange results, it can be a good idea to take another look at your data (to see if there are any messy outliers, for example), because a machine learning model is only as good as the data that goes into it.
Defining Structured Data
The types of EDA I will be discussing in this and future blog posts work on structured data, meaning data consisting of rows and columns. Think of a table, an Excel spreadsheet, or a dataframe from the Python pandas library (I will have more to say about pandas in later articles). Since it's arranged in a table, this type of data is usually called tabular.
Unstructured data, on the other hand, is data like the pixels making up an image, or the sounds making up a stream of speech. These things cannot be used directly for the EDA techniques described in this article. We can't get a meaningful average from raw pixel values, for example.
Types of structured data
There are 2 kinds of structured data:
- numerical data consists of numbers
  - for example: data points that are prices of dog food
- categorical data has a fixed number of possible categories
  - for example: data points that are brands of dog food
There are 2 kinds of numerical data:
- continuous numerical data, which you can visualize as data on a number line
  - for example: prices, which can be $1, $5, $10, and any of the numbers in between
- discrete numerical data, which can only be whole, concrete numbers
  - for example: the exact number of dogs that eat a certain type of dog food. We can't have half a dog!
And there are 2 kinds of categorical data:
- binary/boolean data is a special kind of categorical data with just 2 possible options:
  - 0 or 1
  - true or false
  - yes or no
  - for example: whether the dog food is made for a particular dietary restriction
- ordinal data is categorical data that has a specific ordering
  - for example: customer satisfaction ratings of dog food on a scale of 1-5
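As a rough sketch of how the data types listed above might look in practice, here is one way to represent each of them in a pandas dataframe. The column names and values are hypothetical, and it assumes pandas is installed. This is just one reasonable way to encode these types, not the only one.

```python
import pandas as pd

# Hypothetical dog food data; every value here is made up for illustration.
df = pd.DataFrame({
    "price": [4.99, 12.50, 8.25],          # continuous numerical
    "dogs_fed": [3, 1, 2],                 # discrete numerical
    "brand": ["Acme", "Barkly", "Acme"],   # categorical
    "grain_free": [True, False, True],     # binary/boolean
    "rating": [5, 2, 4],                   # ordinal, on a 1-5 scale
})

# Tell pandas that brand is a category rather than free text.
df["brand"] = df["brand"].astype("category")

# Tell pandas that rating is an ordered category: 1 < 2 < 3 < 4 < 5.
rating_type = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df["rating"] = df["rating"].astype(rating_type)

print(df.dtypes)
```

Marking `rating` as ordered keeps the information that a 5 is better than a 2, while `brand` stays unordered because no brand comes "before" another.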
Why is it important to categorize your data?
After all, at some point, all the data is going to be turned into numbers when you feed it into the computer program, so why go to all this trouble of categorizing it beforehand?
The type of data will help you choose what kind of analysis to use on it. It will also help the computer program you feed the data into determine how to process it. For example, it can guide what kind of visual display you want for your data (scatter plot or box plot?) and what kind of predictive modeling to use (linear regression or decision tree?).
Machine learning and data science libraries, like scikit-learn and pandas in Python, have special functions that operate on certain types of data but not others, so you will need to know what type of data you are dealing with to use these technologies successfully. For example, some functions need to know whether categorical data is ordinal (ordered) or not.
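Here is a small sketch of what "functions that operate on certain types of data" can look like in pandas. The data is made up, and it assumes pandas is installed. It uses `select_dtypes` to split columns by type, then applies a summary that suits each type.

```python
import pandas as pd

# Hypothetical data: one numerical column and one categorical column.
df = pd.DataFrame({
    "price": [4.99, 12.50, 8.25],
    "brand": pd.Categorical(["Acme", "Barkly", "Acme"]),
})

# select_dtypes picks columns by type, so each group can get its own analysis.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="category").columns.tolist()

# A mean makes sense for prices...
mean_price = df["price"].mean()
# ...while counting occurrences is the natural summary for brands.
brand_counts = df["brand"].value_counts()

print(numeric_cols, categorical_cols)
print(mean_price)
print(brand_counts)
```

Averaging the brand column would not mean anything, which is why it helps to know each column's type before choosing a summary.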
Tabular/Rectangular data
Tabular data, also known as rectangular data, refers to a data object arranged in rows and columns. A data object can take different forms depending on the technology you are working with. It could be in the form of an Excel spreadsheet, a database table, or a pandas dataframe. All these technologies have this in common: they describe a two-dimensional matrix in which the rows are cases and the columns are features.
A quick note: if you are starting with unstructured data, it won't be in this rectangular form, and may need to be processed before EDA is possible. (Alternatively, you may choose different approaches for your EDA, other than those discussed here.)
Non-rectangular data structures include time series, spatial data structures, and graph/network data structures, but I won't be discussing these types of data in this series.
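To see the rectangular shape for yourself, here is a minimal sketch with a pandas dataframe, where each row is one case and each column is one feature. The values are hypothetical, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical rectangular data: 3 cases (rows) by 3 features (columns).
df = pd.DataFrame({
    "brand": ["Acme", "Barkly", "Acme"],
    "price": [4.99, 12.50, 8.25],
    "grain_free": [True, False, True],
})

# shape is a (rows, columns) pair.
n_cases, n_features = df.shape
print(f"{n_cases} cases (rows), {n_features} features (columns)")

# The column labels are the feature names.
feature_names = list(df.columns)
print(feature_names)

# A single row holds every feature for one case.
first_case = df.iloc[0]
print(first_case)
```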
Further reading
The above article was adapted from my notes on part of Chapter 1 of Practical Statistics for Data Scientists by Peter Bruce, et al. So far I like this book because the writing style is straightforward and concise, yet doesn't skimp on technical terminology. I have not finished the book, but based on what I've seen so far and its reviews, it's a good one to read if you want further information on this topic.
Here's a link to the Google Books preview, if you're interested.
Come back in two weeks to hear more about EDA! I will be discussing different options for summary statistics.