Pandas Profiling is a library that generates reports from a pandas DataFrame. The df.describe() function that we normally use in Pandas is great, but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with a df.profile_report() method for quick data analysis.
The following statistics for each column are presented in an interactive report: inferred type, unique and missing values, quantile and descriptive statistics, most frequent values, histograms, and correlations.
For this notebook we will be working on the dataset found in the following link: Meteorite landings
This comprehensive dataset from The Meteoritical Society contains information on all known meteorite landings. It is interesting to observe the places on Earth where these objects have fallen, following the coordinates in the dataset.
Meteorite fall map
Let's start by importing the dataset, in order to get familiar with the data we will work with.
import pandas as pd
We have saved the Meteorite Landings dataset in the 'datasets' folder of the current working environment, so we point the import at that path:
df = pd.read_csv("datasets/Meteorite_Landings.csv")
And now we check the data
df.head()
df.shape
It is a very interesting dataset, where we can observe the name the scientists gave to each meteorite, its class (recclass), its mass in grams (mass (g)), the date it fell, and the coordinates where it fell.
It is also important to note that this is a very complete dataset, with 45,716 records and 10 columns. This information is given by the .shape attribute.
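As a complementary quick check (plain pandas, not required for what follows), df.info() lists each column with its dtype and non-null count:
# dtype and non-null count for each of the 10 columns
df.info()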
For more information about the dataset you can enter here: Meteorite - Nasa
Now, as we mentioned at the beginning, with Pandas we have gotten used to running the .describe() command to generate a descriptive analysis of the dataset in question. Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types, and the result varies depending on what is provided. For more information about .describe() and the parameters we can pass, you can find the info here: Pandas describe
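For example, describe() accepts an include parameter; a quick sketch using standard pandas options:
# include='all' also summarizes object columns (count, unique, top, freq)
df.describe(include='all')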
Now let's run this command over our dataset and see the result
df.describe()
The describe() method skips the categorical (string) columns and performs a descriptive statistical analysis on the numerical columns. Here we can see that the id column is probably not useful for this analysis, since it is just a unique identifier for each row (the table's primary key), while mass (g) is useful and interesting to understand, for example, the minimum and maximum values, the mean, and the percentiles (25, 50, 75).
As we can see, this is a very basic analysis, without much further relevant information; if we want more, we have to start writing code. This is where Pandas profiling comes in and shows its usefulness. The documentation of this library can be found in the following link: Pandas profiling. The installation is done as follows:
!pip3 install 'pandas-profiling[notebook,html]'
Note that the package name with the notebook,html extras must be passed in quotes; we need these two extras because we will use both the notebook widgets and the HTML export features of the library.
If you are using Conda, these are the other ways to install it: Installing Pandas profiling
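For reference, the usual conda-forge command looks like this (check the link above for the exact channel and package name for your setup):
# install from the conda-forge channel (-y skips the confirmation prompt)
!conda install -c conda-forge -y pandas-profiling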
Now we are going to create a series of columns that will be relevant to the analysis we will do with Pandas Profiling. The first step is to create a constant column, with the same value for every record; in this case we will say that all the records belong to NASA:
df['source'] = "NASA"
df.head()
As we can see, the column was indeed created. Now we are going to create a boolean column with random values, simulating some kind of boolean output for each record.
Remember that this is done so that our exploratory analysis can identify this type of data in the result.
# we import numpy; it is usually installed together with Pandas. If you don't
# have it, you can install it with the `pip3 install numpy` command
import numpy as np
# numpy is going to help us create those random booleans in the next line of code
df['boolean'] = np.random.choice([True, False], df.shape[0])
df.head()
As we see, the boolean column was created with a random value of True or False for each row of our dataset. This is thanks to df.shape[0], which gives the number of rows in the dataset; in other words, the random choice was made 45,716 times, once per record.
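We can sanity-check the random split with a quick count:
# roughly half of the 45,716 rows should be True and half False
df['boolean'].value_counts()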
Let's now do something similar, but mixing numerical and categorical (string) data types:
df['mixed'] = np.random.choice([1, 'A'], df.shape[0])
df.head()
As we can see, here we are simulating a column with two types of data mixed together, numerical and categorical. This is something we can find in real datasets, and Pandas' describe() will simply ignore the column and give us no analysis of it (remember that describe() only reports on numerical columns by default; it even ignores boolean columns too).
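A quick way to confirm why it is ignored: mixing types forces the column to dtype object, which describe() excludes by default:
# the mixed column is stored as dtype 'object'
print(df['mixed'].dtype)
# and the values are roughly half 1s and half 'A's
df['mixed'].value_counts()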
Now let's do something even more interesting. We are going to create a new column that simulates a high correlation with an existing column. In particular, we will base it on the reclat column, the latitude where the meteorite fell, and add normally distributed noise with a standard deviation of 5 and a sample size equal to the length of the dataset.
If you want to see how to create a simulation of a normal distribution with random numbers with Numpy, check this link. Random normal numpy
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=(len(df)))
df.head()
Checking the result of the last command, we can see that this new reclat_city column is now highly correlated with reclat: when one observation is positive the other tends to be positive too, and when one is negative, so is the other.
To analyze correlations with Pandas we use a different method from describe(); in this case we use the corr() method. With Pandas profiling, however, we will obtain both analyses (descriptive statistics and correlations) with a single command, as we will see in a moment when we run our exploratory analysis.
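If you want to verify the claim with plain pandas before running the report, a quick check looks like this:
# Pearson correlation between the original latitude and the simulated one;
# the off-diagonal coefficient should be close to 1
df[['reclat', 'reclat_city']].corr()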
Remember that for now what we are doing is adding columns to the dataframe in order to see all the possibilities offered by the Pandas profiling tool.
We are now going to simulate another scenario that is common in real datasets: duplicate observations or rows. We will do it like this:
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
duplicates_to_add
What we just did was create a new DataFrame from the first 10 rows of our original one. To do this we used iloc, which selects rows by position, together with a slice that selects from row 0 to row 9, since the end of the slice is exclusive.
Now let's change the names so we can identify the copies later; the other values remain the same:
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
duplicates_to_add
If we look, all the names now have the word 'copy' at the end. This new DataFrame is ready to be concatenated to the original dataset, so that we end up with duplicated data. Let's do the append now:
# DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, duplicates_to_add], ignore_index=True)
df.head()
df.shape
The original dataset contains 45,716 rows; now we have 10 more, which are the duplicate rows. Since ignore_index=True appended them at the end, df.tail() would show them.
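Since we renamed the copies, an exact duplicated() check would miss them; a small sketch that ignores the name column can surface them (just a sanity check, not part of the profiling workflow):
# compare all columns except 'name', and keep every occurrence of a duplicate
cols = [c for c in df.columns if c != 'name']
df[df.duplicated(subset=cols, keep=False)].head()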
Now the moment we have been waiting for has arrived: we have added columns to the dataset that will let us see some interesting analyses. But before that, to be fair to Pandas' describe(), let's see what analysis it gives us on the resulting dataset:
df.describe()
As we see, there is very little difference: it still gives us no additional information about the boolean and mixed columns, the duplicate rows, or the correlations between columns.
This is where Pandas profiling shines, thanks to the simplicity with which it performs an exploratory analysis of our datasets. Without further ado, let's run the following commands:
## we already have the library installed, now we need to import it
import pandas_profiling
from pandas_profiling.utils.cache import cache_file  # only needed to download the library's example datasets; unused here
## now we run the report
report = df.profile_report(sort='None', html={'style':{'full_width':True}})
report
The output speaks for itself. Compared with Pandas' describe() or even corr(), the difference is quite significant: from the very beginning we can observe a lot of additional data and analysis that help us better interpret the dataset we are working with. Let's look, for example, at the columns we recently added:
- Duplicate rows: the report warns that the dataset has 10 duplicate rows.
- boolean: the column is detected as BOOL, with its True/False distribution.
- reclat_city: it is flagged as highly correlated with reclat ("High correlation").
- mixed: we can see the analysis of the randomly generated values.
- Interactions: we can see the different types of plots and the correlations between variables.
Finally, we might want this report in a format other than a Jupyter Notebook. The library offers the possibility of exporting the report to html, which is useful for showing it in a friendlier environment for the end user, one where you can even interact with it through the navigation bars.
report.to_file(output_file="report_eda.html")
In this case, the file was saved to the current working directory.
If we click on it, it opens in the browser. Personally, I like this format quite a lot, since it leaves the code out of the way while still letting you navigate through the analysis, show it to the stakeholders interested in it, and make decisions based on it.
As you can see, the tool is very easy to use, and it is a natural first step before starting feature engineering and/or predictions. However, there is one disadvantage that is important to take into account: on large datasets the report can take a long time to generate. A common way to mitigate this is to profile a random sample of the data:
data = df.sample(n=1000)
data.head()
len(data)
As we can see, 1,000 rows have been selected at random, so the analysis will be run on them rather than on the full 45,000+ rows. If we had, say, 1,000,000 rows, the difference in performance would be significant, so this is a good practice.
profile_in_sample = data.profile_report(sort='None', html={'style':{'full_width':True}})
profile_in_sample
As we see, it takes much less time to run with a sample of 1,000 records.
Another option the library offers is minimal mode, which generates a simplified report by disabling the most expensive computations:
profile_min = data.profile_report(minimal=True)
profile_min
As we can see, this report is faster to generate but contains less information about the exploratory analysis of the data. We leave it up to you to decide which type of report you want to generate. If you want to see more advanced features of the library, please go to the following link: Advanced Pandas profiling
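As a small taste of those options (parameter names taken from the pandas-profiling documentation; availability may depend on your installed version), the report also accepts a title and an explorative mode:
# a titled report; explorative=True enables deeper type analysis
# (available from pandas-profiling 2.9 onwards)
report_custom = df.profile_report(title="Meteorite Landings EDA", explorative=True)
report_custom.to_file("report_eda_custom.html")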
We hope you liked this notebook, see you in the next one!