Pandas Profiling is a library that generates reports from a pandas DataFrame. The df.describe() function that we normally use in Pandas is great, but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with a df.profile_report() method for quick data analysis.
The following statistics for each column are presented in an interactive report: inferred type, unique and missing values, quantile and descriptive statistics, most frequent values, histograms, and correlations.
For this notebook we will be working on the dataset found in the following link: Meteorite landings
This comprehensive dataset from The Meteoritical Society contains information on all known meteorite landings. It is interesting to observe the places on Earth where these objects have fallen, following the coordinates in the dataset.
Meteorite fall map
Let's start by importing the dataset, in order to get familiar with the data we will work with.
import pandas as pd
We have saved the Meteorite Landings dataset in the 'datasets' folder of the current working environment, so we point the import at that path:
df = pd.read_csv("datasets/Meteorite_Landings.csv")
And now we check the data
df.head()
df.shape
It is a very interesting dataset, where we can observe the name the scientists gave to each meteorite, its class (recclass), its mass in grams (mass (g)), the date it fell, and the coordinates where it fell.
It is also important to note that this is a very complete dataset, with 45,716 records and 10 columns. This information is given by the .shape attribute.
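As a complementary quick check (plain pandas, not required for what follows), df.info() lists each column with its dtype and non-null count:
# dtype and non-null count for each of the 10 columns
df.info()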
For more information about the dataset you can enter here: Meteorite - Nasa
Now, as we mentioned at the beginning, with Pandas we have gotten used to running the .describe() command to generate a descriptive analysis of the dataset in question. Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types, and the result varies depending on what is provided. For more information about .describe() and the parameters we can pass, you can find the info here: Pandas describe
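For example, describe() accepts an include parameter; a quick sketch using standard pandas options:
# include='all' also summarizes object columns (count, unique, top, freq)
df.describe(include='all')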
Now let's run this command over our dataset and see the result
df.describe()
The describe() method skips the categorical (string) columns and performs a descriptive statistical analysis on the numerical columns. Here we can see that the id column is probably not useful for this analysis, since it is just a unique identifier for each row (the table's primary key), while mass (g) is useful and interesting to understand, for example, the minimum and maximum values, the mean, and the percentiles (25, 50, 75).
As we can see, this is a very basic analysis, without much further relevant information; if we want more, we have to start writing code. This is where Pandas profiling comes in and shows its usefulness. The documentation of this library can be found in the following link: Pandas profiling. The installation is done as follows:
!pip3 install 'pandas-profiling[notebook,html]'
Note that the package name with the notebook,html extras must be passed in quotes; we need these two extras because we will use both the notebook widgets and the HTML export features of the library.
If you are using Conda, these are the other ways to install it: Installing Pandas profiling
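For reference, the usual conda-forge command looks like this (check the link above for the exact channel and package name for your setup):
# install from the conda-forge channel (-y skips the confirmation prompt)
!conda install -c conda-forge -y pandas-profiling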
Now we are going to create a series of columns that will be relevant to the analysis we will do with Pandas Profiling. The first step is to create a constant column, with the same value for every record; in this case we will say that all the records belong to NASA:
df['source'] = "NASA"
df.head()
As we can see, the column was indeed created. Now we are going to create a boolean column with random values, simulating some kind of boolean output for each record.
Remember that this is done so that our exploratory analysis can identify this type of data in the result.
# we import numpy; it is usually installed together with Pandas. If you don't
# have it, you can install it with the `pip3 install numpy` command
import numpy as np
# numpy is going to help us create those random booleans in the next line of code
df['boolean'] = np.random.choice([True, False], df.shape[0])
df.head()
As we see, the boolean column was created with a random value of True or False for each row of our dataset. This is thanks to df.shape[0], which gives the number of rows in the dataset; in other words, the random choice was made 45,716 times, once per record.
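We can sanity-check the random split with a quick count:
# roughly half of the 45,716 rows should be True and half False
df['boolean'].value_counts()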
Let's now do something similar, but mixing numerical and categorical (string) data types:
df['mixed'] = np.random.choice([1, 'A'], df.shape[0])
df.head()
As we can see, here we are simulating a column with two types of data mixed together, numerical and categorical. This is something we can find in real datasets, and Pandas' describe() will simply ignore the column and give us no analysis of it (remember that describe() only reports on numerical columns by default; it even ignores boolean columns too).
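A quick way to confirm why it is ignored: mixing types forces the column to dtype object, which describe() excludes by default:
# the mixed column is stored as dtype 'object'
print(df['mixed'].dtype)
# and the values are roughly half 1s and half 'A's
df['mixed'].value_counts()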
Now let's do something even more interesting. We are going to create a new column that simulates a high correlation with an existing column. In particular, we will base it on the reclat column, the latitude where the meteorite fell, and add normally distributed noise with a standard deviation of 5 and a sample size equal to the length of the dataset.
If you want to see how to create a simulation of a normal distribution with random numbers with Numpy, check this link. Random normal numpy
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=(len(df)))
df.head()
Checking the result of the last command, we can see that this new reclat_city column is now highly correlated with reclat: when one observation is positive the other tends to be positive too, and when one is negative, so is the other.
To analyze correlations with Pandas we use a different method from describe(); in this case we use the corr() method. With Pandas profiling, however, we will obtain both analyses (descriptive statistics and correlations) with a single command, as we will see in a moment when we run our exploratory analysis.
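If you want to verify the claim with plain pandas before running the report, a quick check looks like this:
# Pearson correlation between the original latitude and the simulated one;
# the off-diagonal coefficient should be close to 1
df[['reclat', 'reclat_city']].corr()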
Remember that for now what we are doing is adding columns to the dataframe in order to see all the possibilities offered by the Pandas profiling tool.
We are now going to simulate another scenario that is common in real datasets: duplicate observations or rows. We will do it like this:
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
duplicates_to_add
What we just did was create a new DataFrame from the first 10 rows of our original one. To do this we used iloc, which selects rows by position, together with a slice that selects from row 0 to row 9, since the end of the slice is exclusive.
Now let's change the names so we can identify the copies later; the other values remain the same:
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
duplicates_to_add
If we look, all the names now have the word 'copy' at the end. This new DataFrame is ready to be concatenated to the original dataset, so that we end up with duplicated data. Let's do the append now:
# DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, duplicates_to_add], ignore_index=True)
df.head()
df.shape
The original dataset contains 45,716 rows; now we have 10 more, which are the duplicate rows. Since ignore_index=True appended them at the end, df.tail() would show them.
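Since we renamed the copies, an exact duplicated() check would miss them; a small sketch that ignores the name column can surface them (just a sanity check, not part of the profiling workflow):
# compare all columns except 'name', and keep every occurrence of a duplicate
cols = [c for c in df.columns if c != 'name']
df[df.duplicated(subset=cols, keep=False)].head()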
Now the moment we have been waiting for has arrived: we have added columns to the dataset that will let us see some interesting analyses. But before that, to be fair to Pandas' describe(), let's see what analysis it gives us on the resulting dataset:
df.describe()
As we see, there is very little difference: it still gives us no additional information about the boolean and mixed columns, the duplicate rows, or the correlations between columns.
This is where Pandas profiling shines, thanks to the simplicity with which it performs an exploratory analysis of our datasets. Without further ado, let's run the following commands:
## we already have the library installed, now we need to import it
import pandas_profiling
from pandas_profiling.utils.cache import cache_file  # only needed to download the library's example datasets; unused here
## now we run the report
report = df.profile_report(sort='None', html={'style':{'full_width':True}})
report
The output speaks for itself. Compared with Pandas' describe() or even corr(), the difference is quite significant: from the very beginning we can observe a lot of additional data and analysis that help us better interpret the dataset we are working with. Let's look, for example, at the columns we recently added:
- Duplicate rows: the report warns that the dataset has 10 duplicate rows.
- boolean: the column is detected as BOOL, with its True/False distribution.
- reclat_city: it is flagged as highly correlated with reclat ("High correlation").
- mixed: we can see the analysis of the randomly generated values.
- Interactions: we can see the different types of plots and the correlations between variables.
Finally, we might want this report in a format other than a Jupyter Notebook. The library offers the possibility of exporting the report to html, which is useful for showing it in a friendlier environment for the end user, one where you can even interact with it through the navigation bars.
report.to_file(output_file="report_eda.html")
In this case, the file was saved to the current working directory.
If we click on it, it opens in the browser. Personally, I like this format quite a lot, since it leaves the code out of the way while still letting you navigate through the analysis, show it to the stakeholders interested in it, and make decisions based on it.
As you can see, the tool is very easy to use, and it is a natural first step before starting feature engineering and/or predictions. However, there is one disadvantage that is important to take into account: on large datasets the report can take a long time to generate. A common way to mitigate this is to profile a random sample of the data:
data = df.sample(n=1000)
data.head()
len(data)
As we can see, 1,000 rows have been selected at random, so the analysis will be run on them rather than on the full 45,000+ rows. If we had, say, 1,000,000 rows, the difference in performance would be significant, so this is a good practice.
profile_in_sample = data.profile_report(sort='None', html={'style':{'full_width':True}})
profile_in_sample
As we see, it takes much less time to run with a sample of 1,000 records.
Another option the library offers is minimal mode, which generates a simplified report by disabling the most expensive computations:
profile_min = data.profile_report(minimal=True)
profile_min
As we can see, this report is faster to generate but contains less information about the exploratory analysis of the data. We leave it up to you to decide which type of report you want to generate. If you want to see more advanced features of the library, please go to the following link: Advanced Pandas profiling
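As a small taste of those options (parameter names taken from the pandas-profiling documentation; availability may depend on your installed version), the report also accepts a title and an explorative mode:
# a titled report; explorative=True enables deeper type analysis
# (available from pandas-profiling 2.9 onwards)
report_custom = df.profile_report(title="Meteorite Landings EDA", explorative=True)
report_custom.to_file("report_eda_custom.html")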
We hope you liked this notebook, see you in the next one!