This tutorial covers the entire ML process, from data ingestion and pre-processing to model training, hyper-parameter tuning, prediction and storing the model for later use.
We will complete all these steps in less than 10 commands that are naturally constructed and very intuitive to remember, such as create_model(), tune_model(), compare_models(), plot_model(), evaluate_model() and predict_model().
Let's see the whole picture.
Recreating the entire experiment without PyCaret requires more than 100 lines of code in most libraries. The library also allows you to do more advanced things, such as advanced pre-processing, ensembling, generalized stacking, and other techniques that allow you to fully customize the ML pipeline and are a must for any data scientist.
PyCaret is an open-source, low-code library for ML with Python that allows you to go from preparing your data to deploying your model in minutes. It allows data scientists and analysts to perform iterative data science experiments from start to finish efficiently and to reach conclusions faster, because much less time is spent on programming. The library is very similar to R's caret package, but implemented in Python.
When working on a data science project, it usually takes a long time to understand the data (EDA and feature engineering). So, what if we could cut the time we spend on the modeling part of the project in half?
Let's see how.
First, we need these prerequisites.
Here you can find the library documentation and other resources.
First of all, please run this command: !pip3 install pycaret
For Google Colab users: If you are running this notebook in Google Colab, run the following code at the top of your notebook to display interactive images
from pycaret.utils import enable_colab
enable_colab()
PyCaret is organized according to the task we want to perform and has different modules, one for each type of learning (supervised or unsupervised). For this tutorial, we will work with the supervised learning module, using a binary classification algorithm.
The PyCaret classification module (pycaret.classification) is a supervised machine learning module used to classify elements into a binary group based on various techniques and algorithms. Some common uses of classification problems include predicting client default (yes or no), client abandonment (the client will leave or stay), disease encountered (positive or negative) and so on.
The PyCaret classification module can be used for binary or multi-class classification problems. It has more than 18 algorithms and 14 plots for analyzing model performance. Whether it's hyper-parameter tuning, ensembling or advanced techniques such as stacking, PyCaret's classification module has it all.
Classification models
For this tutorial we will use a UCI data set called Default of Credit Card Clients Dataset. This data set contains information about default payments, demographics, credit data, payment history and billing statements of credit card customers in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 features.
The dataset can be found here. Or here you'll find a direct link to download.
So, download the dataset to your environment, and then load it like this:
import pandas as pd
df = pd.read_csv('datasets/default of credit card clients.csv')
df.head()
Unnamed: 0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X15 | X16 | X17 | X18 | X19 | X20 | X21 | X22 | X23 | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
1 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
2 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
3 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
4 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
5 rows × 25 columns
There is also another way to load it, and it is the one we will use by default in this tutorial: loading the data directly from PyCaret's built-in datasets. This is the first step of our pipeline.
from pycaret.datasets import get_data
dataset = get_data('credit')
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 689.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
1 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331.0 | 14948.0 | 15549.0 | 1518.0 | 1500.0 | 1000.0 | 1000.0 | 1000.0 | 5000.0 | 0 |
2 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314.0 | 28959.0 | 29547.0 | 2000.0 | 2019.0 | 1200.0 | 1100.0 | 1069.0 | 1000.0 | 0 |
3 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940.0 | 19146.0 | 19131.0 | 2000.0 | 36681.0 | 10000.0 | 9000.0 | 689.0 | 679.0 | 0 |
4 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 19394.0 | 19619.0 | 20024.0 | 2500.0 | 1815.0 | 657.0 | 1000.0 | 1000.0 | 800.0 | 0 |
5 rows × 24 columns
#check the shape of data
dataset.shape
(24000, 24)
In order to demonstrate the predict_model() function on unseen data, a sample of 1,200 records from the original dataset has been retained for use in the predictions. This should not be confused with a train/test split, since this particular split is made to simulate a real-life scenario. Another way of thinking about this is that these 1,200 records were not available at the time the ML experiment was performed.
## sample() returns a random sample from an axis of the object; 95% of 24,000 leaves 22,800 samples
data = dataset.sample(frac=0.95, random_state=786)
data
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20534 | 270000 | 2 | 1 | 2 | 34 | 0 | 0 | 2 | 0 | 0 | ... | 44908.0 | 19508.0 | 15860.0 | 4025.0 | 5.0 | 34000.0 | 0.0 | 0.0 | 0.0 | 0 |
6885 | 160000 | 2 | 1 | 2 | 42 | -2 | -2 | -2 | -2 | -2 | ... | 0.0 | 741.0 | 0.0 | 0.0 | 0.0 | 0.0 | 741.0 | 0.0 | 0.0 | 0 |
1553 | 360000 | 2 | 1 | 2 | 30 | 0 | 0 | 0 | 0 | 0 | ... | 146117.0 | 145884.0 | 147645.0 | 6000.0 | 6000.0 | 4818.0 | 5000.0 | 5000.0 | 4500.0 | 0 |
1952 | 20000 | 2 | 1 | 2 | 25 | 0 | 0 | 0 | 0 | 0 | ... | 18964.0 | 19676.0 | 20116.0 | 1700.0 | 1300.0 | 662.0 | 1000.0 | 747.0 | 602.0 | 0 |
21422 | 70000 | 1 | 2 | 2 | 29 | 0 | 0 | 0 | 0 | 0 | ... | 48538.0 | 49034.0 | 49689.0 | 2200.0 | 8808.0 | 2200.0 | 2000.0 | 2000.0 | 2300.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4516 | 130000 | 1 | 3 | 2 | 45 | 0 | 0 | -1 | 0 | -1 | ... | 1261.0 | 390.0 | 390.0 | 1000.0 | 2522.0 | 0.0 | 390.0 | 390.0 | 390.0 | 0 |
8641 | 290000 | 2 | 1 | 2 | 29 | 0 | 0 | 0 | 0 | -1 | ... | -77.0 | 8123.0 | 210989.0 | 1690.0 | 3000.0 | 0.0 | 8200.0 | 205000.0 | 6000.0 | 0 |
6206 | 210000 | 1 | 2 | 1 | 41 | 1 | 2 | 0 | 0 | 0 | ... | 69670.0 | 59502.0 | 119494.0 | 0.0 | 5000.0 | 3600.0 | 2000.0 | 2000.0 | 5000.0 | 0 |
2110 | 550000 | 1 | 2 | 1 | 47 | 0 | 0 | 0 | 0 | 0 | ... | 30000.0 | 0.0 | 0.0 | 10000.0 | 20000.0 | 5000.0 | 0.0 | 0.0 | 0.0 | 0 |
4042 | 200000 | 1 | 1 | 2 | 28 | 0 | 0 | 0 | 0 | 0 | ... | 161221.0 | 162438.0 | 157415.0 | 7000.0 | 8016.0 | 5000.0 | 12000.0 | 6000.0 | 7000.0 | 0 |
22800 rows × 24 columns
# remove the sampled rows from the original dataset to keep the remaining 5% as unseen data
data_unseen = dataset.drop(data.index)
data_unseen
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | 0 | ... | 221.0 | -159.0 | 567.0 | 380.0 | 601.0 | 0.0 | 581.0 | 1687.0 | 1542.0 | 0 |
39 | 380000 | 1 | 2 | 2 | 32 | -1 | -1 | -1 | -1 | -1 | ... | 32018.0 | 11849.0 | 11873.0 | 21540.0 | 15138.0 | 24677.0 | 11851.0 | 11875.0 | 8251.0 | 0 |
57 | 200000 | 2 | 2 | 1 | 32 | -1 | -1 | -1 | -1 | 2 | ... | 5247.0 | 3848.0 | 3151.0 | 5818.0 | 15.0 | 9102.0 | 17.0 | 3165.0 | 1395.0 | 0 |
72 | 200000 | 1 | 1 | 1 | 53 | 2 | 2 | 2 | 2 | 2 | ... | 144098.0 | 147124.0 | 149531.0 | 6300.0 | 5500.0 | 5500.0 | 5500.0 | 5000.0 | 5000.0 | 1 |
103 | 240000 | 1 | 1 | 2 | 41 | 1 | -1 | -1 | 0 | 0 | ... | 3164.0 | 360.0 | 1737.0 | 2622.0 | 3301.0 | 0.0 | 360.0 | 1737.0 | 924.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23978 | 50000 | 1 | 2 | 1 | 37 | 1 | 2 | 2 | 2 | 0 | ... | 2846.0 | 1585.0 | 1324.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 1000.0 | 1000.0 | 1 |
23979 | 220000 | 1 | 2 | 1 | 41 | 0 | 0 | -1 | -1 | -2 | ... | 5924.0 | 1759.0 | 1824.0 | 8840.0 | 6643.0 | 5924.0 | 1759.0 | 1824.0 | 7022.0 | 0 |
23981 | 420000 | 1 | 1 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 141695.0 | 144839.0 | 147954.0 | 7000.0 | 7000.0 | 5500.0 | 5500.0 | 5600.0 | 5000.0 | 0 |
23985 | 90000 | 1 | 2 | 1 | 36 | 0 | 0 | 0 | 0 | 0 | ... | 11328.0 | 12036.0 | 14329.0 | 1500.0 | 1500.0 | 1500.0 | 1200.0 | 2500.0 | 0.0 | 1 |
23999 | 50000 | 1 | 2 | 1 | 46 | 0 | 0 | 0 | 0 | 0 | ... | 36535.0 | 32428.0 | 15313.0 | 2078.0 | 1800.0 | 1430.0 | 1000.0 | 1000.0 | 1000.0 | 1 |
1200 rows × 24 columns
## we reset the index of both datasets
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)
The way we divide our data set is important: there is data that we will not use during the modeling process and will only use at the end to validate our results, simulating real data. The data we do use for modeling is further sub-divided to evaluate two scenarios, training and testing. Therefore, the following split has been made: data (22,800 samples) for modeling and data_unseen (1,200 samples) reserved for final predictions.
Now let's set up the PyCaret environment. The setup() function initializes the environment in PyCaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in PyCaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. Most of the configuration is done automatically, but some parameters can be set manually. For example:
- The default train/test split ratio is 70:30, but it can be changed with "train_size".
- The number of folds for cross-validation is 10 by default.
- "session_id" is our classic "random_state".
## setting up the environment
from pycaret.classification import *
Note: after you run the following command you must press enter to finish the process. We will explain why below. The setup process may take some time to complete.
model_setup = setup(data=data, target='default', session_id=123)
Description | Value | |
---|---|---|
0 | session_id | 123 |
1 | Target | default |
2 | Target Type | Binary |
3 | Label Encoded | 0: 0, 1: 1 |
4 | Original Data | (22800, 24) |
5 | Missing Values | False |
6 | Numeric Features | 14 |
7 | Categorical Features | 9 |
8 | Ordinal Features | False |
9 | High Cardinality Features | False |
10 | High Cardinality Method | None |
11 | Transformed Train Set | (15959, 88) |
12 | Transformed Test Set | (6841, 88) |
13 | Shuffle Train-Test | True |
14 | Stratify Train-Test | False |
15 | Fold Generator | StratifiedKFold |
16 | Fold Number | 10 |
17 | CPU Jobs | -1 |
18 | Use GPU | False |
19 | Log Experiment | False |
20 | Experiment Name | clf-default-name |
21 | USI | 6e18 |
22 | Imputation Type | simple |
23 | Iterative Imputation Iteration | None |
24 | Numeric Imputer | mean |
25 | Iterative Imputation Numeric Model | None |
26 | Categorical Imputer | constant |
27 | Iterative Imputation Categorical Model | None |
28 | Unknown Categoricals Handling | least_frequent |
29 | Normalize | False |
30 | Normalize Method | None |
31 | Transformation | False |
32 | Transformation Method | None |
33 | PCA | False |
34 | PCA Method | None |
35 | PCA Components | None |
36 | Ignore Low Variance | False |
37 | Combine Rare Levels | False |
38 | Rare Level Threshold | None |
39 | Numeric Binning | False |
40 | Remove Outliers | False |
41 | Outliers Threshold | None |
42 | Remove Multicollinearity | False |
43 | Multicollinearity Threshold | None |
44 | Clustering | False |
45 | Clustering Iteration | None |
46 | Polynomial Features | False |
47 | Polynomial Degree | None |
48 | Trignometry Features | False |
49 | Polynomial Threshold | None |
50 | Group Features | False |
51 | Feature Selection | False |
52 | Features Selection Threshold | None |
53 | Feature Interaction | False |
54 | Feature Ratio | False |
55 | Interaction Threshold | None |
56 | Fix Imbalance | False |
57 | Fix Imbalance Method | SMOTE |
When you run setup(), PyCaret's inference algorithm automatically deduces the data types of all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all data types are correctly identified, you can press enter to continue or exit to end the experiment. We press enter, and the same output we got above should appear.
Ensuring that the data types are correct is critical in PyCaret, as it automatically performs some pre-processing tasks that are essential to any ML experiment. These tasks are performed differently for each type of data, which means that it is very important that they are correctly configured.
We could override the data types inferred by PyCaret using the numeric_features and categorical_features parameters in setup().
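If, for example, we wanted to force certain columns to a specific type, a hypothetical call could look like this (the column names come from this dataset; which columns you override is a judgment call, and this is not part of the pipeline we actually run in this tutorial):
# hypothetical sketch: override PyCaret's automatic type inference for some columns
exp_custom = setup(data=data, target='default', session_id=123,
                   categorical_features=['SEX', 'EDUCATION', 'MARRIAGE'],
                   numeric_features=['AGE'])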
Once the setup has been successfully executed, an information grid containing several important pieces of information is printed. Most of this information relates to the pre-processing pipeline that is built when you run setup().
Most of these features are out of scope for the purposes of this tutorial; however, some important things to keep in mind at this stage include:
- session_id: a pseudo-random number used as a seed in all functions for later reproducibility.
- Label Encoded: (0: No, 1: Yes), the reference encoding of the target.
- Original Data: (22800, 24) ==> remember, this is our "seen" data.
- Missing Values: shows True when the original data contains missing values.
- Transformed Train Set: the (22800, 24) data is transformed into (15959, 91) for the train set, and the number of features has increased from 24 to 91 due to the categorical encoding.
- Transformed Test Set: there are 6,841 samples in the test set. This split is based on the default value of 70/30, which can be changed using the train_size parameter in the configuration.
Note how some tasks that are imperative for modeling are handled automatically, such as imputation of missing values (in this case there are no missing values in the training data, but we still need imputers for the unseen data), categorical encoding, etc.
Most of the setup() parameters are optional and are used to customize the preprocessing pipeline.
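As a sketch only (we keep the default configuration in this tutorial), some of the optional parameters listed in the information grid above could be set like this:
# sketch: optional setup() parameters to customize the preprocessing pipeline
exp_custom = setup(data=data, target='default', session_id=123,
                   train_size=0.8,      # change the default 70:30 split
                   normalize=True,      # scale the numeric features
                   fix_imbalance=True)  # balance the target with SMOTE (the default method)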
In order to understand how PyCaret compares the models and the next steps in the pipeline, it is necessary to understand the concept of N-fold cross-validation.
Calculating how much of your data should be divided into your test set is a delicate question. If your training set is too small, your algorithm may not have enough data to learn effectively. On the other hand, if your test set is too small, then your accuracy, precision, recall and F1 score could have a large variation.
You may be very lucky or very unlucky! In general, putting 70% of your data in the training set and 30% of your data in the test set is a good starting point. Sometimes your data set is so small that dividing it 70/30 will result in a large amount of variance.
One solution to this is to perform N-fold cross-validation. The central idea is that we repeat this whole process N times and then average the accuracy. For example, in 10-fold cross-validation, we make the first 10% of the data the test set and calculate the accuracy, precision, recall and F1 score. Then we make the second 10% of the data the test set and calculate these statistics again. We can do this process 10 times, and each time the test set will be a different piece of data. We then average all the accuracies to get a better idea of how our model works on average.
Note: the validation set (shown in yellow in the cross-validation diagram) is the test set in our case.
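To make the idea concrete outside of PyCaret, here is a minimal sketch with scikit-learn (for illustration only; PyCaret performs the equivalent internally when it reports per-fold scores):
# illustration: 10-fold stratified cross-validation with scikit-learn
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
X = data.drop('default', axis=1)
y = data['default']
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)
scores = cross_val_score(DecisionTreeClassifier(random_state=123), X, y, cv=folds, scoring='accuracy')
print(scores.mean(), scores.std())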
Understanding the accuracy of your model is invaluable because you can start adjusting its parameters to increase performance. For example, in the K-Nearest Neighbors algorithm, you can see what happens to the accuracy as you increase or decrease K. Once you are satisfied with the performance of your model, it is time to bring in the validation set. This is the part of your data that you split off at the beginning of your experiment (data_unseen in our case).
It is meant to be a substitute for the real-world data that you are actually interested in classifying. It works very much like the test set, except that you never touched this data while building or refining your model. By computing the evaluation metrics on it, you get a good understanding of how well your algorithm will perform in the real world.
Comparing all models to evaluate performance is the recommended starting point for modeling once the PyCaret setup() is completed (unless you know exactly what type of model is needed, which is often not the case). This function trains all models in the model library and scores them using stratified cross-validation for metric evaluation.
The output prints a score grid that shows the average Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with the training times. Let's do it!
best_model = compare_models()
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) | |
---|---|---|---|---|---|---|---|---|---|
ridge | Ridge Classifier | 0.8254 | 0.0000 | 0.3637 | 0.6913 | 0.4764 | 0.3836 | 0.4122 | 0.0360 |
lda | Linear Discriminant Analysis | 0.8247 | 0.7634 | 0.3755 | 0.6794 | 0.4835 | 0.3884 | 0.4132 | 0.2240 |
gbc | Gradient Boosting Classifier | 0.8226 | 0.7789 | 0.3551 | 0.6806 | 0.4664 | 0.3725 | 0.4010 | 1.8550 |
ada | Ada Boost Classifier | 0.8221 | 0.7697 | 0.3505 | 0.6811 | 0.4626 | 0.3690 | 0.3983 | 0.4490 |
catboost | CatBoost Classifier | 0.8215 | 0.7760 | 0.3657 | 0.6678 | 0.4724 | 0.3759 | 0.4007 | 5.0580 |
lightgbm | Light Gradient Boosting Machine | 0.8210 | 0.7750 | 0.3609 | 0.6679 | 0.4683 | 0.3721 | 0.3977 | 0.1440 |
rf | Random Forest Classifier | 0.8199 | 0.7598 | 0.3663 | 0.6601 | 0.4707 | 0.3727 | 0.3965 | 1.0680 |
xgboost | Extreme Gradient Boosting | 0.8160 | 0.7561 | 0.3629 | 0.6391 | 0.4626 | 0.3617 | 0.3829 | 1.6420 |
et | Extra Trees Classifier | 0.8092 | 0.7377 | 0.3677 | 0.6047 | 0.4571 | 0.3497 | 0.3657 | 0.9820 |
lr | Logistic Regression | 0.7814 | 0.6410 | 0.0003 | 0.1000 | 0.0006 | 0.0003 | 0.0034 | 0.7750 |
knn | K Neighbors Classifier | 0.7547 | 0.5939 | 0.1763 | 0.3719 | 0.2388 | 0.1145 | 0.1259 | 0.4270 |
dt | Decision Tree Classifier | 0.7293 | 0.6147 | 0.4104 | 0.3878 | 0.3986 | 0.2242 | 0.2245 | 0.1430 |
svm | SVM - Linear Kernel | 0.7277 | 0.0000 | 0.1017 | 0.1671 | 0.0984 | 0.0067 | 0.0075 | 0.2180 |
qda | Quadratic Discriminant Analysis | 0.4886 | 0.5350 | 0.6176 | 0.2435 | 0.3453 | 0.0485 | 0.0601 | 0.1760 |
nb | Naive Bayes | 0.3760 | 0.6442 | 0.8845 | 0.2441 | 0.3826 | 0.0608 | 0.1207 | 0.0380 |
The compare_models() function allows you to compare many models at once. This is one of the great advantages of using PyCaret: in one line you have a comparison table between many models. Two simple words of code (not even one line) have trained and evaluated more than 15 models using N-fold cross-validation.
The table printed above highlights the highest-performing metrics for comparison purposes only. The default table is sorted by Accuracy (highest to lowest), which can be changed by passing a parameter. For example, compare_models(sort = 'Recall') will sort the grid by Recall instead of Accuracy.
If you want to change the number of folds from the default value of 10, you can use the fold parameter. For example, compare_models(fold = 5) will compare all models using 5-fold cross-validation.
Reducing the number of folds will improve the training time.
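For example, a quick sketch combining both parameters (we stick with the defaults in this tutorial):
# sketch: rank models by Recall using 5-fold cross-validation
best_recall_model = compare_models(sort='Recall', fold=5)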
By default, compare_models returns the best-performing model based on the default sort order, but it can also return a list of the top N models via the n_select parameter. In addition, it reports metrics such as Accuracy, AUC and F1, and the library automatically highlights the best results. Once you choose your model, you can create it and then refine it. Let's move on to the other methods.
print(best_model)
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=123, solver='auto', tol=0.001)
create_model is the most granular function in PyCaret and is often the basis for most of PyCaret's functionality. As its name indicates, this function trains and evaluates a model using cross-validation, which can be set with the fold parameter. The output prints a score grid showing Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold.
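For instance, a minimal sketch of training a single model with a custom number of folds (below we use the default of 10):
# sketch: train a random forest with 5-fold instead of 10-fold cross-validation
rf_5fold = create_model('rf', fold=5)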
For the rest of this tutorial, we will work with the following models as our candidate models. The selections are for illustrative purposes only and do not necessarily mean that they are the best performers or ideal for this type of data.
There are 18 classifiers available in the PyCaret model library. To see a list of all classifiers, check the documentation or use the models() function to view the library.
models()
Name | Reference | Turbo | |
---|---|---|---|
ID | |||
lr | Logistic Regression | sklearn.linear_model._logistic.LogisticRegression | True |
knn | K Neighbors Classifier | sklearn.neighbors._classification.KNeighborsCl... | True |
nb | Naive Bayes | sklearn.naive_bayes.GaussianNB | True |
dt | Decision Tree Classifier | sklearn.tree._classes.DecisionTreeClassifier | True |
svm | SVM - Linear Kernel | sklearn.linear_model._stochastic_gradient.SGDC... | True |
rbfsvm | SVM - Radial Kernel | sklearn.svm._classes.SVC | False |
gpc | Gaussian Process Classifier | sklearn.gaussian_process._gpc.GaussianProcessC... | False |
mlp | MLP Classifier | pycaret.internal.tunable.TunableMLPClassifier | False |
ridge | Ridge Classifier | sklearn.linear_model._ridge.RidgeClassifier | True |
rf | Random Forest Classifier | sklearn.ensemble._forest.RandomForestClassifier | True |
qda | Quadratic Discriminant Analysis | sklearn.discriminant_analysis.QuadraticDiscrim... | True |
ada | Ada Boost Classifier | sklearn.ensemble._weight_boosting.AdaBoostClas... | True |
gbc | Gradient Boosting Classifier | sklearn.ensemble._gb.GradientBoostingClassifier | True |
lda | Linear Discriminant Analysis | sklearn.discriminant_analysis.LinearDiscrimina... | True |
et | Extra Trees Classifier | sklearn.ensemble._forest.ExtraTreesClassifier | True |
xgboost | Extreme Gradient Boosting | xgboost.sklearn.XGBClassifier | True |
lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMClassifier | True |
catboost | CatBoost Classifier | catboost.core.CatBoostClassifier | True |
dt = create_model('dt')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.7343 | 0.6257 | 0.4327 | 0.4005 | 0.4160 | 0.2444 | 0.2447 |
1 | 0.7325 | 0.6277 | 0.4384 | 0.3984 | 0.4175 | 0.2443 | 0.2448 |
2 | 0.7431 | 0.6282 | 0.4241 | 0.4146 | 0.4193 | 0.2544 | 0.2544 |
3 | 0.7274 | 0.6151 | 0.4155 | 0.3856 | 0.4000 | 0.2240 | 0.2242 |
4 | 0.7187 | 0.6054 | 0.4040 | 0.3691 | 0.3858 | 0.2038 | 0.2042 |
5 | 0.7187 | 0.6014 | 0.3897 | 0.3656 | 0.3773 | 0.1958 | 0.1960 |
6 | 0.7206 | 0.6128 | 0.4212 | 0.3760 | 0.3973 | 0.2162 | 0.2168 |
7 | 0.7331 | 0.5986 | 0.3610 | 0.3830 | 0.3717 | 0.2024 | 0.2026 |
8 | 0.7206 | 0.6045 | 0.3983 | 0.3707 | 0.3840 | 0.2036 | 0.2038 |
9 | 0.7442 | 0.6272 | 0.4195 | 0.4148 | 0.4171 | 0.2533 | 0.2533 |
Mean | 0.7293 | 0.6147 | 0.4104 | 0.3878 | 0.3986 | 0.2242 | 0.2245 |
SD | 0.0092 | 0.0112 | 0.0218 | 0.0174 | 0.0173 | 0.0218 | 0.0218 |
#trained model object is stored in the variable 'dt'.
print(dt)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=123, splitter='best')
knn = create_model('knn')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.7469 | 0.6020 | 0.1920 | 0.3545 | 0.2491 | 0.1128 | 0.1204 |
1 | 0.7550 | 0.5894 | 0.2092 | 0.3883 | 0.2719 | 0.1402 | 0.1500 |
2 | 0.7506 | 0.5883 | 0.1576 | 0.3459 | 0.2165 | 0.0923 | 0.1024 |
3 | 0.7419 | 0.5818 | 0.1519 | 0.3136 | 0.2046 | 0.0723 | 0.0790 |
4 | 0.7563 | 0.5908 | 0.1490 | 0.3611 | 0.2110 | 0.0954 | 0.1085 |
5 | 0.7550 | 0.5997 | 0.1748 | 0.3720 | 0.2378 | 0.1139 | 0.1255 |
6 | 0.7638 | 0.5890 | 0.1891 | 0.4125 | 0.2593 | 0.1413 | 0.1565 |
7 | 0.7613 | 0.6240 | 0.1633 | 0.3904 | 0.2303 | 0.1163 | 0.1318 |
8 | 0.7619 | 0.5988 | 0.1862 | 0.4037 | 0.2549 | 0.1356 | 0.1500 |
9 | 0.7549 | 0.5756 | 0.1897 | 0.3771 | 0.2524 | 0.1246 | 0.1351 |
Mean | 0.7547 | 0.5939 | 0.1763 | 0.3719 | 0.2388 | 0.1145 | 0.1259 |
SD | 0.0065 | 0.0126 | 0.0191 | 0.0279 | 0.0214 | 0.0214 | 0.0230 |
print(knn)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
rf = create_model('rf')
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8133 | 0.7673 | 0.3610 | 0.6269 | 0.4582 | 0.3551 | 0.3749 |
1 | 0.8239 | 0.7615 | 0.3782 | 0.6735 | 0.4844 | 0.3882 | 0.4117 |
2 | 0.8258 | 0.7708 | 0.3467 | 0.7076 | 0.4654 | 0.3756 | 0.4098 |
3 | 0.8177 | 0.7605 | 0.3725 | 0.6436 | 0.4719 | 0.3710 | 0.3913 |
4 | 0.8208 | 0.7642 | 0.3725 | 0.6599 | 0.4762 | 0.3780 | 0.4006 |
5 | 0.8283 | 0.7638 | 0.3954 | 0.6866 | 0.5018 | 0.4070 | 0.4297 |
6 | 0.8127 | 0.7647 | 0.3582 | 0.6250 | 0.4554 | 0.3522 | 0.3721 |
7 | 0.8283 | 0.7390 | 0.3553 | 0.7168 | 0.4751 | 0.3861 | 0.4202 |
8 | 0.8108 | 0.7496 | 0.3610 | 0.6146 | 0.4549 | 0.3496 | 0.3678 |
9 | 0.8176 | 0.7565 | 0.3621 | 0.6462 | 0.4641 | 0.3645 | 0.3867 |
Mean | 0.8199 | 0.7598 | 0.3663 | 0.6601 | 0.4707 | 0.3727 | 0.3965 |
SD | 0.0062 | 0.0089 | 0.0131 | 0.0335 | 0.0139 | 0.0172 | 0.0202 |
print(rf)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
Note that the mean score of each model matches the score printed by compare_models(). This is because the metrics printed in the compare_models() score grid are the average scores across all folds.
You can also see in the print() output of each model the hyperparameters with which it was built. This is very important because it is the basis for improving them. You can see the parameters for RandomForestClassifier:
max_depth=None
max_features='auto'
min_samples_leaf=1
min_samples_split=2
min_weight_fraction_leaf=0.0
n_estimators=100
n_jobs=-1
When creating a model using the create_model() function, the default hyperparameters are used to train the model. To tune the hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model using a random grid search over a predefined search space.
The output prints a score grid showing Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold for the best model. To use a custom search grid, you can pass the custom_grid parameter to the tune_model function.
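As a sketch (the grid values below are purely illustrative, not the grid used in this tutorial), a custom search space for the random forest could be passed like this:
# sketch: tune the random forest over a custom, illustrative search space
custom_params = {'n_estimators': [100, 150, 200],
                 'max_depth': [5, 10, None],
                 'min_samples_leaf': [1, 3, 5]}
tuned_rf_custom = tune_model(rf, custom_grid=custom_params)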
tuned_rf = tune_model(rf)
Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|
0 | 0.8158 | 0.7508 | 0.3181 | 0.6647 | 0.4302 | 0.3363 | 0.3689 |
1 | 0.8283 | 0.7675 | 0.3295 | 0.7419 | 0.4563 | 0.3719 | 0.4152 |
2 | 0.8139 | 0.7337 | 0.3181 | 0.6529 | 0.4277 | 0.3321 | 0.3628 |
3 | 0.8246 | 0.7588 | 0.3095 | 0.7347 | 0.4355 | 0.3514 | 0.3976 |
4 | 0.8170 | 0.7567 | 0.3438 | 0.6557 | 0.4511 | 0.3539 | 0.3805 |
5 | 0.8258 | 0.7506 | 0.3324 | 0.7205 | 0.4549 | 0.3676 | 0.4067 |
6 | 0.8170 | 0.7530 | 0.3324 | 0.6629 | 0.4427 | 0.3474 | 0.3771 |
7 | 0.8221 | 0.7507 | 0.3381 | 0.6901 | 0.4538 | 0.3621 | 0.3951 |
8 | 0.8177 | 0.7201 | 0.2980 | 0.6933 | 0.4168 | 0.3286 | 0.3699 |
9 | 0.8207 | 0.7484 | 0.3132 | 0.6987 | 0.4325 | 0.3439 | 0.3831 |
Mean | 0.8203 | 0.7490 | 0.3233 | 0.6915 | 0.4402 | 0.3495 | 0.3857 |
SD | 0.0045 | 0.0126 | 0.0135 | 0.0310 | 0.0129 | 0.0140 | 0.0165 |
If we compare the Accuracy of this tuned RandomForestClassifier with the previous RandomForestClassifier, we see a difference: it went from an Accuracy of 0.8199 to an Accuracy of 0.8203.
#tuned model object is stored in the variable 'tuned_rf'.
print(tuned_rf)
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={}, criterion='entropy', max_depth=5, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
Now let's compare the hyperparameters. We had these before:
max_depth=None
max_features='auto'
min_samples_leaf=1
min_samples_split=2
min_weight_fraction_leaf=0.0
n_estimators=100
n_jobs=-1
Now these:
max_depth=5
max_features=1.0
min_samples_leaf=5
min_samples_split=10
min_weight_fraction_leaf=0.0
n_estimators=150
n_jobs=-1
You can make this same comparison with knn and dt yourself and explore the differences in the hyperparameters, as sketched below.
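A minimal sketch of that exercise (output omitted):
# sketch: tune the other candidate models and inspect their hyperparameters
tuned_knn = tune_model(knn)
tuned_dt = tune_model(dt)
print(tuned_knn)
print(tuned_dt)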
By default, tune_model optimizes Accuracy, but this can be changed using the optimize parameter. For example, tune_model(dt, optimize = 'AUC') will search for the hyperparameters of a Decision Tree Classifier that result in the highest AUC instead of Accuracy. For the purposes of this example, we have used the default metric, Accuracy, only for simplicity.
Generally, when the data set is imbalanced (like the credit data set we are working with), Accuracy is not a good metric to consider. The methodology underlying the selection of the correct metric to evaluate a classifier is beyond the scope of this tutorial.
Metrics alone are not the only criteria you should consider when selecting the best model for production. Other factors to consider include training time, the standard deviation across folds, etc. For now, let's go ahead and consider the tuned Random Forest Classifier, tuned_rf, as our best model for the rest of this tutorial.
Before finalizing the model, the plot_model() function can be used to analyze performance from different aspects such as the AUC, the confusion matrix, the decision boundary, etc. This function takes a trained model object and returns a plot based on the training/test set.
There are 15 different plots available; please refer to the plot_model() documentation for a list of available plots.
## AUC Plot
plot_model(tuned_rf, plot = 'auc')
## Precision-recall curve
plot_model(tuned_rf, plot = 'pr')
## feature importance
plot_model(tuned_rf, plot='feature')
## Confusion matrix
plot_model(tuned_rf, plot = 'confusion_matrix')
Another way to analyze model performance is to use the evaluate_model() function, which displays a user interface with all of the available plots for a given model. Internally it uses the plot_model() function.
evaluate_model(tuned_rf)
Finalizing the model is the last step of the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by comparing all models using compare_models() and pre-selecting some candidate models (based on the metric of interest) to apply various modeling techniques, such as hyperparameter tuning, ensembling, stacking, etc.
This workflow will eventually lead you to the best model to use for making predictions on new and unseen data. The finalize_model() function fits the model to the complete data set, including the test sample (30% in this case). The purpose of this function is to train the model on the complete data set before it is deployed into production. We can execute this method before or after predict_model(); here we will run finalize_model() first and then predict_model().
One last word of caution: once the model is finalized using finalize_model(), the entire data set, including the test set, is used for training. Therefore, if the model is used to make predictions on the test set after finalize_model() has been used, the printed information grid will be misleading, since it is trying to make predictions on the same data that was used for modeling.
To demonstrate this point, we will use final_rf in predict_model() to compare the resulting information grid with the previous one.
final_rf = finalize_model(tuned_rf)
#Final Random Forest model parameters for deployment
print(final_rf)
RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={}, criterion='entropy', max_depth=5, max_features=1.0, max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0002, min_impurity_split=None, min_samples_leaf=5, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=-1, oob_score=False, random_state=123, verbose=0, warm_start=False)
Before finalizing the model, it is advisable to perform a final check by predicting the test/hold-out set and reviewing the evaluation metrics. If you look at the information grid above, you will see that 30% of the data (6,841 samples) was separated as the test/hold-out set.
All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using our finalized model stored in the final_rf variable, we predict against the test/hold-out sample and evaluate the metrics to see if they are materially different from the cross-validation results.
predict_model(final_rf)
Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | |
---|---|---|---|---|---|---|---|---|
0 | Random Forest Classifier | 0.8184 | 0.7526 | 0.3533 | 0.6985 | 0.4692 | 0.3736 | 0.4053 |
LIMIT_BAL | AGE | BILL_AMT1 | BILL_AMT2 | BILL_AMT3 | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | ... | PAY_6_2 | PAY_6_3 | PAY_6_4 | PAY_6_5 | PAY_6_6 | PAY_6_7 | PAY_6_8 | default | Label | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 80000.0 | 29.0 | 6228.0 | 589.0 | 390.0 | 390.0 | 390.0 | 383.0 | 589.0 | 390.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.8051 |
1 | 180000.0 | 30.0 | 149069.0 | 152317.0 | 156282.0 | 161163.0 | 172190.0 | 148963.0 | 7500.0 | 8000.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0.9121 |
2 | 100000.0 | 26.0 | 18999.0 | 23699.0 | 9390.0 | 5781.0 | 8065.0 | 17277.0 | 5129.0 | 1227.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.8051 |
3 | 500000.0 | 36.0 | 396.0 | 1043.0 | 19230.0 | 116696.0 | 194483.0 | 195454.0 | 1043.0 | 19230.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.9121 |
4 | 190000.0 | 47.0 | 192493.0 | 193297.0 | 193400.0 | 193278.0 | 192956.0 | 193039.0 | 7200.0 | 7222.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.9121 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6836 | 120000.0 | 44.0 | 75294.0 | 76465.0 | 74675.0 | 79629.0 | 77748.0 | 82497.0 | 3000.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.5013 |
6837 | 50000.0 | 26.0 | 47095.0 | 48085.0 | 49039.0 | 49662.0 | 0.0 | 0.0 | 2073.0 | 2027.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.8603 |
6838 | 80000.0 | 39.0 | 46401.0 | 39456.0 | 30712.0 | 29629.0 | 28241.0 | 28030.0 | 1560.0 | 1421.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.8051 |
6839 | 200000.0 | 33.0 | 50612.0 | 10537.0 | 5552.0 | 2506.0 | 9443.0 | 11818.0 | 10023.0 | 27.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.8051 |
6840 | 210000.0 | 35.0 | 25806.0 | 5861.0 | 1666.0 | 1010.0 | 300.0 | 300.0 | 1035.0 | 1666.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.9121 |
6841 rows × 91 columns
The accuracy on the test set is 0.8184 compared to the 0.8203 achieved in the cross-validated results of tuned_rf.
This is not a significant difference. If there is a large variation between the results of the test set and the training set, this would normally indicate an over-fitting, but it could also be due to several other factors and would require further investigation.
In this case, we will proceed with the completion of the model and the prediction on unseen data (the 5% that we had separated at the beginning and that was never exposed to PyCaret).
(TIP: It is always good to look at the standard deviation of the training-set results when using create_model().)
The predict_model() function is also used to predict on the unseen data set. The only difference is that this time we pass the data_unseen parameter. data_unseen is the variable created at the beginning of the tutorial and contains 5% of the original data set (1,200 samples) that was never exposed to PyCaret.
unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | Label | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | 0 | ... | 567.0 | 380.0 | 601.0 | 0.0 | 581.0 | 1687.0 | 1542.0 | 0 | 0 | 0.8051 |
1 | 380000 | 1 | 2 | 2 | 32 | -1 | -1 | -1 | -1 | -1 | ... | 11873.0 | 21540.0 | 15138.0 | 24677.0 | 11851.0 | 11875.0 | 8251.0 | 0 | 0 | 0.9121 |
2 | 200000 | 2 | 2 | 1 | 32 | -1 | -1 | -1 | -1 | 2 | ... | 3151.0 | 5818.0 | 15.0 | 9102.0 | 17.0 | 3165.0 | 1395.0 | 0 | 0 | 0.8051 |
3 | 200000 | 1 | 1 | 1 | 53 | 2 | 2 | 2 | 2 | 2 | ... | 149531.0 | 6300.0 | 5500.0 | 5500.0 | 5500.0 | 5000.0 | 5000.0 | 1 | 1 | 0.7911 |
4 | 240000 | 1 | 1 | 2 | 41 | 1 | -1 | -1 | 0 | 0 | ... | 1737.0 | 2622.0 | 3301.0 | 0.0 | 360.0 | 1737.0 | 924.0 | 0 | 0 | 0.9121 |
5 rows × 26 columns
Please look at the last columns of this result: you will see two new features, Label and Score. Label is the prediction and Score is the probability of the prediction. Note that the predicted results are concatenated with the original data set, while all transformations are performed automatically in the background.
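As a small sketch of how these columns can be used (relying only on the Label and Score columns shown above):
# sketch: count predicted defaults and inspect the most confident predictions
print(unseen_predictions['Label'].value_counts())
print(unseen_predictions.sort_values('Score', ascending=False).head())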
We have finished the experiment by finalizing the tuned_rf model, which is now stored in the final_rf variable. We have also used the model stored in final_rf to predict data_unseen. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the whole experiment again? The answer is no. PyCaret's built-in save_model() function allows you to save the model, along with the entire transformation pipeline, for later use; it is stored as a pickle file in the local environment.
(TIP: It's always good to use the date in the file name when saving models, it's good for version control)
Let's see it in the next step
save_model(final_rf, 'datasets/Final RF Model 19Nov2020')
To load a saved model at a future date, in the same or an alternative environment, we use PyCaret's load_model() function and then easily apply the saved model to new unseen data for prediction.
saved_final_rf = load_model('datasets/Final RF Model 19Nov2020')
Transformation Pipeline and Model Successfully Loaded
Once the model is loaded into the environment, it can simply be used to predict any new data using the same predict_model() function. Below we apply the loaded model to predict the same data_unseen we used before.
new_prediction = predict_model(saved_final_rf, data=data_unseen)
new_prediction.head()
LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default | Label | Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100000 | 2 | 2 | 2 | 23 | 0 | -1 | -1 | 0 | 0 | ... | 567.0 | 380.0 | 601.0 | 0.0 | 581.0 | 1687.0 | 1542.0 | 0 | 0 | 0.8051 |
1 | 380000 | 1 | 2 | 2 | 32 | -1 | -1 | -1 | -1 | -1 | ... | 11873.0 | 21540.0 | 15138.0 | 24677.0 | 11851.0 | 11875.0 | 8251.0 | 0 | 0 | 0.9121 |
2 | 200000 | 2 | 2 | 1 | 32 | -1 | -1 | -1 | -1 | 2 | ... | 3151.0 | 5818.0 | 15.0 | 9102.0 | 17.0 | 3165.0 | 1395.0 | 0 | 0 | 0.8051 |
3 | 200000 | 1 | 1 | 1 | 53 | 2 | 2 | 2 | 2 | 2 | ... | 149531.0 | 6300.0 | 5500.0 | 5500.0 | 5500.0 | 5000.0 | 5000.0 | 1 | 1 | 0.7911 |
4 | 240000 | 1 | 1 | 2 | 41 | 1 | -1 | -1 | 0 | 0 | ... | 1737.0 | 2622.0 | 3301.0 | 0.0 | 360.0 | 1737.0 | 924.0 | 0 | 0 | 0.9121 |
5 rows × 26 columns
from pycaret.utils import check_metric
check_metric(new_prediction.default, new_prediction.Label, 'Accuracy')
0.8167
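Assuming the other metric names are supported by check_metric() (we have only verified 'Accuracy' here), the same call can be used to check additional metrics:
# sketch: check other metrics on the unseen-data predictions (metric names assumed)
check_metric(new_prediction.default, new_prediction.Label, 'Recall')
check_metric(new_prediction.default, new_prediction.Label, 'Precision')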
As with any new library, there is still room for improvement. We'll list some of the pros and cons we found while using the library.
This tutorial has covered the entire ML process, from data ingestion, pre-processing, model training and hyper-parameter tuning to prediction and storing the model for later use. We have completed all these steps in less than 10 commands that are naturally constructed and very intuitive to remember, such as create_model(), tune_model() and compare_models(). Recreating the whole experiment without PyCaret would have required more than 100 lines of code in most libraries.
The library also allows you to do more advanced things, such as advanced pre-processing, ensembling, generalized stacking, and other techniques that allow you to fully customize the ML pipeline and are a must for any data scientist.