A few days ago we finished the data science competition "Predicting Purchase Intention on a Web Page". 76 data scientists joined, 35 of them submitted at least one machine learning model to the platform, and in total we received and evaluated more than 3,100 models. The main conclusion we draw from this is the need to build different models, evaluate their effectiveness and find the one with the best result.
The best result obtained a score of 0.817631034576143, followed closely by another with a score of 0.816491504853038.
Given these good results, we wanted to know in detail what the top-placed competitors did. Here are our questions and the winners' answers.
Cristian Camilo Hidalgo Garcia - Colombia - First Place
Q: In general terms, how did you approach the problem posed in the competition?
A: First, understand the problem: what is the objective, what do you want to answer, and how do the covariates relate to the variable of interest? (Supervised learning in this case; objective: classification.)
Q: For this particular competition, did you have any prior experience in this field?
A: Yes. I am a statistician and I work on data modeling all the time: how to extract useful information for my company from a data set, and which transformations and cleaning steps help to get a better fit of the models.
Q: What important findings/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: I found that in certain cases the transformations of categorical variables play an important role in the final outcome of the model. The challenge was to process both the test and validation sets in such a way that the transformations and the preprocessing were homogeneous. Variables such as the time a user spends on a web page help to understand purchase intent.
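Editor's note: one common way to keep an encoding homogeneous across data sets is to fit the column layout on the training data and force every other set into it. The sketch below is only an illustration with pandas, not the winner's actual code; the column names and values are made up.

```python
import pandas as pd

# Toy frames; column names and values are hypothetical.
train = pd.DataFrame({
    "visitor_type": ["new", "returning", "other"],
    "duration": [10.0, 250.0, 42.0],
})
test = pd.DataFrame({
    "visitor_type": ["returning", "new"],
    "duration": [5.0, 90.0],
})

# One-hot encode the training set; its columns define the reference layout.
train_enc = pd.get_dummies(train, columns=["visitor_type"], dtype=int)

# Encode the test set, then reindex to the training layout.
# Categories missing from the test set become all-zero columns.
test_enc = pd.get_dummies(test, columns=["visitor_type"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After the reindex, both frames have the same columns in the same order, so a model fitted on one can score the other.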
Q: In general terms, what data processing and feature engineering did you do for this competition?
A: As I used tree-based methods, no scaling of the quantitative variables was necessary, which gives an advantage in interpretability as these remain at their original scale. For the categorical variables I used one-hot encoding, and all this transformation was done prior to test-train joining to make sure to keep the same structure. I studied the ratio of ones and zeros to understand whether the classes were unbalanced (as indeed they were) and whether I should use an optimal cutoff point for classification.
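Editor's note: a simple way to pick a cutoff for an unbalanced target is to scan a grid of thresholds on held-out predictions and keep the one with the best score. The sketch below uses F1 as the score, which is an assumption (the interview does not say which criterion was used), and the labels and probabilities are toy values.

```python
import numpy as np

def best_cutoff(y_true, proba, grid=None):
    """Return (threshold, f1) for the grid threshold with the highest F1."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (proba >= t).astype(int)
        tp = int(((pred == 1) & (y_true == 1)).sum())
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict: keeps the lowest threshold on ties
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy validation labels and predicted probabilities (hypothetical).
y_val = [0, 0, 0, 0, 1, 1]
p_val = [0.12, 0.22, 0.28, 0.33, 0.42, 0.81]
cutoff, score = best_cutoff(y_val, p_val)
```

In this toy case the default 0.5 cutoff would miss one of the two positives, while a lower cutoff separates the classes.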
Q: What Machine Learning algorithms did you use for the competition?
A: I used about 10 machine learning classification algorithms, including xgboost, catboost, lightgbm, random forest, boosting, adaboost, CART, logistic regression and extremely randomized trees. Then I studied which ensembles extracted more information from the dataset.
Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: The podium is between catboost and xgboost. Most likely the boosting method, which is the basis of both, manages to map a non-parametric structure from partitions generated in the space of the covariates, which improves the explanation of the response variable. I also found that, for this particular problem, random forest is a very good ensemble method.
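Editor's note: the idea of building a prediction from successive partitions of the covariates can be sketched with a toy gradient boosting loop. This is a deliberately minimal version (one feature, squared error, depth-1 trees) written for illustration; it is not how catboost or xgboost are implemented, and every name in it is hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on a 1-D feature that minimizes squared error.

    Assumes x has at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = x <= threshold
        left_val = residual[left].mean()
        right_val = residual[~left].mean()
        err = ((residual - np.where(left, left_val, right_val)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, threshold, left_val, right_val)
    return best[1:]

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        t, left_val, right_val = fit_stump(x, y - pred)
        step = (t, learning_rate * left_val, learning_rate * right_val)
        stumps.append(step)
        pred = pred + np.where(x <= t, step[1], step[2])
    return base, stumps

def predict(x, base, stumps):
    """Sum the base value and the shrunken contribution of every stump."""
    pred = np.full(len(x), base)
    for t, left_val, right_val in stumps:
        pred = pred + np.where(x <= t, left_val, right_val)
    return pred

# Toy step function: each round moves the fit a little closer to it.
x = np.arange(10.0)
y = (x >= 5).astype(float)
base, stumps = boost(x, y)
mse = ((y - predict(x, base, stumps)) ** 2).mean()
```

Each stump is one partition of the feature axis; the final model is the sum of many such partitions, which is why no parametric form has to be assumed.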
Q: What libraries did you use for this particular competition?
A: I used scikit-learn (with many of its functions), catboost, pandas, numpy, xgboost, lightgbm, seaborn and matplotlib.
Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have five years of data science experience and I am currently working for the company Seguros Suramericana S. A.
Q: What advice would you give to those who did not score so well in the competition?
A: Do very good feature engineering. Work from an understanding of the problem when creating new variables, so that they help explain the variable of interest through the joint variation of the covariates.
Oscar Bartolome Pato - Spain - Second Place
Q: In general terms, how did you approach the problem posed in the competition?
A: First I did an exploratory analysis of the data, then I preprocessed it, and finally I tested different models.
Q: For this particular competition, did you have any previous experience in this field?
A: I am currently training in Data Science and had implemented some models, but not professionally.
Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The data are complete (no NA).
The numerical variables are on different scales.
Also, in certain variables there were outliers.
Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Centering and scaling of numerical variables.
Elimination of outliers.
Elimination of variables with zero or near-zero variance.
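Editor's note: the three steps above can be sketched as follows. Oscar worked in R (tidyverse, caret and H2O), so this Python version is a loose illustration rather than his code. The IQR rule for outliers and the plain variance threshold are assumptions, since the interview does not say which criteria were used, and the function and column names are hypothetical.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, var_tol: float = 1e-8, iqr_k: float = 1.5) -> pd.DataFrame:
    """Drop near-constant columns, drop outlier rows, then center and scale.

    Assumes every column is numeric.
    """
    # 1) Remove variables with zero or near-zero variance.
    df = df.loc[:, df.var() > var_tol]

    # 2) Remove rows with any value outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    inside = ((df >= q1 - iqr_k * iqr) & (df <= q3 + iqr_k * iqr)).all(axis=1)
    df = df[inside]

    # 3) Center and scale each remaining variable.
    return (df - df.mean()) / df.std(ddof=0)

# Toy data: "b" is constant and the last row of "a" is an outlier.
raw = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [5.0, 5.0, 5.0, 5.0, 5.0],
    "c": [10.0, 20.0, 30.0, 40.0, 50.0],
})
clean = preprocess(raw)
```

In a real workflow the means, standard deviations and outlier bounds would be computed on the training set only and reused on the test set.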
Q: What Machine Learning algorithms did you use for the competition?
A: Logistic Regression, Random Forests and XGBoost.
Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: XGBoost. It is a model with difficult hyper-parameter settings, but with great predictive power.
Q: What libraries did you use for this particular competition?
A: tidyverse, caret and H2O.
Q: How many years of experience do you have in Data Science and where are you currently working?
A: I am currently training as a Data Scientist and working as a Data Engineer.
Q: What advice would you give to those who did not score so well in the competition?
A: As you learn about data science, look for datasets and put what you learn into practice.
Santiago Serna - Colombia - Third Place
Q: In general terms, how did you approach the problem posed in the competition?
A: I did an initial exploratory analysis of each variable vs. the target variable, in order to get an initial intuition of what each variable was and identify trends and possible transformations or new variables to create. I then applied variable selection techniques and tried some classification algorithms to finally optimize hyperparameters.
Q: For this particular competition, did you have any previous experience in this field?
A: No. My experience has been focused on financial data.
Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: For this problem the pagevalues variable was very important. This variable alone discriminates the target variable very well: just by setting revenue = 0 for all rows where pagevalues equals 0, and revenue = 1 for the rest, you get a metric of 0.79, and what the models managed to add on top of that was relatively little. I didn't have enough time to explore the problem separately; maybe I could have achieved better results...
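Editor's note: the rule described in this answer fits in one line. The sketch below applies it to a toy table; the rows are invented, and the `agreement` value is plain accuracy on those rows, not the competition metric.

```python
import pandas as pd

def pagevalues_rule(df: pd.DataFrame) -> pd.Series:
    """Predict revenue = 0 where pagevalues == 0, and revenue = 1 otherwise."""
    return (df["pagevalues"] != 0).astype(int)

# Toy sessions (hypothetical values).
toy = pd.DataFrame({
    "pagevalues": [0.0, 12.5, 0.0, 3.1],
    "revenue": [0, 1, 1, 1],
})
toy["baseline_pred"] = pagevalues_rule(toy)

# Share of toy rows where the rule matches the label.
agreement = (toy["baseline_pred"] == toy["revenue"]).mean()
```

A rule like this makes a useful baseline: any model that does not clearly beat it is adding little beyond this single variable.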
Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Standardization, creation of new features from the findings in the exploratory analysis and use of variable selection algorithms.
Q: What Machine Learning algorithms did you use for the competition?
A: I tested with LightGBM, XGBoost and Catboost.
Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: LightGBM, XGBoost and Catboost were the ones that gave me the best results.
Q: What libraries did you use for this particular competition?
A: scikit-learn, lightgbm, imbalanced-learn, borutapy, optuna
Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have 12 years in the Bancolombia group.
Q: What advice would you give to those who did not score so well in the competition?
A: The most important thing of all is to understand the data, understand the meaning of the variables, how they relate to the target variable and follow good practices when training the models to avoid overfitting.
Jonathan Loscalzo - Argentina - Eighth Place
Q: In general terms, how did you approach the problem posed in the competition?
A: What I usually do in most of these competitions is to quickly read the statement, find the main ideas like: type of model (classification/regression...), metrics to optimize (F1, accuracy, ...) and download the data to start testing.
After obtaining a complete flow to upload the results, I start iterating to improve it: adding feature engineering, testing models, tuning hyperparameters, etc. That is, a generalist flow.
On this occasion I wanted to test a framework and it was excellent for the occasion: optuna.
Q: For this particular competition, did you have any previous experience in this field?
A: I've been a developer for about 7 years, and a year and a half ago I realized that the work had become boring; making CRUDs was a monotonous task.
I discovered this whole world, and every day I'm researching and improving my skills.
Coming from a dev role and a university degree in software, it is easier for me to write source code, although my technical knowledge is not as strong as that of some people with a hard science background.
At the moment, I am in the process of looking for a full time position directly related to these topics.
Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: I don't recall finding anything unusual, but I may be lying :-)
Q: Generally speaking, what data processing and feature engineering did you do for this competition?
A: Overall, I could have looked for more options in FE, but I kept it pretty simple:
- MinMax Scaler for numeric columns
- transformed categorical columns into ordinals
- created Total_Duration to sum all columns with "_duration".
- created Duration_AVG for "_duration" columns
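Editor's note: the four steps in this list can be sketched with pandas as below. This is an illustration, not Jonathan's code: the column names are hypothetical, and computing the duration features before scaling is an assumption, since the interview does not give the order.

```python
import pandas as pd

def add_features(df: pd.DataFrame, categorical: list[str]) -> pd.DataFrame:
    out = df.copy()

    # Total and average over every column whose name ends in "_duration".
    duration_cols = [c for c in out.columns if c.endswith("_duration")]
    out["Total_Duration"] = out[duration_cols].sum(axis=1)
    out["Duration_AVG"] = out[duration_cols].mean(axis=1)

    # Categorical columns -> ordinal integer codes (categories sorted alphabetically).
    for col in categorical:
        out[col] = out[col].astype("category").cat.codes

    # Min-max scale the numeric columns to [0, 1]; constant columns map to 0.
    numeric = [c for c in out.columns if c not in categorical]
    low = out[numeric].min()
    span = (out[numeric].max() - low).replace(0, 1)
    out[numeric] = (out[numeric] - low) / span
    return out

# Toy sessions (hypothetical values).
sessions = pd.DataFrame({
    "admin_duration": [0.0, 10.0, 20.0],
    "product_duration": [10.0, 30.0, 80.0],
    "month": ["Feb", "Mar", "Feb"],
})
feats = add_features(sessions, categorical=["month"])
```

With scikit-learn, `MinMaxScaler` fitted on the training set would do the scaling step and could then be reused on the test set.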
Q: What Machine Learning algorithms did you use for the competition?
A: I tested several models: KNeighborsClassifier, RandomForestClassifier and LightGBM. I also tried stacking some models, but it seemed excessive and the results were practically unchanged (simple is better than complex).
Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: LightGBM. Most of the time the algorithms I choose (and which are usually the winners in other competitions) are LGBM, XGBoost and RandomForest.
These boosting algorithms are usually very good at finding patterns in the data.
Be careful because they also tend to overfit (RandomForest is bagging...).
There is no algorithm that solves all problems; there is no model that is the best of all.
I chose it because it was the simplest one that solved the problem.
Could we have used neural networks, or an ensemble of several models? Yes, but maybe we would have obtained the same result for much more effort.
LightGBM or XGBoost are usually good choices for tabular problems.
Q: What libraries did you use for this particular competition?
A: jupyter, sklearn, lightgbm, optuna
Q: How many years of experience do you have in Data Science and where are you currently working?
A: Let's say I have 1.5 years in DS (I don't identify myself with a role and I'm looking for an identity in this regard), and I have been working for 7 years at a software consulting firm.
Q: What advice would you give to those who did not score so well in the competition?
A: There are courses in Kaggle that usually cover the most common problems to solve.
Then research particular techniques that will make you improve your skills: some will read books, others will take courses.
There is a huge explosion in this topic, so there are thousands of online resources.
Julian Ismael Centeno - Peru - Ninth Place
Q: In general terms, how did you approach the problem posed in the competition?
A: First I did an EDA to find some insights. The first thing was to build a baseline model; on top of this model I started to create more variables and keep the most important ones.
Q: For this particular competition, did you have any previous experience in this field?
A: Data science and Kaggle competitions.
Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: Well, I think that by creating more variables my model had become overfitted, so I tried to keep only the most important variables.
Q: In general terms, what data processing and feature engineering did you do for this competition?
A: My feature engineering created variables such as flags, ratios and totals. For the categorical variables I created some interactions; for the low-cardinality variables I did one-hot encoding, and for the high-cardinality ones I created conditional probabilities.
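Editor's note: "conditional probabilities" for a high-cardinality variable usually means replacing each category with the observed rate of the target within that category. The sketch below shows that idea with pandas; it is an illustration under that assumption, not Julian's code, and the column names and values are hypothetical.

```python
import pandas as pd

def fit_conditional_prob(train: pd.DataFrame, col: str, target: str):
    """Estimate P(target = 1 | category) on the training set."""
    prior = train[target].mean()                  # fallback for unseen categories
    table = train.groupby(col)[target].mean()     # one rate per category
    return table, prior

def apply_conditional_prob(df: pd.DataFrame, col: str, table: pd.Series, prior: float) -> pd.Series:
    """Map each category to its rate; unseen categories get the global rate."""
    return df[col].map(table).fillna(prior)

# Toy data (hypothetical values).
train = pd.DataFrame({
    "region": ["a", "a", "b", "b", "b", "c"],
    "revenue": [1, 0, 0, 0, 1, 1],
})
test = pd.DataFrame({"region": ["a", "c", "z"]})

table, prior = fit_conditional_prob(train, "region", "revenue")
test["region_prob"] = apply_conditional_prob(test, "region", table, prior)
```

Because the encoding uses the target, computing it on the same rows the model is trained on can leak information and encourage overfitting; estimating it out of fold (for example with `KFold`) is a common safeguard.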
Q: What Machine Learning algorithms did you use for the competition?
A: lightgbm + catboost
Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: catboost
Q: What libraries did you use for this particular competition?
A: sklearn, pandas, numpy, KFold, lightgbm
Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have 3 years of experience and I work at Experian Peru.
Q: What advice would you give to those who did not score so well in the competition?
A: Well, keep practicing. It is sometimes frustrating when we spend hours creating variables and analyzing, and our score still does not go up, but we should not lose our calm.