Interview With The Winners Of The Data Science Competition "Predicting App Ratings In The Google Play Store"
As usual, we have taken on the task of interviewing the winners of the competition "Google Play Store Rating Prediction", which ended a few days ago. The winner was Edimer "Siderus" from Colombia, with a score of 0.698709403908066, and he has also become #1 on our general leaderboard across the 5 competitions we have run so far.

The objective of this competition was to analyze and predict the rating of mobile applications in the Google Play Store, the Android marketplace. Models were evaluated with the F1 score, because the amount of data in the classes was not symmetrical. Since we were working with an imbalanced dataset, the goal was to optimize the model to classify both classes properly and to maximize classification performance, especially on the minority class.

This competition set a participation record: 135 people joined, and we evaluated a total of 1,497 models. Many thanks to the participants, and we invite you to take part in the new competition, "Prediction of Online Shoppers Purchasing Intention".

Let's take a look at the top finishers of the competition and the answers they gave us in the interview. Let's learn from them!
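To make the evaluation metric concrete, here is a minimal sketch of computing the F1 score on an imbalanced binary problem with scikit-learn. The labels are made up for illustration, and whether the competition scored plain binary or macro-averaged F1 is not stated here.

```python
from sklearn.metrics import f1_score

# Hypothetical labels for an imbalanced problem: class 1 is the minority.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# Binary F1: harmonic mean of precision and recall for the positive class.
print(f1_score(y_true, y_pred))                   # 0.667

# Macro-averaged F1 gives the minority class the same weight as the
# majority class, which is why it suits imbalanced problems.
print(f1_score(y_true, y_pred, average="macro"))  # 0.762
```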
Rank #1 - Siderus - Colombia

Q: In general terms, how did you approach the problem posed in the competition?
A: At first I tried to understand the problem correctly, familiarizing myself with the dataset. Then I spent a lot of time building graphs, trying to find underlying patterns or anomalies in the data that would allow me to make objective decisions. Finally, I fitted three models that served as a baseline to check whether new ideas (or algorithms) performed better than these initial results.

Q: For this particular competition, did you have any previous experience in this field?
A: No, none. My field is agricultural sciences.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: Several results caught my attention. For example, an application with many reviews is not necessarily successful; however, the ratio between the number of installations and the number of reviews turned out to be the most important variable for my models. I found it interesting that free apps were more likely to be unsuccessful, and it also seems that people like apps that are constantly updated and small in size. Personally, I think the biggest problem was that the classes were imbalanced; fortunately there are tools that, using sampling with replacement, allow us to work with this type of data.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: As preprocessing I imputed missing values with the k-nearest neighbors algorithm. For the multilayer perceptron I standardized the numerical variables and transformed them with the Yeo-Johnson transformation; for tree-based algorithms (XGBoost, LightGBM or Catboost) I only imputed the data. With all algorithms I used up-sampling to balance the classes.

Q: Which Machine Learning algorithms did you use for the competition?
A: I tried many: Naive Bayes, KNN, generalized linear models with regularization, a multilayer perceptron with Keras, a Support Vector Machine with radial kernel, Random Forest, XGBoost, LightGBM, Catboost, among others.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: The three highest-scoring algorithms were LightGBM, Catboost and the multilayer perceptron; an ensemble of the three provided the best results.

Q: What libraries did you use for this particular competition?
A: All my work was in R, with tidyverse and tidymodels as the main libraries. I also used lightgbm, catboost and treesnip. The themis library was very useful for up-sampling.

Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have been working with data for about 5 years, mainly in the design and statistical analysis of agricultural experiments.

Q: What advice would you give to those who did not score so well in the competition?
A: Explore the data a lot and invest a lot of time in visualization; understanding the problem is, I think, the fundamental part of any data-driven project.
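Siderus worked in R with tidyverse/tidymodels, using themis for up-sampling. A rough Python equivalent of the preprocessing he describes (KNN imputation, a Yeo-Johnson transform for the perceptron, and up-sampling by sampling with replacement) might look like the sketch below; the toy columns are hypothetical stand-ins for the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import PowerTransformer
from sklearn.utils import resample

# Hypothetical toy frame standing in for the competition data.
df = pd.DataFrame({
    "reviews":      [10, 200, np.nan, 50, 3000, 7],
    "installs":     [100, 1000, 500, np.nan, 100000, 50],
    "rating_class": [0, 1, 0, 0, 1, 0],  # imbalanced target
})
num_cols = ["reviews", "installs"]

# 1) Impute missing values with k-nearest neighbors.
df[num_cols] = KNNImputer(n_neighbors=2).fit_transform(df[num_cols])

# 2) Yeo-Johnson transform (standardizes by default); Siderus applied
#    this step only for the multilayer perceptron.
df[num_cols] = PowerTransformer(method="yeo-johnson").fit_transform(df[num_cols])

# 3) Up-sample the minority class by sampling with replacement.
majority = df[df["rating_class"] == 0]
minority = df[df["rating_class"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["rating_class"].value_counts())
```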
Rank #2 - Pablo Lucero - Ecuador

Q: In general terms, how did you approach the problem posed in the competition?
A: First I did a basic exploratory analysis, then I built a baseline to have something to build on. Subsequently, I extracted attributes and then generated new ones. For modeling I tried different algorithms; the best results came from tree-based methods, which I optimized to improve the final score.

Q: For this particular competition, did you have any previous experience in this field?
A: Yes, in my previous job I had the opportunity to address similar problems.

Q: What important findings/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: Very briefly: free applications are the most in demand, and most successful applications support at least version 4.1. The Everyone category has the most applications on the market. One of the challenges was the generation of new attributes; I think that was the key to reaching the top positions.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: For the data processing I cleaned the text-type attributes to convert them to numerical values (Price, Installs, last update, etc.) and removed symbols or other unnecessary characters (Current Ver). The attribute engineering was based on deriving new attributes from the relationships between the App attribute and the rest, for example the number of words in the App title, or whether a Category word appears in the App title. This yielded about 20 base attributes. A logarithmic transformation was also applied to improve the distribution of certain attributes. Genetic programming was then used to obtain about 40 new attributes, giving a total set of 60.

Q: What Machine Learning algorithms did you use for the competition?
A: I tried several: SVM, RF, MLP, LightGBM, XGBoost and Catboost.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: Of all of them, LightGBM gave me the best results, so I decided to optimize its parameters for the final round.

Q: What libraries did you use for this particular competition?
A: One for genetic programming called gplearn.

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have 5 years of experience. I am currently working at a manufacturing company in the project area, leading Industry 4.0 topics.

Q: What advice would you give to those who did not score so well in the competition?
A: Review online documentation on similar problems; it helps to get a better picture of the problem. (We should not reinvent the wheel.)
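Pablo names gplearn for genetic programming. Its SymbolicTransformer evolves arithmetic formulas over the existing features, which is the most plausible way he generated the roughly 40 extra attributes; the sketch below runs on synthetic data, and every parameter is illustrative rather than his actual configuration.

```python
import numpy as np
from gplearn.genetic import SymbolicTransformer

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))    # stand-in for the ~20 base attributes
y = rng.randint(0, 2, size=200)   # stand-in binary target

# Evolve new features as formulas over the inputs, e.g. div(X0, X3).
gp = SymbolicTransformer(
    generations=10,
    population_size=500,
    n_components=10,              # number of evolved features to keep
    function_set=("add", "sub", "mul", "div", "log"),
    random_state=0,
)
X_new = gp.fit_transform(X, y)

# Stack the evolved features next to the originals.
X_full = np.hstack([X, X_new])
print(X_full.shape)               # (200, 15)
```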
Rank #3 - Fernando Chica - Ecuador

Q: In general terms, how did you approach the problem posed in the competition?
A: Initially, I performed an exploratory analysis to identify the characteristics of the data; from there I proposed possible feature extraction techniques and classification models.

Q: For this particular competition, did you have any previous experience in this field?
A: In data analysis, yes, but not for this particular problem of predicting application ratings.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The first thing you notice is that most of the variables are categorical, so from the beginning you had to think about what kind of transformation could turn them into numerical variables, since not all models can work with categorical variables. On the other hand, the main problem with this dataset (it is even mentioned in the description of the challenge) is that the amount of data in each class is not the same; it is an imbalanced dataset. The challenge was therefore to select the model or process that would address this problem and avoid overfitting.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Transformation from categorical to numerical variables, followed by data balancing tests: duplicating data from the class with fewer observations, removing data from the class with more observations, and creating synthetic data for the class with fewer observations until the classes were balanced. But there was no significant improvement in the performance of the models tested, so data balancing was not used in the final model.

Q: What Machine Learning algorithms did you use for the competition?
A: Multilayer Perceptron, linear regression, decision trees, XGBoost, LightGBM, Random Forest and Bagging.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: The one that gave me the best score was Bagging, using decision trees as base models. I think it worked better because of the data processing I did; with Bagging you can also choose the importance given to each class during training, and since the data is imbalanced this lets you regularize the model and prevent overfitting.

Q: What libraries did you use for this particular competition?
A: A variety of libraries, but in general: sklearn, numpy, pandas, matplotlib, seaborn, imblearn, datetime and keras.

Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have about 4 years of experience in Data Science and I am currently working as a researcher at a university in the field of applied artificial intelligence.

Q: What advice would you give to those who did not score so well in the competition?
A: Be very curious about what the data hides, consider strategies that may seem absurd, and look beyond what the data shows at first glance.
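Fernando describes three balancing experiments (duplicating the minority class, trimming the majority class, and creating synthetic samples) plus a final bagged-tree model that weights classes during training. Here is a minimal sketch of both ideas with imblearn and scikit-learn; the dataset is synthetic and the hyperparameters are illustrative, not his actual values.

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced dataset standing in for the competition data.
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)

# The three balancing strategies he tested:
for sampler in (RandomOverSampler(random_state=0),   # duplicate minority rows
                RandomUnderSampler(random_state=0),  # drop majority rows
                SMOTE(random_state=0)):              # synthesize minority rows
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, len(y_res))

# The final model skipped resampling and instead weighted classes
# inside bagged decision trees.
base_tree = DecisionTreeClassifier(class_weight="balanced", max_depth=8)
model = BaggingClassifier(base_tree, n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean())
```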
Rank #4 - Nicolás Dominutti - Argentina

Q: In general terms, how did you approach the problem posed in the competition?
A: After the EDA, I applied a preprocessing pipeline to obtain valuable data from the variables. Then I focused on generating new variables that would provide another perspective on the original data before entering the model selection stage.

Q: For this particular competition, did you have any previous experience in this field?
A: This is the first official competition I have taken part in; previously I did bootcamps and focused on personal ML projects.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The EDA showed that the dataset was highly imbalanced and consisted of very disparate, messy variables that demanded an interesting data processing pipeline. This analysis also revealed insights that made it possible to generate new variables that added value (e.g. apps with 0 reviews tended, almost unanimously, to have a high rating).

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I applied techniques such as extraction of relevant data via regex, creation of new variables, encoding of features treated as categorical, and standardization of numeric variables (for the algorithms that need it; the winning algorithm, being an XGBoost, did not use it). As an interesting point, since the dataset was imbalanced, I chose to perform random oversampling on the least represented class.

Q: What Machine Learning algorithms did you use for the competition?
A: I tested Logistic Regression, SVM, Random Forest, Catboost and XGBoost.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: It is not surprising that the best score was obtained with XGBoost, an algorithm already well established in worldwide competitions. It is a very powerful library based on boosting, which makes it possible to obtain strong scores.

Q: Which libraries did you use for this particular competition?
A: re, numpy, pandas, sklearn, catboost and xgboost.

Q: How many years of experience do you have in Data Science and where are you currently working?
A: It has been 2 years since I started my first Data Science courses. I am currently working at Johnson & Johnson.

Q: What advice would you give to those who did not score as well in the competition?
A: Spend time understanding the problem domain in detail, ask yourself questions about the why of the industry, and try to capture the answers and insights in the dataset.
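Several winners mention regex extraction to turn the Play Store's text columns into numbers. A small pandas sketch of that kind of cleaning follows; the raw values mirror the public Google Play dataset, but the exact rules each participant applied are not shown in the interviews.

```python
import numpy as np
import pandas as pd

# Raw values in the style of the Google Play Store dataset.
df = pd.DataFrame({
    "Installs": ["10,000+", "1,000,000+", "500+"],
    "Price":    ["0", "$4.99", "$0.99"],
    "Size":     ["19M", "201k", "Varies with device"],
})

# Installs: strip "," and "+", then cast to int.
df["installs_num"] = df["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)

# Price: strip the dollar sign, cast to float.
df["price_num"] = df["Price"].str.replace(r"\$", "", regex=True).astype(float)

# Size: convert M/k suffixes to megabytes; "Varies with device" becomes NaN.
def size_to_mb(s: str) -> float:
    if s.endswith("M"):
        return float(s[:-1])
    if s.endswith("k"):
        return float(s[:-1]) / 1024
    return np.nan

df["size_mb"] = df["Size"].map(size_to_mb)
print(df[["installs_num", "price_num", "size_mb"]])
```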
Rank #5 - Fernando Cifuentes - Colombia

Q: In general terms, how did you approach the problem posed in the competition?
A: First I had to understand the problem and the variables, and above all do a good cleaning job on them, since they were difficult to work with as they were. Then I created new variables, and after that I optimized the hyperparameters of my models to finally make the prediction.

Q: For this particular competition, did you have any previous experience in this field?
A: I have experience in classification models, which I have been working on for the last few years.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: For this case it was a challenge to work with the version variable, since it did not correspond to a decimal number (e.g. 8.1.1), and with the Android version variable, which in some cases indicated that it varied by device. The conclusion was that these variables cannot be used directly; a good cleaning job had to be done before feeding them into the model. In addition, I realized the data were imbalanced, so I used the SMOTE algorithm to obtain a balanced base through oversampling.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: For example, for the version I kept only up to its second level, i.e. 8.1. For the update date I took the maximum update date in the dataset and, against that date, calculated how many months each application had gone without an update. For the Android version I imputed the data in order to have an approximation of the Android version in the cases where no version was specified. I also created a new variable that I call the rating ratio, the number of comments over the number of downloads, which was the most important variable in my model.

Q: What Machine Learning algorithms did you use for the competition?
A: I used 3 models: Random Forest, XGBoost and LightGBM.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: An ensemble of the three models above, combined by voting. I think it got the best result because at the macro level each model had very similar metrics, yet at the individual level the predictions varied for some records, so the ensemble made a "consensus" among the three models.

Q: What libraries did you use for this particular competition?
A: The main libraries were pandas, sklearn, xgboost and lightgbm.

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I am currently working at a bank, and I have been working specifically in modeling for about three years.

Q: What advice would you give to those who did not score so well in the competition?
A: Don't get discouraged, we all start like that. Keep participating in competitions and reading forums; that's where you get the most help to improve your results.
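Fernando's voting ensemble of Random Forest, XGBoost and LightGBM maps naturally onto scikit-learn's VotingClassifier. The sketch below uses soft voting (averaging predicted probabilities) so that models disagreeing on individual records can reach the "consensus" he describes; he does not say whether he voted hard or soft, and all settings are illustrative.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Synthetic imbalanced dataset standing in for the competition data.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

# Soft voting averages the class probabilities of the three models.
ensemble = VotingClassifier(
    estimators=[
        ("rf",   RandomForestClassifier(n_estimators=300, random_state=0)),
        ("xgb",  XGBClassifier(n_estimators=300, random_state=0)),
        ("lgbm", LGBMClassifier(n_estimators=300, random_state=0)),
    ],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=5, scoring="f1_macro").mean())
```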
Rank #6 - David Villabón - Colombia

Q: In general terms, how did you approach the problem posed in the competition?
A: The first thing I did with the dataset was transform the variables that were supposed to be numerical, then feature engineering, then testing raw models by evaluating their F1 score, and finally improving the selected model!

Q: For this particular competition, did you have any previous experience in this field?
A: No, but by exploring and understanding the data I gained insights into the field.

Q: What important findings/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The data exploration clearly showed a considerable imbalance in the target "Rating", which made it a challenge to obtain good results.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: After transforming the data that I assumed was numerical but was not, I encoded the categorical variables, then removed outliers, scaled the data, selected variables, and finally applied techniques for balancing the target variable.

Q: What Machine Learning algorithms did you use for the competition?
A: I tested LogisticRegression, Perceptron, RandomForestClassifier, KNN, XGBoost, LightGBM, RUSBoostClassifier and AdaBoostClassifier.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: I chose RUSBoostClassifier, since it did not overfit.

Q: What libraries did you use for this particular competition?
A: I used pandas, numpy, matplotlib, sklearn, imblearn and xgboost.

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have been studying data science for a couple of years; currently my work is not related to Data Science.

Q: What advice would you give to those who did not score so well in the competition?
A: It is fundamental to understand the dataset, to scrutinize the data, and to know how to select the final model. I think those are some of the keys to obtaining good results.

Rank #9 - James Valencia - Peru

Q: In general terms, how did you approach the problem posed in the competition?
A: I followed the steps described in the CRISP-DM methodology. To address the particular problem of the imbalanced target, I divided the training set into three partitions, trained a different boosting model on each partition, and obtained the final prediction by evaluating the three predictions from the models.

Q: For this particular competition, did you have any previous experience in this field?
A: I participated in the previous DataSourceAI competition and also in some competitions on Kaggle.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: Preprocessing the data was necessary to obtain numerical data and identify its impact on the target. In addition, I had to investigate an approach suited to an imbalanced target: model ensembling.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I used regex to remove characters such as M (million), $ (dollar), etc. For the encoding of categorical variables I focused on the average of the target associated with each category in the analyzed column.

Q: What Machine Learning algorithms did you use for the competition?
A: Three boosting models: Catboost, XGBoost and LightGBM.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: The LightGBM model, because it is a more optimized model and works well with large amounts of previously processed data.

Q: What libraries did you use for this particular competition?
A: The classic libraries for preprocessing: pandas, scikit-learn, matplotlib, metrics, among others, plus some specific ones for the boosting models: catboost, xgboost, lightgbm.

Q: How many years of experience do you have in Data Science and where are you currently working?
A: I have two years of experience coding predictive clustering, classification and regression models in Python. In addition, due to the elections in my country (Peru), I am training natural language processing models, taking tweets from social networks as input through the tweepy and spacy libraries.

Q: What advice would you give to those who did not score so well in the competition?
A: Do your own research through tutorials on the internet. There are currently many resources on Kaggle, Analytics Vidhya, Towards Data Science and even YouTube channels (my favorite is StatQuest).
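James's encoding of categories by the average of the target per category is usually called target (mean) encoding. A minimal pandas sketch, with the means computed on training data only so the target does not leak into validation; column names are illustrative.

```python
import pandas as pd

train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY", "TOOLS", "GAME"],
    "target":   [1, 0, 1, 0, 1, 0],
})
test = pd.DataFrame({"Category": ["GAME", "FAMILY", "SOCIAL"]})

# Mean of the target per category, learned on the training set only.
means = train.groupby("Category")["target"].mean()
global_mean = train["target"].mean()

train["Category_enc"] = train["Category"].map(means)
# Unseen categories (e.g. SOCIAL) fall back to the global mean.
test["Category_enc"] = test["Category"].map(means).fillna(global_mean)
print(test)
```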
Rank #10 - Frank Diego - Peru

Q: In general terms, how did you approach the problem posed in the competition?
A: By performing an exploratory analysis of the data, cleaning the data, identifying the most significant predictor variables and testing different classification models.

Q: For this particular competition, did you have any previous experience in this field?
A: First time.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: Finding categorical variables with high cardinality, dealing with imbalanced data, identifying and removing outliers in different predictor variables, and testing various classification models.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Removing special characters and text characters from the Size, Installs and Price variables; identifying the version number of each app and the number of Android versions available for each app; using encoding techniques for categorical variables; and data normalization.

Q: What Machine Learning algorithms did you use for the competition?
A: Logistic Regression and Random Forest.

Q: Which Machine Learning algorithm gave you the best score and why do you think it performed better than the others?
A: Random Forest, because it had better scores in accuracy, precision and recall.

Q: Which libraries did you use for this particular competition?
A: pandas, sklearn, matplotlib, seaborn and scikitplot.

Q: How many years of experience do you have in Data Science and where are you currently working?
A: I've only been in the data science world for about half a year. I have taken online courses on data processing with the pandas library and on basic statistics, and followed YouTube tutorials on machine learning, which helped me apply it to this challenge. On the other hand, I have a venture in commercial intelligence for exports from Peru, which allows me to support exporting companies on the foreign trade scenario in various productive sectors.

Q: What advice would you give to those who did not score so well in the competition?
A: Deepen the exploratory analysis of the datasets to gain a better understanding of the most important characteristics that influence the target variable.

Conclusion

As we can see, each participant tested different models, among which boosting models stand out, and each experienced different approaches to solving the problem. We hope you have drawn your own conclusions; you can share them with us in the comments. We are waiting for you in the competition that is currently active, and maybe you could be one of the TOP 10 interviewees of the next competition!

Join Competition

Many thanks to all the participants and to the winners who helped us with the survey!

PS: we are growing our data scientist discussion forum on Slack at the following link; join and participate.

Daniel Morales
February 2, 2021