Learning (or refreshing) math is probably one of the biggest worries of those starting out in data science.
Image by DataSource.ai
Let’s be honest: most people didn’t do very well in math in school, maybe not even in college, and that is intimidating; it creates a barrier for those who want to explore this discipline called data science.
A few days ago I published a post in Towards Data Science and right here on our blog called “Study Plan for Learning Data Science Over the Next 12 Months”, where I gave some quarterly recommendations and placed an emphasis on studying mathematics and statistics in the first quarter. I received many questions about exactly which materials I recommended. Well, this post answers those questions. But before that, I want to give you some context.
Leaving aside the factors or reasons that have led most people to hate math, it is a reality that we need it in data science. For me, one of the biggest shortcomings of mathematics was its apparent lack of applicability in the real world: I saw no reason for intermediate and advanced mathematics, such as multivariate calculus. I confess that in school and college I didn’t like math for that reason, even though I always did well and earned scores above the majority (especially in statistics). But I still didn’t see how I could use a derivative or a matrix in the real world. I eventually ended up as a software engineer, and once I entered the world of data science I was finally able to make the connection between mathematics, statistics, and the real world.
On the other hand, it is important to clarify that we do not need a master’s degree in pure mathematics to do data science projects. As I mentioned in previous posts, there is a big debate in the community about how much math we need to do a good job as data scientists.
We could say that data science is divided into two major fields of work: research and production.
By research, we mean research and development, which normally takes place within a large company (usually a tech company), within an organization focused on cutting-edge technological problems (such as medical research), or within universities. This sector has very limited job offers.
The great advantage is deep knowledge of algorithms and their implementations, as well as the ability to create variations of existing algorithms to improve them, or even to create new machine learning algorithms.
The disadvantage is the impractical nature of the work. It is very theoretical work, in which often the only objective is to publish papers, and it is generally far from business use cases. For reference, I recently read this post on Reddit, which I recommend.
By production, we refer to the practical side of this discipline, where in your day-to-day job you will generally use libraries such as scikit-learn, TensorFlow, Keras, PyTorch, and others. These libraries operate like a black box: you feed in data and you get an output, but you don’t know in detail what happened in the process. This also has its advantages and disadvantages, but it certainly makes life much easier when putting useful models into production. What I don’t recommend is using them blindly, without the minimal mathematical foundations needed to understand their fundamentals. That is the objective of this post: to guide you and recommend some valuable resources so that you have the necessary foundations and do not operate those libraries blindly.
So if you decide to focus on research and development, you are going to need mathematics and statistics in depth (very in depth). If you are going to go for the practical side, the libraries will handle most of it for you, under the hood. It should be noted that most job offers are on the practical side.
Well, after the previous remarks, it is time to define the specific topics needed to build an initial basis in mathematics for data science.
Linear Algebra: This subject is important for building the fundamentals of working with data in vector and matrix form, acquiring the skills to solve systems of linear algebraic equations, and learning the basic matrix decompositions and a general understanding of their applicability.
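To give a taste of how these skills look in code, here is a minimal NumPy sketch (the numbers are made up purely for illustration) that solves a small linear system and computes one of the basic matrix decompositions:

import numpy as np

#solve the linear system Ax = b
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  #[2. 3.]

#singular value decomposition, one of the basic matrix decompositions
U, s, Vt = np.linalg.svd(A)
print(s)  #the singular values of A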
Calculus: Here it is important to study functional maps, limits (for sequences and for functions of one and several variables), differentiation (from a single variable to multiple variables), and integration, thus sequentially building a foundation for basic optimization. It is also important here to study gradient descent.
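As a small illustration of why gradient descent belongs on this list, here is a hand-rolled sketch (toy function and step size chosen just for this example) that minimizes f(x) = (x - 3)^2 using its derivative f'(x) = 2(x - 3):

#gradient descent on f(x) = (x - 3)^2
x = 0.0              #starting point
learning_rate = 0.1
for step in range(100):
    gradient = 2 * (x - 3)   #derivative of f at the current point
    x = x - learning_rate * gradient
print(x)  #approaches 3, the minimizer of f

This same loop, generalized to many variables and noisy gradients, is essentially what machine learning libraries run when they train a model with gradient descent.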
Probability theory: Here you should learn about random variables, i.e. variables whose values are determined by a random experiment. Random variables are used as a model for the data generation processes we want to study. The properties of the data are deeply linked to the corresponding properties of the random variables, such as expected value, variance, and correlation.
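Purely as an illustration, the sketch below simulates a simple random variable (a fair die roll) with NumPy and estimates its expected value and variance from samples:

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100000)  #a fair six-sided die

print(rolls.mean())  #close to the expected value 3.5
print(rolls.var())   #close to the theoretical variance 35/12, about 2.92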
Note: these subjects run much deeper than what I just mentioned; this is simply a guide to the subjects and resources recommended for approaching mathematics in the field of data science.
Now that we have a better idea of the path we should take, let’s examine the recommended resources to address these topics. We will divide them into basic, intermediate, and advanced. Among the advanced ones, we’ll have resources focused on deep learning.
Basics: in this first section we’ll recommend resources on the mathematical basics: mathematical thinking, algebra, and how to implement math with Python.
1- Introduction to mathematical thinking
Price: Free
Image by Coursera
Description: Learn how to think the way mathematicians do — a powerful cognitive process developed over thousands of years.
Mathematical thinking is not the same as doing mathematics — at least not as mathematics is typically presented in our school system. School math typically focuses on learning procedures to solve highly stereotyped problems. Professional mathematicians think a certain way to solve real problems, problems that can arise from the everyday world, or from science, or from within mathematics itself. The key to success in school math is to learn to think inside-the-box. In contrast, a key feature of mathematical thinking is thinking outside-the-box — a valuable ability in today’s world. This course helps to develop that crucial way of thinking.
2- Mathematical Foundation for AI and Machine Learning
Price: $46.99 USD
Image by Packt
Description: Artificial Intelligence has gained importance in the last decade, with a lot depending on the development and integration of AI in our daily lives. The progress that AI has already made is astounding, with innovations like self-driving cars, medical diagnosis, and even beating humans at strategy games like Go and Chess. The future for AI is extremely promising, and it isn’t far from when we have our own robotic companions. This has pushed a lot of developers to start writing code and developing AI and ML programs. However, learning to write algorithms for AI and ML isn’t easy and requires extensive programming and mathematical knowledge. Mathematics plays an important role, as it builds the foundation for programming in these two streams. And in this course, we’ve covered exactly that. We designed a complete course to help you master the mathematical foundation required for writing programs and algorithms for AI and ML.
Description: In Math for Programmers you’ll explore important mathematical concepts through hands-on coding. Filled with graphics and more than 300 exercises and mini-projects, this book unlocks the door to interesting–and lucrative!–careers in some of today’s hottest fields. As you tackle the basics of linear algebra, calculus, and machine learning, you’ll master the key Python libraries used to turn them into real-world software applications.
Description: You can learn a lot of math with a bit of coding!
Many people don’t know that Python is a really powerful tool for learning math. Sure, you can use Python as a simple calculator, but did you know that Python can help you learn more advanced topics in algebra, calculus, and matrix analysis? That’s exactly what you’ll learn in this course.
This course is a perfect supplement to your school/university math course, or for your post-school return to mathematics.
Let me guess what you are thinking:
“But I don’t know Python!” That’s okay! This course is aimed at complete beginners; I take you through every step of the code. You don’t need to know anything about Python, although it’s useful if you already have some programming experience.
“But I’m not good at math!” You will be amazed at how much better you can learn math by using Python as a tool to help with your courses or your independent study. And that’s exactly the point of this course: Python programming as a tool to learn mathematics. This course is designed to be the perfect addition to any other math course or textbook that you are going through.
7- Introduction to Linear Models and Matrix Algebra
Price: Free
Image by Edx
Description: Matrix algebra underlies many of the current tools for experimental design and the analysis of high-dimensional data. In this introductory online course in data analysis, we will use matrix algebra to represent the linear models that are commonly used to model differences between experimental units, and we will perform statistical inference on these differences. Throughout the course we will use the R programming language to perform matrix operations.
Given the diversity in the educational backgrounds of our students, we have divided the series into seven parts. You can take the entire series or the individual courses that interest you. If you are a statistician, you should consider skipping the first two or three courses; similarly, if you are a biologist, you should consider skipping some of the introductory biology lectures. Note that the statistics and programming aspects of the class ramp up in difficulty relatively quickly across the first three courses. You will need to know some basic statistics for this course. By the third course we will be teaching advanced statistical concepts, such as hierarchical models, and by the fourth, advanced software engineering skills, such as parallel computing and reproducible research concepts.
Description: Python, one of the world’s most popular programming languages, has a number of powerful packages to help you tackle complex mathematical problems in a simple and efficient way. These core capabilities help programmers pave the way for building exciting applications in various domains, such as machine learning and data science, using knowledge in the computational mathematics domain.
The book teaches you how to solve problems faced in a wide variety of mathematical fields, including calculus, probability, statistics and data science, graph theory, optimization, and geometry. You’ll start by developing core skills and learning about packages covered in Python’s scientific stack, including NumPy, SciPy, and Matplotlib. As you advance, you’ll get to grips with more advanced topics of calculus, probability, and networks (graph theory). After you gain a solid understanding of these topics, you’ll discover Python’s applications in data science and statistics, forecasting, geometry, and optimization. The final chapters will take you through a collection of miscellaneous problems, including working with specific data formats and accelerating code.
By the end of this book, you’ll have an arsenal of practical coding solutions that can be used and modified to solve a wide range of practical problems in computational mathematics and data science.
Description: Behind numerous standard models and constructions in Data Science there is mathematics that makes things work. It is important to understand it to be successful in Data Science. In this specialisation we will cover a wide range of mathematical tools and see how they arise in Data Science. We will cover such crucial fields as Discrete Mathematics, Calculus, Linear Algebra and Probability. To make your experience more practical, we accompany the mathematics with examples and problems arising in Data Science and show how to solve them in Python.
Each course of the specialisation ends with a project that gives you an opportunity to see how the material of the course is used in Data Science. Each project is directed at solving a practical problem in Data Science. In particular, in your projects you will analyse social graphs, predict real estate prices, and uncover hidden relations in the data.
Description: Discrete mathematics is a field of math that deals with studying finite and distinct elements. The theories and principles of discrete math are widely used in solving complexities and building algorithms in computer science and computing data in data science. It helps you to understand algorithms, binary, and general mathematics that is commonly used in data-driven tasks.
Learn Discrete Mathematics is a comprehensive introduction for those who are new to the mathematics of countable objects. This book will help you get up-to-speed with implementing discrete math principles to take your programming skills to another level. You’ll learn the discrete math language and methods crucial to studying and describing objects and functions in branches of computer science and machine learning. Complete with real-world examples, the book covers the internal workings of memory and CPUs, analyzes data for useful patterns, and shows you how to solve problems in network routing, encryption, and data science.
By the end of this book, you’ll have a deeper understanding of discrete mathematics and its applications in computer science, and get ready to work on real-world algorithm development and machine learning.
14- Math for Data Science and Machine Learning: University Level
Price: $12.99
Image by Udemy
Description: In this course we will learn math for data science and machine learning. We will also discuss the importance of math for data science and machine learning in the practical world. Moreover, this course is a bundle of two courses: linear algebra, and probability and statistics. So, students will learn the complete contents of probability and statistics and of linear algebra; it is not the case that you will be unable to cover all the contents in this 7-hour video course. This is a beautiful course, and I have designed it according to the needs of the students.
Linear algebra and probability and statistics are usually offered to students of data science, machine learning, Python, and IT. That is why I have prepared this dual course for students from different sciences.
I have taught this course multiple times in my university classes. It is usually offered in two different modes: linear algebra as one 100-mark paper and probability and statistics as another 100-mark paper, in two different semesters or in the same semester. I usually focus on methods and examples while teaching this course. Examples clarify the concepts for students in a variety of ways: they can grasp the main idea the instructor wants to deliver even if they find the methods of the subject or topic difficult. So, focusing on examples makes the course easy and understandable for the students.
Description: Data science courses contain math — no avoiding that! This course is designed to teach learners the basic math you will need in order to be successful in almost any data science math course and was created for learners who have basic math skills but may not have taken algebra or pre-calculus. Data Science Math Skills introduces the core math that data science is built upon, with no extra complexity, introducing unfamiliar ideas and math symbols one-at-a-time.
Learners who complete this course will master the vocabulary, notation, concepts, and algebra rules that all data scientists must know before moving on to more advanced material.
Advanced: in this last section we will focus on the statistical part (probability theory) and the application of mathematics to deep learning algorithms.
Description: Inferential statistics allows us to draw conclusions from data that might not be immediately obvious. This course focuses on enhancing your ability to develop hypotheses and use common tests such as t-tests, ANOVA tests, and regression to validate your claims.
18- Statistical Methods and Applied Mathematics in Data Science
Price: $124.99
Image by Packt
Description: Machine learning and data analysis are the center of attraction for many engineers and scientists. The reason is quite obvious: their vast application in numerous fields and booming career options. And Python is one of the leading open source platforms for data science and numerical computing. IPython, and its associated Jupyter Notebook, provide Python with efficient interfaces for data analysis and interactive visualization, and they constitute an ideal gateway to the platform. If you are among those seeking to enhance their capabilities in machine learning, then this course is the right choice.
Statistical Methods and Applied Mathematics in Data Science provides many easy-to-follow, ready-to-use, and focused recipes for data analysis and scientific computing. This course tackles data science, statistics, machine learning, signal and image processing, dynamical systems, and pure and applied mathematics. You will apply state-of-the-art methods to various real-world examples, illustrating topics in applied mathematics, scientific modeling, and machine learning. In short, you will be well versed with the standard methods in data science and mathematical modeling.
19- Exploring Math for Programmers and Data Scientists
Price: Free
Image by Manning
Description: Exploring Math for Programmers and Data Scientists showcases chapters from three Manning books, chosen by author and master-of-math Paul Orland. You’ll start with a look at the nearest neighbor search problem, common with multidimensional data, and walk through a real-world solution for tackling it. Next, you’ll delve into a set of methods and techniques integral to Principal Component Analysis (PCA), an underlying technique in Latent Semantic Analysis (LSA) for document retrieval. In the last chapter, you’ll work with digital audio data, using mathematical functions in different and interesting ways. Begin sharpening your competitive edge with the fun and fascinating math in this (free!) practical guide!
Description: Most programmers and data scientists struggle with mathematics, having either overlooked or forgotten core mathematical concepts. This book uses Python libraries to help you understand the math required to build deep learning (DL) models.
You’ll begin by learning about core mathematical and modern computational techniques used to design and implement DL algorithms. This book will cover essential topics, such as linear algebra, eigenvalues and eigenvectors, the singular value decomposition concept, and gradient algorithms, to help you understand how to train deep neural networks. Later chapters focus on important neural networks, such as the linear neural network and multilayer perceptrons, with a primary focus on helping you learn how each model works. As you advance, you will delve into the math used for regularization, multi-layered DL, forward propagation, optimization, and backpropagation techniques to understand what it takes to build full-fledged DL models. Finally, you’ll explore CNN, recurrent neural network (RNN), and GAN models and their application.
By the end of this book, you’ll have built a strong foundation in neural networks and DL mathematical concepts, which will help you to confidently research and build custom models in DL.
Description: Math and Architectures of Deep Learning sets out the foundations of DL in a way that’s both useful and accessible to working practitioners. Each chapter explores a new fundamental DL concept or architectural pattern, explaining the underpinning mathematics and demonstrating how they work in practice with well-annotated Python code. You’ll start with a primer of basic algebra, calculus, and statistics, working your way up to state-of-the-art DL paradigms taken from the latest research. By the time you’re done, you’ll have the combined theoretical insight and practical skills to identify and implement DL architectures for almost any real-world challenge.
When we have limited time to study, we should select the resources that we feel best about and that fit our style. For example, you might prefer videos over books; go ahead and choose whatever suits you best. This material is sufficient whether you want to take a brief look at the mathematics or to go deeper into it. I hope you find it useful.
If you have other recommendations for courses, books or videos, please leave them in the comments so that we can all create links of interest.
Note: we are building a private Slack community of data scientists; if you want to join us you can register here: https://www.datasource.ai/en#slack
I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn.
Introduction

This article is intended for data scientists who may consider using deep learning algorithms and want to know more about the cons of implementing these types of models in their work. Deep learning algorithms have many benefits, are powerful, and can be fun to show off. However, there are a few times when you should avoid them. I will be discussing those times when you should stop using deep learning below, so keep on reading if you would like a deeper dive into deep learning.

When You Want to Easily Explain

Photo by Malte Helmhold on Unsplash [2].

Because other algorithms have been around longer, they have countless amounts of documentation, including examples and functions that make interpretability easier; the same goes for explanations of how those algorithms themselves work. Deep learning can be intimidating to data scientists for this reason as well: it can be a turn-off to use a deep learning algorithm when you are unsure of how to explain it to a stakeholder.

Here are 3 examples of when you would have trouble explaining deep learning:

* When you want to describe the top features of your model — the features become hidden inputs, so you will not know what caused a certain prediction to happen, and if you need to prove to stakeholders or customers why a certain output was achieved, it can be more of a black box
* When you want to tune your hyperparameters, like learning rate and batch size
* When you want to explain how the algorithm itself works — for example, if you were to present the algorithm itself to stakeholders, they might get lost, because even a simplified approach is still difficult to understand

Here are 3 examples of how you could explain those same situations with non-deep-learning algorithms:

* When you want to explain your top features, you can easily access SHAP libraries; for the algorithm CatBoost, once your model is fitted, you can simply make a summary plot from feat = model.get_feature_importance() and then use summary_plot() to rank the features by feature name, so that you can present a nice plot to stakeholders (and yourself, for that matter)

Example of ranked SHAP output from a non-deep learning model [3].

* As a solution, some other algorithms have made it plenty easy to tune your hyperparameters by randomized grid search or a more structured, set grid search method. There are even some algorithms that tune themselves, so you do not have to worry about complicated tuning
* Explaining how other algorithms work can be a lot easier; with decision trees, for example, you can easily show a yes-or-no, 0/1 chart with a simple answer for the features that lead to a prediction: yes it is raining, yes it is winter, therefore yes it is going to be cold

Overall, deep learning algorithms are useful and powerful, so there is definitely a time and place for them, but there are other algorithms that you can use instead, as we will discuss below.

When You Can Use Other Algorithms

Photo by Luca Bravo on Unsplash [4].

To be frank, there are a few go-to algorithms that can give you a great model with great results rather quickly. Some of these algorithms include Linear Regression, Decision Trees, Random Forest, XGBoost, and CatBoost.
These are simpler alternatives. Here are examples of why you would want to use a non-deep-learning algorithm, because you have so many other, simpler options:

* They can be easier and faster to set up; for example, deep learning can require you to add sequential, dense layers to your model and compile it, which can be more complex and take longer than simply instantiating a regressor or classifier and fitting it with non-deep-learning algorithms
* I personally find more errors resulting from this more complex deep learning code, and the documentation on how to fix them can be confusing, or old and no longer applicable; using an algorithm like Random Forest instead can come with much more documentation on errors that is easy to understand
* Training a deep learning algorithm may not be complicated sometimes, but when predicting from an endpoint, it might be confusing how to feed it values to predict on, whereas with some models you can simply pass the values as an encoded list of ordered values

I would say that you can of course try out deep learning algorithms, but before you do, it might be best to start with a simpler solution. It can depend on things like how often you will train and make predictions, or whether it is a one-off task. There are some other reasons why you would not want to use a deep learning algorithm, like when you have a small dataset and a small budget, as we will discuss below.

When You Have a Small Dataset and Budget

Photo by Hello I’m Nik on Unsplash [5].

Oftentimes, you can be working as a data scientist at a smaller company, or perhaps at a startup. In these cases, you would not have much data and you might not have a big budget. You would, therefore, try to avoid the use of deep learning algorithms. Sometimes you can even have a small dataset of just a few thousand rows and few features; in that case you could simply run an alternative model locally instead, rather than spending a lot of money by serving a model frequently.

Here is when you should second-guess using a deep learning algorithm based on costs and data availability:

* Small data availability is usually the case for a lot of companies (but not always), and deep learning performs better on information with a lot of data
* You might be performing a one-off task, where the model only predicts one time — and you can run it locally for free (not all models will be running in production frequently), like a simple Decision Tree Classifier. It might not be worth investing time in a deep learning model.
* Your company is interested in data science applications but wants to keep the budget small, so rather than performing costly executions of a deep learning model, it uses a tree-based model with early stopping rounds to prevent overfitting, shorten training time, and ultimately reduce costs

There have been times when I brought up deep learning and it was shot down for a variety of reasons, usually these. But I do not want to dissuade anyone from using deep learning completely, as it is something you should use at some point in your career, and it can be something you do frequently or mainly, depending on the circumstances and where you are working.

Summary

Overall, before you dive deep into deep learning, realize that there are some times when you should avoid using it for a variety of reasons. There are, of course, more reasons for avoiding it, but there are also reasons for using it too.
It is ultimately up to you to look at the pros and cons of deep learning yourself. Here are three times/reasons when you should not use deep learning:

* When You Want to Easily Explain
* When You Can Use Other Algorithms
* When You Have a Small Dataset and Budget

I hope you found my article both interesting and useful. Please feel free to comment down below if you agree or disagree with these reasons for avoiding deep learning. Why or why not? What other reasons do you think there are for avoiding deep learning as a data scientist? These can certainly be clarified even further, but I hope I was able to shed some light on deep learning. Thank you for reading!

I am not affiliated with any of these companies.

Please feel free to check out my profile, Matt Przybyla, and other articles, as well as subscribe to receive email notifications for my blogs by following the link below, or by clicking on the subscribe icon at the top of the screen by the follow icon, and reach out to me on LinkedIn if you have any questions or comments.

Subscribe link: https://datascience2.medium.com/subscribe

References

[1] Photo by Nadine Shaabana on Unsplash, (2018)
[2] Photo by Malte Helmhold on Unsplash, (2021)
[3] M. Przybyla, Example of ranked SHAP output from a non-deep learning model, (2021)
[4] Photo by Luca Bravo on Unsplash, (2016)
[5] Photo by Hello I’m Nik on Unsplash, (2021)
This article contains some of the most commonly used advanced statistical concepts along with their Python implementation.

In my previous articles, Beginners Guide to Statistics in Data Science and The Inferential Statistics Data Scientists Should Know, we talked about almost all the basics (descriptive and inferential) of statistics commonly used in understanding and working with any data science case study. In this article, let’s go a little beyond that and talk about some advanced concepts that are not part of the buzz.

Concept #1 - Q-Q (quantile-quantile) Plots

Before understanding Q-Q plots, first understand what a quantile is: a quantile defines a particular part of a data set, i.e. a quantile determines how many values in a distribution are above or below a certain limit. Special quantiles are the quartile (quarter), the quintile (fifth), and percentiles (hundredth).

An example: if we divide a distribution into four equal portions, we speak of four quartiles. The first quartile includes all values that are smaller than a quarter of all values. In a graphical representation, it corresponds to 25% of the total area of the distribution. The two lower quartiles comprise 50% of all distribution values. The interquartile range between the first and third quartile equals the range in which 50% of all values lie, distributed around the mean.

In statistics, a Q-Q (quantile-quantile) plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight (y = x).

Q-Q plot

For example, the median is a quantile where 50% of the data fall below that point and 50% lie above it. The purpose of Q-Q plots is to find out whether two sets of data come from the same distribution. A 45-degree reference line is plotted on the Q-Q plot; if the two data sets come from a common distribution, the points will fall on that reference line.

It is very important to know whether a distribution is normal or not, so as to apply various statistical measures to the data and interpret it in much more human-understandable visualizations, and that is where the Q-Q plot comes into the picture. The most fundamental question answered by a Q-Q plot is whether the curve is normally distributed or not.

Normally distributed, but why? Q-Q plots are used to find the type of distribution of a random variable, whether it is a Gaussian distribution, uniform distribution, exponential distribution, or even a Pareto distribution, etc. You can tell the type of distribution just by looking at the plot. In general, we talk about normal distributions only because we have the very beautiful 68-95-99.7 rule, which fits the normal distribution perfectly, so we know how much of the data lies within one, two, and three standard deviations of the mean. Knowing that a distribution is normal opens up new doors for us to experiment with.

Types of Q-Q plots. Source

Skewed Q-Q plots

Q-Q plots can find skewness (a measure of asymmetry) of the distribution.
If the bottom end of the Q-Q plot deviates from the straight line but the upper end does not, then the distribution is left skewed (negatively skewed). If the upper end of the Q-Q plot deviates from the straight line and the lower end does not, then the distribution is right skewed (positively skewed).

Tailed Q-Q plots

Q-Q plots can find kurtosis (a measure of tailedness) of the distribution. A distribution with fat tails will have both ends of the Q-Q plot deviating from the straight line while its center follows the line, whereas a thin-tailed distribution will produce a Q-Q plot with very little or negligible deviation at the ends, making it a close fit to the normal distribution.

Q-Q plots in Python (Source)

Suppose we have the following dataset of 1,000 values:

import numpy as np
#create dataset with 1,000 values that follow a normal distribution
np.random.seed(0)
data = np.random.normal(0,1, 1000)
#view first 10 values
data[:10]

array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
       -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])

To create a Q-Q plot for this dataset, we can use the qqplot() function from the statsmodels library:

import statsmodels.api as sm
import matplotlib.pyplot as plt
#create Q-Q plot with 45-degree line added to plot
fig = sm.qqplot(data, line='45')
plt.show()

In a Q-Q plot, the x-axis displays the theoretical quantiles. This means it doesn’t show your actual data, but instead represents where your data would be if it were normally distributed. The y-axis displays your actual data. This means that if the data values fall along a roughly straight line at a 45-degree angle, then the data is normally distributed.

We can see in our Q-Q plot above that the data values tend to closely follow the 45-degree line, which means the data is likely normally distributed. This shouldn’t be surprising, since we generated the 1,000 data values using the numpy.random.normal() function.

Consider instead if we generated a dataset of 1,000 uniformly distributed values and created a Q-Q plot for that dataset:

#create dataset of 1,000 uniformly distributed values
data = np.random.uniform(0,1, 1000)
#generate Q-Q plot for the dataset
fig = sm.qqplot(data, line='45')
plt.show()

The data values clearly do not follow the red 45-degree line, which is an indication that they do not follow a normal distribution.

Concept #2 - Chebyshev’s Inequality

In probability, Chebyshev’s Inequality, also known as the “Bienayme-Chebyshev” Inequality, guarantees that, for a wide class of probability distributions, only a definite fraction of values will be found within a specific distance from the mean of the distribution.

Source: https://www.thoughtco.com/chebyshevs-inequality-3126547

Chebyshev’s inequality is similar to the empirical rule (68-95-99.7); however, the latter rule only applies to normal distributions. Chebyshev’s inequality is broader: it can be applied to any distribution, so long as the distribution has a defined variance and mean.

So Chebyshev’s inequality says that at least (1 - 1/k^2) of the data from a sample must fall within k standard deviations of the mean (or, equivalently, no more than 1/k^2 of the distribution’s values can be more than k standard deviations away from the mean), where k is any positive real number.

If the data is not normally distributed, then different amounts of data could lie within one standard deviation of the mean. Chebyshev’s inequality provides a way to know what fraction of data falls within k standard deviations of the mean for any data distribution.

Also read: 22 Statistics Questions to Prepare for Data Science Interviews

Credits: https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/

Chebyshev’s inequality is of great value because it can be applied to any probability distribution in which the mean and variance are provided.

Let us consider an example. Assume 1,000 contestants show up for a job interview, but there are only 70 positions available. In order to select the finest 70 contestants among them, the proprietor gives tests to judge their potential. The mean score on the test is 60, with a standard deviation of 6. If an applicant scores an 84, can they presume that they are getting the job? A score of 84 is k = (84 - 60) / 6 = 4 standard deviations above the mean, and by Chebyshev’s inequality at most 1/k^2 = 1/16 of the 1,000 contestants, about 63 people, can be that far from the mean. So with 70 positions available, a contestant who scores an 84 can be assured they got the job.

Chebyshev’s Inequality in Python (Source)

Create a population of 1,000,000 values. I use a gamma distribution (it also works with other distributions) with shape = 2 and scale = 2.

import numpy as np
import random
import matplotlib.pyplot as plt
#create a population with a gamma distribution
shape, scale = 2., 2. #mean=4, std=2*sqrt(2)
mu = shape*scale #mean
sigma = scale*np.sqrt(shape) #standard deviation
s = np.random.gamma(shape, scale, 1000000)

Now sample 10,000 values from the population.

#sample 10000 values
rs = random.choices(s, k=10000)

Count the samples whose distance from the expected value is larger than k standard deviations, and use the count to calculate the probabilities. I want to depict the trend of the probabilities as k increases, so I use a range of k from 0.1 to 3.

#set k
ks = [0.1,0.5,1.0,1.5,2.0,2.5,3.0]
#probability list
probs = []

#for each k
for k in ks:
    #start count
    c = 0
    for i in rs:
        #count if farther from the mean than k standard deviations
        if abs(i - mu) > k * sigma:
            c += 1
    probs.append(c/10000)

Plot the results:

plot = plt.figure(figsize=(20,10))
#plot each probability
plt.xlabel('K')
plt.ylabel('probability')
plt.plot(ks,probs, marker='o')
plot.show()
#print each probability
print("Probability of a sample far from mean more than k standard deviation:")
for i, prob in enumerate(probs):
    print("k:" + str(ks[i]) + ", probability: " \
        + str(prob)[0:5] + \
        " | in theory, probability should be less than: " \
        + str(1/ks[i]**2)[0:5])

From the above plot and result, we can see that as k increases, the probability decreases, and the probability for each k obeys the inequality. Moreover, only the case where k is larger than 1 is useful: if k is less than 1, the right-hand side of the inequality is larger than 1, which is not informative because a probability can never be larger than 1.

Concept #3 - Log-Normal Distribution

In probability theory, a log-normal distribution, also known as Galton’s distribution, is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, i.e. X = exp(Y), has a log-normal distribution. Skewed distributions with low mean, high variance, and all-positive values fit under this type of distribution. A random variable that is log-normally distributed takes only positive real values.

The general formula for the probability density function of the lognormal distribution is:

f(x) = exp(-(ln((x - θ) / m))^2 / (2σ^2)) / ((x - θ) σ √(2π)),   for x > θ

The shape of the lognormal distribution is defined by 3 parameters:

* σ is the shape parameter (and is the standard deviation of the log of the distribution)
* θ or μ is the location parameter (and is the mean of the distribution)
* m is the scale parameter (and is also the median of the distribution)

The location and scale parameters are equivalent to the mean and standard deviation of the logarithm of the random variable, as explained above. If x = θ, then f(x) = 0. The case where θ = 0 and m = 1 is called the standard lognormal distribution. The case where θ equals zero is called the 2-parameter lognormal distribution.

The following graph illustrates the effect of the location (μ) and scale (σ) parameters on the probability density function of the lognormal distribution:

Source: https://www.sciencedirect.com/topics/mathematics/lognormal-distribution

Log-Normal Distribution in Python (Source)

Let us consider an example: generate random numbers from a log-normal distribution with μ=1 and σ=0.5 using the scipy.stats.lognorm function.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm
np.random.seed(42)
data = lognorm.rvs(s=0.5, loc=1, scale=1000, size=1000)
plt.figure(figsize=(10,6))
ax = plt.subplot(111)
plt.title('Generate random numbers from a Log-normal distribution')
ax.hist(data, bins=np.logspace(0,5,200), density=True)
ax.set_xscale("log")
shape,loc,scale = lognorm.fit(data)
x = np.logspace(0, 5, 200)
pdf = lognorm.pdf(x, shape, loc, scale)
ax.plot(x, pdf, 'y')
plt.show()

Concept #4 - Power Law Distribution

In statistics, a power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities: one quantity varies as a power of another. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four.

A power law distribution has the form Y = k X^α, where:

* X and Y are variables of interest,
* α is the law’s exponent,
* k is a constant.

Source: https://en.wikipedia.org/wiki/Power_law

The power-law distribution is just one of many probability distributions, but it is considered a valuable tool for assessing uncertainty issues that the normal distribution cannot handle when they occur at a certain probability.

Many processes have been found to follow power laws over substantial ranges of values: the distribution of incomes, the sizes of meteoroids, earthquake magnitudes, the spectral density of weight matrices in deep neural networks, word usage, the number of neighbors in various networks, etc. (Note: the power law here is a continuous distribution. The last two examples are discrete, but on a large scale they can be modeled as if continuous.)

Also read: Statistical Measures of Central Tendency

Power-law distribution in Python (Source)

Let us plot the Pareto distribution, which is one form of power-law probability distribution. The Pareto distribution is sometimes known as the Pareto Principle or ‘80-20’ rule, as the rule states that 80% of society’s wealth is held by 20% of its population. The Pareto distribution is not a law of nature but an observation. It is useful in many real-world problems. It is a skewed, heavily tailed distribution.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto
x_m = 1 #scale
alpha = [1, 2, 3] #list of values of shape parameters
plt.figure(figsize=(10,6))
samples = np.linspace(start=0, stop=5, num=1000)
for a in alpha:
    output = np.array([pareto.pdf(x=samples, b=a, loc=0, scale=x_m)])
    plt.plot(samples, output.T, label='alpha {0}'.format(a))
plt.xlabel('samples', fontsize=15)
plt.ylabel('PDF', fontsize=15)
plt.title('Probability Density function', fontsize=15)
plt.legend(loc='best')
plt.show()

Concept #5 - Box-Cox Transformation

The Box-Cox transformation transforms our data so that it closely resembles a normal distribution. In many statistical techniques, we assume that the errors are normally distributed. This assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming your target variable, we can (hopefully) normalize our errors (if they are not already normal). Additionally, transforming our variables can improve the predictive power of our models, because transformations can cut away white noise.

Original distribution (left) and near-normal distribution after applying the Box-Cox transformation. Source

At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the “optimal value” is the one that results in the best approximation of a normal distribution curve.

The one-parameter Box-Cox transformations are defined as:

y(λ) = (y^λ - 1) / λ,   if λ ≠ 0
y(λ) = log(y),          if λ = 0

and the two-parameter Box-Cox transformations as:

y(λ) = ((y + λ2)^λ1 - 1) / λ1,   if λ1 ≠ 0
y(λ) = log(y + λ2),              if λ1 = 0

Moreover, the one-parameter Box-Cox transformation holds for y > 0, i.e. only for positive values, and the two-parameter Box-Cox transformation for y > -λ2, i.e. it also accommodates negative values. The parameter λ is estimated using the profile likelihood function and goodness-of-fit tests.

If we talk about some drawbacks of the Box-Cox transformation: if interpretation is what you want to do, then Box-Cox is not recommended, because if λ is some non-zero number, the transformed target variable may be more difficult to interpret than if we simply applied a log transform. A second stumbling block is that the Box-Cox transformation usually gives the median of the forecast distribution when we revert the transformed data to its original scale. Occasionally, we want the mean and not the median.

Box-Cox Transformation in Python (Source)

SciPy’s stats package provides a function called boxcox for performing the Box-Cox power transformation. It takes the original non-normal data as input and returns the fitted data along with the lambda value that was used to fit the non-normal distribution to the normal distribution.

#load necessary packages
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox
import seaborn as sns
#make this example reproducible
np.random.seed(0)
#generate dataset
data = np.random.exponential(size=1000)
fig, ax = plt.subplots(1, 2)
#plot the distribution of data values
sns.distplot(data, hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Non-Normal", color="red", ax=ax[0])
#perform Box-Cox transformation on original data
transformed_data, best_lambda = boxcox(data)
sns.distplot(transformed_data, hist=False, kde=True,
             kde_kws={'shade': True, 'linewidth': 2},
             label="Normal", color="red", ax=ax[1])
#adding legends to the subplots
plt.legend(loc = "upper right")
#rescaling the subplots
fig.set_figheight(5)
fig.set_figwidth(10)
#display optimal lambda value
print(f"Lambda value used for Transformation: {best_lambda}")
Concept #6 - Poisson Distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event.

In very simple terms, a Poisson distribution can be used to estimate how likely it is that something will happen “X” number of times. Some examples of Poisson processes are customers calling a help center, radioactive decay in atoms, visitors to a website, photons arriving at a space telescope, and movements in a stock price. Poisson processes are usually associated with time, but they do not have to be.

The formula for the Poisson distribution is:

P(X = k) = (λ^k * e^(-λ)) / k!

Where:

* e is Euler’s number (e = 2.71828...)
* k is the number of occurrences
* k! is the factorial of k
* λ is equal to the expected value of k, which is also equal to its variance

Lambda (λ) can be thought of as the expected number of events in the interval. As we change the rate parameter, λ, we change the probability of seeing different numbers of events in one interval. The below graph is the probability mass function of the Poisson distribution, showing the probability of a number of events occurring in an interval with different rate parameters.

Probability mass function for the Poisson distribution with varying rate parameters. Source

The Poisson distribution is also commonly used to model financial count data where the tally is small and often zero. For one example, in finance, it can be used to model the number of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2, etc. As another example, this model can be used to predict the number of “shocks” to the market that will occur in a given time period, say over a decade.

Poisson Distribution in Python

from numpy import random
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
lam_list = [1, 4, 9] #list of Lambda values
plt.figure(figsize=(10,6))
samples = np.linspace(start=0, stop=5, num=1000)
for lam in lam_list:
    sns.distplot(random.poisson(lam=lam, size=10), hist=False, label='lambda {0}'.format(lam))
plt.xlabel('Poisson Distribution', fontsize=15)
plt.ylabel('Frequency', fontsize=15)
plt.legend(loc='best')
plt.show()

As λ becomes bigger, the graph looks more like a normal distribution.

I hope you have enjoyed reading this article. If you have any questions or suggestions, please leave a comment.

Also read: False Positives vs. False Negatives

Feel free to connect with me on LinkedIn for any query. Thanks for reading!!!

References

https://calcworkshop.com/joint-probability-distribution/chebyshev-inequality/
https://corporatefinanceinstitute.com/resources/knowledge/data-analysis/chebyshevs-inequality/
https://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm
https://www.statology.org/q-q-plot-python/
https://gist.github.com/chaipi-chaya/9eb72978dbbfd7fa4057b493cf6a32e7
https://stackoverflow.com/a/41968334/7175247
Predictive models have become a trusted advisor to many businesses, and for good reason. These models can “foresee the future”, and there are many different methods available, meaning any industry can find one that fits its particular challenges.

When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). In classification problems, we use two types of algorithms (depending on the kind of output they create):

* Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms that can convert these class outputs to probabilities.
* Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, Adaboost, etc. give probability outputs. Converting probability outputs to class outputs is just a matter of creating a threshold probability.

Introduction

While data preparation and training a machine learning model is a key step in the machine learning pipeline, it’s equally important to measure the performance of this trained model. How well the model generalizes on unseen data is what defines adaptive vs. non-adaptive machine learning models.

By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data. Without a proper evaluation of the ML model using different metrics, depending only on accuracy can lead to problems when the model is deployed on unseen data, and can result in poor predictions. This happens because, in cases like these, our models don’t learn but instead memorize; hence, they cannot generalize well on unseen data.

Model Evaluation Metrics

Let us now define the evaluation metrics for evaluating the performance of a machine learning model, which is an integral component of any data science project. They aim to estimate the generalization accuracy of a model on future (unseen/out-of-sample) data.

Confusion Matrix

A confusion matrix is a matrix representation of the prediction results of any binary testing that is often used to describe the performance of the classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

Confusion matrix with 2 class labels.

Each prediction can be one of four outcomes, based on how it matches up to the actual value:

* True Positive (TP): Predicted True and True in reality.
* True Negative (TN): Predicted False and False in reality.
* False Positive (FP): Predicted True and False in reality.
* False Negative (FN): Predicted False and True in reality.

Now let us understand this concept using hypothesis testing.

A Hypothesis is speculation or a theory based on insufficient evidence that lends itself to further testing and experimentation. With further testing, a hypothesis can usually be proven true or false.

A Null Hypothesis is a hypothesis that says there is no statistical significance between the two variables in the hypothesis.
It is the hypothesis that the researcher is trying to disprove. We would reject the null hypothesis when it is false, and we would accept the null hypothesis when it is indeed true.

Even though hypothesis tests are meant to be reliable, there are two types of errors that can occur. These errors are known as Type I and Type II errors. For example, when examining the effectiveness of a drug, the null hypothesis would be that the drug does not affect a disease.

Type I Error: equivalent to False Positives (FP). The first kind of error involves the rejection of a null hypothesis that is true. Let’s go back to the example of a drug being used to treat a disease. If we reject the null hypothesis in this situation, then we claim that the drug does have some effect on the disease. But if the null hypothesis is true, then, in reality, the drug does not combat the disease at all. The drug is falsely claimed to have a positive effect on the disease.

Type II Error: equivalent to False Negatives (FN). The other kind of error occurs when we accept a false null hypothesis; this sort of error is called a Type II error and is also referred to as an error of the second kind. If we think back again to the scenario in which we are testing a drug, what would a Type II error look like? A Type II error would occur if we accepted that the drug has no effect on the disease, but in reality, it did.

A sample Python implementation of the confusion matrix:

import warnings
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
%matplotlib inline

#ignore warnings
warnings.filterwarnings('ignore')

#load the iris dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
df = pd.read_csv(url)
X = df.iloc[:,0:4]
y = df.iloc[:,4]

#test size
test_size = 0.33
#generate the same set of random numbers
seed = 7

#split data into train and test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)

#train model
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

#construct the confusion matrix
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(y_test, pred, labels=labels)
print(cm)

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm)
plt.title('Confusion matrix')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()

Confusion matrix with 3 class labels.

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal was mislabeled by the classifier. Therefore, the higher the diagonal values of the confusion matrix, the better, indicating many correct predictions.

In our case, the classifier predicted all 13 setosa and 18 virginica plants in the test data perfectly. However, it incorrectly classified 4 of the versicolor plants as virginica.

There is also a list of rates that are often computed from a confusion matrix for a binary classifier:

1. Accuracy

Overall, how often is the classifier correct?

Accuracy = (TP + TN) / total

When our classes are roughly equal in size, we can use accuracy, which will give us the proportion of correctly classified values. Accuracy is a common evaluation metric for classification problems.
It’s the number of correct predictions made as a ratio of all predictions made.

Misclassification Rate (Error Rate): Overall, how often is it wrong? Since accuracy is the percentage we correctly classified (success rate), it follows that our error rate (the percentage we got wrong) can be calculated as follows:

Misclassification Rate = (FP + FN) / total

We use the sklearn module to compute the accuracy of a classification task, as shown below.

#import modules
import warnings
import pandas as pd
import numpy as np
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.metrics import accuracy_score

#ignore warnings
warnings.filterwarnings('ignore')

#load the iris dataset
iris = datasets.load_iris()

#create feature matrix
X = iris.data
#create target vector
y = iris.target

#test size
test_size = 0.33
#generate the same set of random numbers
seed = 7

#cross-validation settings (shuffle so that random_state takes effect)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

#model instance
model = LogisticRegression()

#evaluate model performance
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print('Accuracy - val set: %.2f%% (%.2f)' % (results.mean()*100, results.std()))

#split data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)

#fit model
model.fit(X_train, y_train)

#accuracy on test set
result = model.score(X_test, y_test)
print("Accuracy - test set: %.2f%%" % (result*100.0))

The classification accuracy is 88% on the validation set.

2. Precision

When it predicts yes, how often is it correct?

Precision = TP / predicted yes

When we have a class imbalance, accuracy can become an unreliable metric for measuring our performance. For instance, if we had a 99/1 split between two classes, A and B, where the rare event, B, is our positive class, we could build a model that was 99% accurate by just saying everything belonged to class A. Clearly, we shouldn’t bother building a model if it doesn’t do anything to identify class B; thus, we need different metrics that will discourage this behavior. For this, we use precision and recall instead of accuracy.

3. Recall or Sensitivity

When it’s actually yes, how often does it predict yes?

True Positive Rate = TP / actual yes

Recall gives us the true positive rate (TPR), which is the ratio of true positives to everything positive. In the case of the 99/1 split between classes A and B, the model that classifies everything as A would have a recall of 0% for the positive class, B (precision would be undefined: 0/0). Precision and recall provide a better way of evaluating model performance in the face of a class imbalance. They will correctly tell us that the model has little value for our use case.

Just like accuracy, both precision and recall are easy to compute and understand, but they require thresholds. Besides, precision and recall only consider half of the confusion matrix.

4. F1 Score

The F1 score is the harmonic mean of the precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

An F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. Why the harmonic mean? Since the harmonic mean of a list of numbers skews strongly toward the smallest elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones. An F1 score punishes extreme values more.
Ideally, the F1 score is an effective evaluation metric in the following classification scenarios:

- When FP and FN are roughly equally costly, so that missed true positives and false alarms hurt the model in about the same way, as in our cancer-detection example
- When adding more data doesn't effectively change the outcome
- When TN is high (as with flood predictions, cancer predictions, etc.)

A sample Python implementation of the F1 score:

```python
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

warnings.filterwarnings('ignore')

# Load the Pima Indians diabetes dataset (no header row in the file)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7

# Split data, fit the model, and predict on the test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)
pred = model.predict(X_test)

# precision: tp / (tp + fp)
precision = precision_score(y_test, pred)
print('Precision: %f' % precision)

# recall: tp / (tp + fn)
recall = recall_score(y_test, pred)
print('Recall: %f' % recall)

# f1: 2 * tp / (2 * tp + fp + fn)
f1 = f1_score(y_test, pred)
print('F1 score: %f' % f1)
```

5. Specificity

When it's actually no, how often does it predict no?

True Negative Rate = TN/actual no

Specificity is the true negative rate: the proportion of true negatives to everything that should have been classified as negative. Note that, together, specificity and sensitivity consider the full confusion matrix.
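As a quick sketch (reusing the hypothetical binary labels from the earlier confusion matrix example), both rates fall out of the four cells directly:

```python
from sklearn.metrics import confusion_matrix

# Same hypothetical binary labels as in the earlier sketch
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # true negative rate: TN / actual no
sensitivity = tp / (tp + fn)  # true positive rate: TP / actual yes
print(specificity, sensitivity)  # 0.75 0.75
```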
6. Receiver Operating Characteristic (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model. By plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve lets us visualize the trade-off between the true positive rate and the false positive rate. In a ROC plot, a dashed diagonal line represents random guessing (no predictive value) and serves as the baseline; anything below it is worse than guessing. We want the curve to bend toward the top-left corner: a perfect classifier's ROC curve goes straight up the Y-axis and then along the X-axis.

A sample Python implementation of the ROC curve:

```python
# Classification area under the ROC curve
import warnings
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

warnings.filterwarnings('ignore')

# Load the Pima Indians diabetes dataset (no header row in the file)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7

# Split data and fit the model
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict probabilities and keep those for the positive class only
probs = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, probs)
print('AUC - Test Set: %.2f%%' % (auc * 100))

# Calculate and plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot([0, 1], [0, 1], linestyle='--')  # no-skill baseline
plt.plot(fpr, tpr, marker='.')
plt.xlabel('False positive rate')
plt.ylabel('Sensitivity / Recall')
plt.show()
```

In this example, the AUC is relatively close to 1 and well above 0.5, so the model has real discriminating power.

Log Loss

Log loss is the most important classification metric based on probabilities. It measures the performance of a classification model whose output is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label; as the predicted probability of the true class approaches zero, the loss grows without bound. The goal of any machine learning model is to minimize this value, so a smaller log loss is better, with a perfect model having a log loss of 0.

A sample Python implementation of log loss:

```python
# Classification log loss
import warnings
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

warnings.filterwarnings('ignore')

# Load the Pima Indians diabetes dataset (no header row in the file)
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
dataframe = pandas.read_csv(url, header=None)
dat = dataframe.values
X = dat[:, :-1]
y = dat[:, -1]

test_size = 0.33
seed = 7

# Split data and fit the model
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, y_train)

# Log loss must be computed on predicted probabilities, not hard labels
probs = model.predict_proba(X_test)
print("Logloss: %.2f" % log_loss(y_test, probs))
```

Be careful to pass predicted probabilities rather than the hard 0/1 labels from model.predict: with hard labels, every misclassified example counts as a maximally confident mistake, which inflates the log loss dramatically (to around 8 on this dataset).
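For a binary problem, the formula is LogLoss = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) ), where y is the true label and p is the predicted probability of the positive class. A quick sketch with made-up labels and probabilities shows the hand computation agreeing with sklearn:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Log loss by hand: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(manual)                    # ~0.26
print(log_loss(y_true, p_pred))  # same value
```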
Jaccard Index

The Jaccard index is one of the simplest ways to calculate the accuracy of a classification model. Let's understand it with an example. Suppose we have a labeled test set:

y = [0,0,0,0,0,1,1,1,1,1]

And our model has predicted the labels:

y1 = [1,1,0,0,0,1,1,1,1,1]

A Venn diagram of the two label sets would show their intersection and union. The Jaccard index, or Jaccard similarity coefficient, is a statistic used to understand the similarity between sample sets. It is formally defined as the size of the intersection divided by the size of the union of the two labeled sets:

Jaccard Index (Intersection over Union, IoU) = size of intersection / size of union

In our example, the intersection of the two sets is 8 (eight values are predicted correctly) and the union is 10 + 10 - 8 = 12, so the Jaccard index gives an accuracy of 8/12 = 0.66, or 66%. The higher the Jaccard index, the higher the accuracy of the classifier.

A sample Python implementation of the Jaccard index:

```python
import numpy as np

def compute_jaccard_similarity_score(x, y):
    # Jaccard index = |intersection| / |union| of the two sets
    intersection_cardinality = len(set(x).intersection(set(y)))
    union_cardinality = len(set(x).union(set(y)))
    return intersection_cardinality / float(union_cardinality)

score = compute_jaccard_similarity_score(np.array([0, 1, 2, 5, 6]),
                                         np.array([0, 2, 3, 5, 7, 9]))
print("Jaccard Similarity Score : %s" % score)
```

Jaccard Similarity Score : 0.375

Kolmogorov-Smirnov Chart

The K-S or Kolmogorov-Smirnov chart measures the performance of classification models. More precisely, K-S is a measure of the degree of separation between the positive and negative distributions. The cumulative frequencies of the observed and hypothesized distributions are plotted against the ordered frequencies, and the maximal vertical difference between the two curves is the K-S statistic.

The K-S is 100 if the scores partition the population into two separate groups, one containing all the positives and the other all the negatives. If, on the other hand, the model cannot differentiate between positives and negatives, it is as if it selected cases randomly from the population, and the K-S would be 0. In most classification models the K-S falls between 0 and 100; the higher the value, the better the model is at separating positive from negative cases.

The K-S test may also be used to test whether two underlying one-dimensional probability distributions differ. It is a very efficient way to determine whether two samples are significantly different from each other.

A sample Python implementation of the Kolmogorov-Smirnov test:

```python
from scipy.stats import kstest
import random

# Generate N random outcomes in [0, 1)
N = 10
actual = [random.random() for _ in range(N)]
print(actual)

# Test the sample against the standard normal distribution
x = kstest(actual, "norm")
print(x)
```

The null hypothesis used here assumes that the numbers follow the normal distribution. The test returns the K-S statistic and a p-value; if the p-value is less than alpha, we reject the null hypothesis. Alpha is defined as the probability of rejecting the null hypothesis when the null hypothesis (H0) is in fact true; for most practical applications, alpha is chosen as 0.05.
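As a sketch of the two-sample use case (with synthetic scores standing in for a model's outputs on the positive and negative classes), scipy's ks_2samp compares the two distributions directly:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
scores_pos = rng.normal(0.7, 0.1, 500)  # synthetic scores of positives
scores_neg = rng.normal(0.4, 0.1, 500)  # synthetic scores of negatives

# The K-S statistic is the maximal gap between the two cumulative distributions
stat, p_value = ks_2samp(scores_pos, scores_neg)
print(stat, p_value)  # large statistic, tiny p-value: well-separated classes
```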
Gain and Lift Chart

Gain and lift measure the effectiveness of a classification model as the ratio between the results obtained with the model and without it. Gain and lift charts are visual aids for evaluating the performance of classification models; in contrast to the confusion matrix, which evaluates the model on the whole population, a gain or lift chart evaluates model performance on a portion of the population. The higher the lift (i.e., the further the curve sits above the baseline), the better the model.

For example, a gains chart run on a validation set might show that with 50% of the data the model already captures 90% of the targets, and that adding more data yields only a negligible increase in the percentage of targets captured.

Lift charts are often shown in cumulative form, which is also known as a gains chart. Gains charts are therefore sometimes (perhaps confusingly) called "lift charts", but they are more accurately cumulative lift charts. One of their most common uses is in marketing, to decide whether a prospective client is worth calling.
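There is no dedicated gains-chart function in scikit-learn, but the underlying computation is short. A minimal sketch, assuming hypothetical labels and predicted probabilities: sort the population by score, accumulate the positives, and read off the fraction captured at any depth:

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)

# Sort the population by predicted probability, highest scores first
order = np.argsort(y_prob)[::-1]

# Cumulative share of all positives captured as we move down the ranking
cum_gains = np.cumsum(y_true[order]) / y_true.sum()

# Gain and lift in the top 20% of the population
depth = int(0.2 * len(y_true))
print(cum_gains[depth - 1])        # gain at 20% depth
print(cum_gains[depth - 1] / 0.2)  # lift at 20% depth (random baseline = 1.0)
```

Plotting cum_gains against population depth gives the gains curve; dividing it pointwise by the depth gives the lift curve.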
Gini Coefficient

The Gini coefficient, or Gini index, is a popular metric for imbalanced class problems. The coefficient ranges from 0 to 1, where 0 represents perfect equality and 1 represents perfect inequality; the higher the index, the more dispersed the data. The Gini coefficient can be computed from the area under the ROC curve using the following formula:

Gini Coefficient = (2 * AUC) - 1

Conclusion

Understanding how well a machine learning model will perform on unseen data is the ultimate purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models on balanced datasets, but if the data is imbalanced and there is a class disparity, other methods like the ROC curve, AUC, and the Gini coefficient do a better job of evaluating model performance.

Well, this concludes this article. I hope you have enjoyed reading it; feel free to share your comments, thoughts, and feedback in the comment section. Thanks for reading!

In our fast-paced digital world, we're producing staggering volumes of data every day. This data falls into two key categories: structured data, known for its order and efficiency, and unstructured data, a captivating puzzle brimming with untapped potential. In this article, we will uncover how AI confronts the complexities of unstructured data, the hurdles it faces, and the intriguing opportunities it opens up for businesses in every industry.

Understanding Unstructured Data

Unstructured data mining is the technique of extracting valuable and meaningful insights from the abundant well of unstructured data. It uncovers hidden gems of knowledge, making it a crucial pursuit in our data-rich era. In today's digital realm, unstructured data is generated in unprecedented quantities: billions of text documents, images, and videos come to life daily, creating a treasure trove of information just waiting for organizations to explore. Unlocking the insights hidden within unstructured data can give organizations a competitive edge, revealing customer sentiments, emerging trends, and valuable feedback that might otherwise go unnoticed.

The Basics of Data Mining

Data mining works by discovering patterns, trends, and valuable information within a dataset, using various techniques to extract knowledge from raw data. While it is exceptionally effective with structured data, applying data mining to unstructured data requires a unique set of skills and tools.

Unstructured Data Mining

Unstructured data mining focuses on extracting valuable information from the vast amounts of unstructured data available. This process uncovers hidden insights, making it a valuable endeavor in today's data-driven world.

The AI Revolution

The AI revolution has opened an exciting era of possibilities in unstructured data mining. AI's remarkable capabilities are instrumental in taming the unstructured data landscape, and they involve a multitude of components, including:

- Machine learning enables AI systems to learn from data, make predictions, and identify patterns, enhancing data mining capabilities.
- Deep learning uses neural networks to model complex patterns in unstructured data, which is particularly valuable in image and speech recognition.
- Sentiment analysis gauges the emotional tone of textual data, helping to understand public opinion and tailor strategies.
- Pattern recognition identifies recurring structures in data, aiding image processing and text mining.
- Knowledge graphs structure the relationships in data, improving contextual understanding and data retrieval.
- Anomaly detection identifies outliers in data, which is essential for fraud detection and data security.

Challenges in Unstructured Data Mining

As promising as AI is at handling unstructured data, it is not without its challenges. Here, we delve into some of the major hurdles:

Data Quality

Unstructured data is inherently messy: it is laden with errors, inconsistencies, and biases, which makes it a challenge to extract meaningful insights. AI systems need to be trained rigorously to navigate and decipher this diversity in data quality. Techniques like data cleansing, normalization, and the use of context are essential to ensure that AI systems produce accurate results.

Scalability

As the volume of unstructured data grows, AI systems must scale to handle the influx effectively; traditional hardware and algorithms might not be sufficient.
Scalable infrastructure and distributed computing become crucial to ensuring that AI systems can process and analyze vast amounts of data efficiently.

Privacy Concerns

Mining unstructured data often raises ethical questions about privacy and data protection, which is why it is essential to strike the right balance between data utilization and respect for individual privacy. The challenge is to ensure that AI systems are used responsibly and in compliance with data protection laws and regulations, such as the GDPR in Europe. Techniques like anonymization and consent management play a vital role in addressing these privacy concerns.

Opportunities and Applications

AI's role in unstructured data mining has opened up a world of opportunities across various industries. Let's explore some of the most promising applications:

Customer Insights

Unstructured data, particularly from social media and customer reviews, serves as a goldmine of information on customer behavior and preferences. By leveraging AI algorithms, companies can analyze sentiment, spot emerging trends, and even forecast future buying patterns. With these insights, they can fine-tune their marketing strategies, product development, and customer service to align with their ever-evolving audience's demands.

Healthcare Diagnosis

The abundance of unstructured data found in medical records, radiological images, and wearable-device data holds the key to transformative advances. AI-powered systems, known for their proficiency in analyzing this data, not only facilitate early disease detection but also support highly individualized treatment plans, ultimately raising the standard of patient care. For example, AI can speed up the search for anomalies in medical images, significantly reducing the time required to diagnose and treat severe conditions.

Fraud Detection

For financial institutions, AI is a vital tool for exposing fraudulent activities that often hide within vast volumes of unstructured transaction data. Through a meticulous examination of transaction patterns and anomalies, AI systems can rapidly pinpoint fraudulent actions, giving businesses a robust defense against significant financial losses. The ability to detect and thwart fraud in real time is a critical advantage, saving businesses billions of dollars every year.

Conclusion

The future belongs to those who embrace the AI revolution in unstructured data mining. In that future, data isn't just information; it's the key to success. So let's move forward and embrace a tomorrow where the possibilities are limitless and the opportunities endless.