Real-life data is usually messy. It requires a lot of preprocessing before it is ready for use. Pandas, one of the most widely used data analysis and manipulation libraries, offers several functions to preprocess raw data. In this article, we will focus on one particular function that organizes multiple preprocessing operations into a single one: the pipe function.

When it comes to software tools and packages, I learn best by working through examples. I keep this in mind when creating content, and I will do the same in this article. Let's start by creating a data frame with mock data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
    "B": [45, 56, 48, 47, 62, 112, 54, 49],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
})
df
```

(image by author)

Our data frame contains some missing values indicated by a standard missing value representation (i.e. NaN). The id column includes duplicate values. Last but not least, 112 in column B seems like an outlier.

These are some of the typical issues in real-life data. We will be creating a pipe that handles the issues we have just described.

For each task, we need a function. Thus, the first step is to create the functions that will be placed in the pipe. It is important to note that the functions used in the pipe need to take a data frame as an argument and return a data frame.

The first function handles the missing values.

```python
def fill_missing_values(df):
    # replace NaN in each numerical column with the column mean
    for col in df.select_dtypes(include=["int", "float"]).columns:
        val = df[col].mean()
        df[col].fillna(val, inplace=True)
    return df
```

I prefer to replace the missing values in the numerical columns with the mean value of the column. Feel free to customize this function. It will work in the pipe as long as it takes a data frame as an argument and returns a data frame.

The second function will help us remove the duplicate values.

```python
def drop_duplicates(df, column_name):
    df = df.drop_duplicates(subset=column_name)
    return df
```

I got some help from the built-in drop_duplicates function of Pandas. It eliminates the duplicate values in the given column or columns. In addition to the data frame, this function also takes a column name as an argument. We can pass such additional arguments to the pipe as well.

Also read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

The last function in the pipe will be used for eliminating the outliers.

```python
def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # note: newer pandas versions expect inclusive="both" instead of True
        df = df[df[col].between(low, high, inclusive=True)]
    return df
```

What this function does is as follows:

- It takes a data frame and a list of columns
- For each column in the list, it calculates the mean and standard deviation
- It calculates a lower and upper bound using the mean and standard deviation
- It removes the values that are outside the range defined by the lower and upper bound

Just like the previous functions, you can choose your own way of detecting outliers.

We now have 3 functions, each handling a data preprocessing task. The next step is to create a pipe with these functions.

```python
df_processed = (df.
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A", "B"]))
```

This pipe executes the functions in the given order. We can pass the arguments to the pipe along with the function names.
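For comparison, the same three steps could be written as plain nested function calls. Here is a minimal sketch using the functions defined above:

```python
# equivalent to the pipe above, but the steps read inside-out
# and get harder to follow as the number of steps grows
df_processed = remove_outliers(
    drop_duplicates(fill_missing_values(df), "id"),
    ["A", "B"]
)
```

The pipe version reads top to bottom in the order the steps are applied, which is its main readability benefit.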
One thing to mention here is that some functions in the pipe modify the original data frame. Thus, using the pipe as indicated above will update df as well. One option to overcome this issue is to use a copy of the original data frame in the pipe. If you do not care about keeping the original data frame as is, you can just use it in the pipe directly. I will update the pipe as below:

```python
my_df = df.copy()
df_processed = (my_df.
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A", "B"]))
```

Let's take a look at the original and processed data frames:

df (image by author)

df_processed (image by author)

Conclusion

You can, of course, accomplish the same tasks by applying these functions separately. However, the pipe function offers a structured and organized way to combine several functions into a single operation.

Depending on the raw data and the tasks, the preprocessing may include more steps. You can add as many steps as you need in the pipe function. As the number of steps increases, the syntax stays cleaner with the pipe function compared to executing the functions separately.

Thank you for reading. Please let me know if you have any feedback.

Also read:
- Using Python And Pandas Datareader to Analyze Financial Data
- Using Pandas Profiling to Accelerate Our Exploratory Analysis
- Pandas Essentials For Data Science
Mar 02, 2021
Time series data consists of data points attached to sequential time stamps. Daily sales, hourly temperature values, and second-level measurements in a chemical process are some examples of time series data.

Time series data has different characteristics than ordinary tabular data. Thus, time series analysis has its own dynamics and can be considered a separate field. There are books of over 500 pages that cover time series analysis concepts and techniques in depth.

Pandas was created by Wes McKinney to provide an efficient and flexible tool to work with financial data, which is a kind of time series. In this article, we will go over 4 Pandas functions that can be used for time series analysis.

We need data for the examples. Let's start by creating our own time series data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2020-05-01", periods=100, freq="D"),
    "temperature": np.random.randint(18, 30, size=100) + np.random.random(100).round(1)
})
df.head()
```

(image by author)

We have created a data frame that contains temperature measurements during a period of 100 days. The date_range function of Pandas can be used for generating a date range with a customized frequency. The temperature values are generated randomly using NumPy functions.

We can now start on the functions.

1. Shift

It is a common operation to shift time series data. We may need to make a comparison between lagged or lead features. In our data frame, we can create a new feature that contains the temperature of the previous day.

```python
df["temperature_lag_1"] = df["temperature"].shift(1)
df.head()
```

(image by author)

The scalar value passed to the shift function indicates the number of periods to shift. The first row of the new column is filled with NaN because there is no previous value for the first row.

The fill_value parameter can be used for filling the missing values with a scalar. Let's replace the NaN with the average value of the temperature column.

```python
df["temperature_lag_1"] = df["temperature"]\
    .shift(1, fill_value=df.temperature.mean())
df.head()
```

(image by author)

If you are interested in future values, you can shift backwards by passing negative values to the shift function. For instance, "-1" brings the temperature of the next day.

2. Resample

Another common operation performed on time series data is resampling. It involves changing the frequency of the periods. For instance, we may be interested in weekly temperature data rather than daily measurements.

The resample function creates groups (or bins) of a specified interval. Then, we can apply aggregation functions to the groups to calculate the value at the resampled frequency.

Let's calculate the average weekly temperatures. The first step is to resample the data to week level. Then, we will apply the mean function to calculate the average.

```python
df_weekly = df.resample("W", on="date").mean()
df_weekly.head()
```

(image by author)

The first parameter specifies the frequency for resampling. "W" stands for week, surprisingly. If the data frame does not have a datetime index, the column that contains the date or time related information needs to be passed to the on parameter.
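The mean is just one option; the resampled groups accept any aggregation. As a minimal sketch (assuming the same df created above), the weekly minimum and maximum temperatures could be computed like this:

```python
# select the temperature column from the resampler and apply several aggregations at once
df_weekly_range = df.resample("W", on="date")["temperature"].agg(["min", "max"])
df_weekly_range.head()
```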
3. Asfreq

The asfreq function provides a different technique for resampling. It returns the value at the end of the specified interval. For instance, asfreq("W") returns the value on the last day of each week.

In order to use the asfreq function, we should set the date column as the index of the data frame.

```python
df.set_index("date").asfreq("W").head()
```

(image by author)

Since we are getting the value at a specific day, it is not necessary to apply an aggregation function.

4. Rolling

The rolling function can be used for calculating the moving average, which is a highly common operation for time series data. It creates a window of a particular size. Then, we can use this window to make calculations as it rolls through the data points.

The figure below explains the concept of rolling.

(image by author)

Let's create a rolling window of 3 and use it to calculate the moving average.

```python
df.set_index("date").rolling(3).mean().head()
```

(image by author)

For any day, the values show the average of that day and the previous 2 days. The values of the first 3 days are 18.9, 23.8, and 19.9. Thus, the moving average on the third day is the average of these values, which is 20.7.

The first 2 values are NaN because they do not have 2 previous values. We can also use this rolling window to cover the previous and next day for any given day. It can be done by setting the center parameter to True.

```python
df.set_index("date").rolling(3, center=True).mean().head()
```

(image by author)

The values of the first 3 days are 18.9, 23.8, and 19.9. Thus, the moving average on the second day is the average of these 3 values. In this setting, only the first value is NaN because we only need 1 previous value.

Conclusion

We have covered 4 Pandas functions that are commonly used in time series analysis. Predictive analytics is an essential part of data science. Time series analysis is at the core of many problems that predictive analytics aims to solve. Hence, if you plan to work on predictive analytics, you should definitely learn how to handle time series data.

Thank you for reading. Please let me know if you have any feedback.

Soner Yıldırım
Mar 02, 2021
In this article, we're going to explore some lesser-known but very useful pandas methods for manipulating Series objects. Some of these methods are related only to Series; the others apply to both Series and DataFrames, having, however, specific features when used with each structure type.

1. is_unique

As its name suggests, this method checks if all the values of a Series are unique:

```python
import pandas as pd

print(pd.Series([1, 2, 3, 4]).is_unique)
print(pd.Series([1, 2, 3, 1]).is_unique)

# Output:
# True
# False
```

2 & 3. is_monotonic and is_monotonic_decreasing

With these 2 methods, we can check if the values of a Series are in ascending/descending order:

```python
# note: in pandas 2.0+, is_monotonic was removed; use is_monotonic_increasing instead
print(pd.Series([1, 2, 3, 8]).is_monotonic)
print(pd.Series([1, 2, 3, 1]).is_monotonic)
print(pd.Series([9, 8, 4, 0]).is_monotonic_decreasing)

# Output:
# True
# False
# True
```

Both methods also work for a Series with string values. In this case, Python uses lexicographical ordering under the hood, comparing two subsequent strings character by character. It's not the same as plain alphabetical ordering, and actually, the example with the numeric data above is a particular case of such an ordering. As the Python documentation says,

Lexicographical ordering for strings uses the Unicode code point number to order individual characters.

In practice, it mainly means that letter case and special symbols are also taken into account:

```python
print(pd.Series(['fox', 'koala', 'panda']).is_monotonic)
print(pd.Series(['FOX', 'Fox', 'fox']).is_monotonic)
print(pd.Series(['*', '&', '_']).is_monotonic)

# Output:
# True
# True
# False
```

A curious exception happens when all the values of a Series are the same. In this case, both methods return True:

```python
print(pd.Series([1, 1, 1, 1, 1]).is_monotonic)
print(pd.Series(['fish', 'fish']).is_monotonic_decreasing)

# Output:
# True
# True
```

Also Read: 4 Must-Know Python Pandas Functions for Time Series Analysis

4. hasnans

This method checks if a Series contains NaN values:

```python
import numpy as np

print(pd.Series([1, 2, 3, np.nan]).hasnans)
print(pd.Series([1, 2, 3, 10, 20]).hasnans)

# Output:
# True
# False
```

5. empty

Sometimes, we might want to know if a Series is completely empty, not containing even NaN values:

```python
print(pd.Series().empty)
print(pd.Series(np.nan).empty)

# Output:
# True
# False
```

A Series can become empty after some manipulations with it, for example, filtering:

```python
s = pd.Series([1, 2, 3])
s[s > 3].empty

# Output:
# True
```

6 & 7. first_valid_index() and last_valid_index()

These 2 methods return the index of the first/last non-NaN value and are particularly useful for Series objects with many NaNs:

```python
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).first_valid_index())
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).last_valid_index())

# Output:
# 2
# 4
```

If all the values of a Series are NaN, both methods return None:

```python
print(pd.Series([np.nan, np.nan, np.nan]).first_valid_index())
print(pd.Series([np.nan, np.nan, np.nan]).last_valid_index())

# Output:
# None
# None
```

8. truncate()

This method allows truncating a Series before and after some index value. Let's truncate the Series from the previous section, leaving only the non-NaN values:

```python
s = pd.Series([np.nan, np.nan, 1, 2, 3, np.nan])
s.truncate(before=2, after=4)

# Output:
# 2    1.0
# 3    2.0
# 4    3.0
# dtype: float64
```

The original index of the Series was preserved. We may want to reset it and also assign the truncated Series to a variable:

```python
s_truncated = s.truncate(before=2, after=4).reset_index(drop=True)
print(s_truncated)

# Output:
# 0    1.0
# 1    2.0
# 2    3.0
# dtype: float64
```

Also Read: Pandas vs SQL. When Data Scientists Should Use One Over the Other
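The index methods from sections 6 & 7 pair nicely with truncate(). Here is a minimal sketch (reusing the Series s defined above) that strips the leading and trailing NaNs without hard-coding the bounds:

```python
# use the first and last non-NaN positions as the truncation bounds
s_trimmed = s.truncate(before=s.first_valid_index(),
                       after=s.last_valid_index())
print(s_trimmed)

# Expected output:
# 2    1.0
# 3    2.0
# 4    3.0
# dtype: float64
```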
9. convert_dtypes()

As the pandas documentation says, this method is used to

Convert columns to best possible dtypes using dtypes supporting pd.NA.

If we consider only Series objects and not DataFrames, the main application of this method is to convert nullable integers (i.e. float numbers with a decimal part equal to 0, such as 1.0, 2.0, etc.) back to "normal" integers. Such float numbers appear when the original Series contains both integers and NaN values. Since NaN is a float in NumPy and pandas, any Series with missing values becomes of float type as well.

Let's take a look at the example from the previous section to see how it works:

```python
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]))
print('\n')
print(pd.Series([np.nan, np.nan, 1, 2, 3, np.nan]).convert_dtypes())

# Output:
# 0    NaN
# 1    NaN
# 2    1.0
# 3    2.0
# 4    3.0
# 5    NaN
# dtype: float64
#
# 0    <NA>
# 1    <NA>
# 2       1
# 3       2
# 4       3
# 5    <NA>
# dtype: Int64
```

10. clip()

We can clip all the values of a Series at input thresholds (the lower and upper parameters):

```python
s = pd.Series(range(1, 11))
print(s)
s_clipped = s.clip(lower=2, upper=7)
print(s_clipped)

# Output:
# 0     1
# 1     2
# 2     3
# 3     4
# 4     5
# 5     6
# 6     7
# 7     8
# 8     9
# 9    10
# dtype: int64
#
# 0    2
# 1    2
# 2    3
# 3    4
# 4    5
# 5    6
# 6    7
# 7    7
# 8    7
# 9    7
# dtype: int64
```

11. rename_axis()

In the case of a Series object, this method sets the name of the index:

```python
s = pd.Series({'flour': '300 g', 'butter': '150 g', 'sugar': '100 g'})
print(s)
s = s.rename_axis('ingredients')
print(s)

# Output:
# flour     300 g
# butter    150 g
# sugar     100 g
# dtype: object
#
# ingredients
# flour     300 g
# butter    150 g
# sugar     100 g
# dtype: object
```

12 & 13. nsmallest() and nlargest()

These 2 methods return the smallest/largest elements of a Series. By default, they return 5 values, in ascending order for nsmallest() and in descending order for nlargest().

```python
s = pd.Series([3, 2, 1, 100, 200, 300, 4, 5, 6])
s.nsmallest()

# Output:
# 2    1
# 1    2
# 0    3
# 6    4
# 7    5
# dtype: int64
```

It's possible to specify another number of smallest/largest values to be returned. Also, we may want to reset the index and assign the result to a variable:

```python
largest_3 = s.nlargest(3).reset_index(drop=True)
print(largest_3)

# Output:
# 0    300
# 1    200
# 2    100
# dtype: int64
```

Also Read: Pandas vs SQL. When Data Scientists Should Use One Over the Other

14. pct_change()

For a Series object, we can calculate the percentage change (or, more precisely, the fraction change) between the current and a prior element. This approach can be helpful, for example, when working with time series, or for creating a waterfall chart in percentages or fractions.

```python
s = pd.Series([20, 33, 14, 97, 19])
s.pct_change()

# Output:
# 0         NaN
# 1    0.650000
# 2   -0.575758
# 3    5.928571
# 4   -0.804124
# dtype: float64
```

To make the resulting Series more readable, let's round it:

```python
s.pct_change().round(2)

# Output:
# 0     NaN
# 1    0.65
# 2   -0.58
# 3    5.93
# 4   -0.80
# dtype: float64
```

15. explode()

This method transforms each list-like element of a Series (lists, tuples, sets, Series, ndarrays) into a row. Empty list-likes will be transformed into a row with NaN. To avoid repeated indices in the resulting Series, it's better to reset the index:

```python
s = pd.Series([[np.nan], {1, 2}, 3, (4, 5)])
print(s)
s_exploded = s.explode().reset_index(drop=True)
print(s_exploded)

# Output:
# 0     [nan]
# 1    {1, 2}
# 2         3
# 3    (4, 5)
# dtype: object
#
# 0    NaN
# 1      1
# 2      2
# 3      3
# 4      4
# 5      5
# dtype: object
```
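As a side note, newer pandas versions (1.1 and later, if memory serves; treat the exact version as an assumption) let explode() reset the index in the same call through the ignore_index parameter:

```python
# same result as explode().reset_index(drop=True), in a single call
s_exploded = s.explode(ignore_index=True)
print(s_exploded)
```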
16. repeat()

This method is used for consecutively repeating each element of a Series a defined number of times. Also in this case, it makes sense to reset the index:

```python
s = pd.Series([1, 2, 3])
print(s)
s_repeated = s.repeat(2).reset_index(drop=True)
print(s_repeated)

# Output:
# 0    1
# 1    2
# 2    3
# dtype: int64
#
# 0    1
# 1    1
# 2    2
# 3    2
# 4    3
# 5    3
# dtype: int64
```

If the number of repetitions is set to 0, an empty Series will be returned:

```python
s.repeat(0)

# Output:
# Series([], dtype: int64)
```

Conclusion

To sum up, we investigated 16 rarely used pandas methods for working with Series and some of their application cases. If you know some other interesting ways to manipulate pandas Series, you're very welcome to share them in the comments.

Thanks for reading!

Also read: Using Python And Pandas Datareader to Analyze Financial Data
Mar 02, 2021
Yet another Python library for Data Analysis that You Should Know About — and no, I am not talking about Spark or Dask.

Big Data analysis in Python is having its renaissance. It all started with NumPy, which is also one of the building blocks behind the tool I am presenting in this article.

In 2006, Big Data was a topic that was slowly gaining traction, especially with the release of Hadoop. Pandas followed soon after with its DataFrames. 2014 was the year when Big Data became mainstream; Apache Spark was also released that year. In 2018 came Dask and other libraries for data analytics in Python.

Each month I find a new Data Analytics tool, which I am eager to learn. It is a worthy investment to spend an hour or two on tutorials, as it can save you a lot of time in the long run. It's also important to keep in touch with the latest tech.

While you might expect that this article will be about Dask, you are wrong. I found another Python library for data analysis that you should know about.

Like Python, it is equally important that you become proficient in SQL. In case you aren't familiar with it, and you have some money to spare, check out this course: Master SQL, the core language for Big Data analysis.

In case you're interested, Udacity offers Free Access to:
- Intro to Machine Learning with PyTorch
- Deep Learning Nanodegree and more

Meet Vaex

Photo by Mathew Schwartz on Unsplash

Vaex is a high-performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second. It supports multiple visualizations, allowing interactive exploration of big data.

What's the difference between Vaex and Dask?

Photo by Stillness InMotion on Unsplash

Vaex is not similar to Dask as a whole but is similar to Dask DataFrames, which are built on top of pandas DataFrames. This means that Dask inherits pandas issues, like high memory usage. This is not the case with Vaex.

Vaex doesn't make DataFrame copies, so it can process bigger DataFrames on machines with less main memory.

Both Vaex and Dask use lazy processing. The only difference is that Vaex calculates the field when needed, whereas with Dask we need to explicitly use the compute function.

Data needs to be in HDF5 or Apache Arrow format to take full advantage of Vaex.

How to install Vaex?

Installing Vaex is as simple as installing any other Python package:

```
pip install vaex
```

Let's take Vaex for a test drive

Photo by Eugene Chystiakov on Unsplash

Let's create a pandas DataFrame with 1 million rows and 1000 columns to create a big data file.

```python
import vaex
import pandas as pd
import numpy as np

n_rows = 1000000
n_cols = 1000
df = pd.DataFrame(np.random.randint(0, 100, size=(n_rows, n_cols)),
                  columns=['col%d' % i for i in range(n_cols)])
df.head()
```

First few lines in a Pandas DataFrame (image made by author)

How much main memory does this DataFrame use?

```python
df.info(memory_usage='deep')
```
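As a rough sanity check (assuming the random integers are stored as 64-bit values, which is the NumPy default on most platforms), the expected in-memory size can be estimated directly:

```python
# 1,000,000 rows x 1,000 columns x 8 bytes per int64 value
approx_bytes = 1_000_000 * 1_000 * 8
print(f"{approx_bytes / 1024**3:.2f} GiB")   # roughly 7.45 GiB
```

This matches the roughly 7.5 GB figure reported by df.info and referenced later in the article.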
Let's save it to disk so that we can read it later with Vaex.

```python
file_path = 'big_file.csv'
df.to_csv(file_path, index=False)
```

We wouldn't gain much by reading the whole CSV directly with Vaex, as the speed would be similar to pandas. Both need approximately 85 seconds on my laptop.

We need to convert the CSV to HDF5 (the Hierarchical Data Format version 5) to see the benefit of Vaex. Vaex has a function for the conversion, which even supports files bigger than main memory by converting smaller chunks. If you cannot open a big file with pandas because of memory constraints, you can convert it to HDF5 and process it with Vaex.

```python
dv = vaex.from_csv(file_path, convert=True, chunk_size=5_000_000)
```

This function creates an HDF5 file and persists it to disk.

What's the datatype of dv?

```python
type(dv)
# output
# vaex.hdf5.dataset.Hdf5MemoryMapped
```

Now, let's read the 7.5 GB dataset with Vaex — we wouldn't need to read it again as we already have it in the dv variable. This is just to test the speed.

```python
dv = vaex.open('big_file.csv.hdf5')
```

Vaex needed less than 1 second to execute the command above. But Vaex didn't actually read the file, because of lazy loading, right?

Let's force it to read the data by calculating the sum of col1.

```python
suma = dv.col1.sum()
suma
# Output
# array(49486599)
```

I was really surprised by this one. Vaex needed less than 1 second to calculate the sum. How is that possible?

Opening such data is instantaneous regardless of the file size on disk. Vaex simply memory-maps the data instead of reading it into memory. This is the optimal way of working with datasets that are larger than the available RAM.

Plotting

Vaex is also fast when plotting data. It has special plotting functions: plot1d, plot2d and plot2d_contour.

```python
dv.plot1d(dv.col2, figsize=(14, 7))
```

Plotting with Vaex (image made by author)

Virtual columns

Vaex creates a virtual column when adding a new column — a column that doesn't take up main memory as it is computed on the fly.

```python
dv['col1_plus_col2'] = dv.col1 + dv.col2
dv['col1_plus_col2']
```

The virtual column in Vaex (image made by author)

Efficient filtering

Vaex won't create DataFrame copies when filtering data, which is much more memory efficient.

```python
dvv = dv[dv.col1 > 90]
```

Aggregations

Aggregations work slightly differently than in pandas, but more importantly, they are blazingly fast.

Let's add a binary virtual column where col1 >= 50.

```python
dv['col1_50'] = dv.col1 >= 50
```

Vaex combines group by and aggregation in a single command. The command below groups the data by the "col1_50" column and calculates the sum of the col3 column.

```python
dv_group = dv.groupby(dv['col1_50'], agg=vaex.agg.sum(dv['col3']))
dv_group
```

Aggregations in Vaex (image made by author)

Joins

Vaex joins data without making memory copies, which saves main memory. Pandas users will be familiar with the join function:

```python
dv_join = dv.join(dv_group, on='col1_50')
```

Conclusion

In the end, you might ask: should we simply switch from pandas to Vaex? The answer is a big NO.

Pandas is still the best tool for data analysis in Python. It has well-supported functions for the most common data analysis tasks.

When it comes to bigger files, pandas might not be the fastest tool. This is a great place for Vaex to jump in.

Vaex is a tool you should add to your Data Analytics toolbox. When working on an analysis task where pandas is too slow or simply crashes, pull Vaex out of your toolbox, filter out the most important entries, and continue the analysis with pandas.

Follow me on Twitter, where I regularly tweet about Data Science and Machine Learning.
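As a minimal sketch of that last workflow: the conversion helpers below (vaex.from_pandas and to_pandas_df) reflect the Vaex API as I know it, so verify them against your installed version.

```python
import vaex

# convert an existing pandas DataFrame into a Vaex DataFrame
dv = vaex.from_pandas(df, copy_index=False)

# do the heavy filtering in Vaex, then hand a small result back to pandas
small_df = dv[dv.col1 > 90].to_pandas_df()
```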
Mar 02, 2021