Pandas Groupby Function: A Brief Overview of All It Can Do

Navigate to:

Python is at the forefront of programming use cases through its wide variety of libraries. One of these libraries is pandas. Pandas can be used to work with large and complex datasets. It contains numerous functions and methods that allow users to manipulate, analyze, and visualize data. One of the functions of pandas is Groupby. This post is about the Groupby function, its syntax, and a few of its example use cases. This tutorial is geared toward developers who are just getting started with the Groupby function and intend to use it for their applications.

What Is Pandas Groupby?

Pandas Groupby is frequently used in data analysis. It allows you to group data and then execute some function or aggregate over the grouped data. The groups are generated on the basis of some category. Suppose you have a dataset containing information on car models, company, and their prices. Groupby allows you to find the average cost of cars by each company by grouping the dataset on the basis of the company and then applying the mean function over the values.

Syntax

Let’s start by overviewing the syntax of the Pandas Groupby function.

df.groupby(by=None, axis=0,sort=True)

The Groupby function has multiple parameters, in addition to the ones above, which can be used in various combinations to achieve desired results. The syntax shown above is explained below:

  1. by: This parameter accepts a function or mapping. It’s used to specify how the date should be grouped and allows you to define the criteria on which they should be grouped.

  2. axis: The axis parameter is an integer. It determines whether the grouping should occur along rows (0) or columns (1) of the data frame. Setting it to 0 indicates to split along rows, and 1 indicates splitting along columns.

  3. sort: Setting this as true allows you to sort group keys.

Pandas Groupby Examples

Before we dive into specific examples, let’s understand what a data frame in pandas is. Data frames are the core data structures in pandas. They provide the context and structure that’s necessary for meaningful Groupby functions.

A data frame is a two-dimensional data structure. It has data sorted into both rows and columns. In simple terms, a data frame can be thought of as a table. Let’s create a simple data frame:

Import pandas as pd df = pd.DataFrame({"Name": ['Ali', 'Ben'], "Age": [23, 32]}) print(df)

The output to the above code is shown below: Output showing name and Age Name and Age represent the column names. The integers 0 and 1 represent the index of each row. Now that we know what a data frame is, let’s read the data from the repo:

noaa-ndbc-data/latest-observations.csv using Pandas df = pd.read_csv('latest-observations.csv')

Reading the data like this will give us a data frame as the output, as we can see here: Table showing read data Now, suppose we have the given data and we want to understand it a bit more for data analysis. We can use the Python Groupby function. Let’s see the columns available to use first:

col_names = list(df.columns) col_names

Suppose we want to find the counts of the unique values of wind speed. We run the following command:

result = df.groupby('wind_speed_mps').size().reset_index(name='Count') 

print(result)

The output shows a count of each value of wind_speed. This data allows us to understand that 1.0 mps is the most common wind speed, as it’s the most frequently occurring figure. On the other hand, the windspeed 1.6, 2.3, 3.5, and 4.7 occur only once.

Pandas Groupby: Multiple columns

You can also group by multiple columns in pandas. Suppose we want to count both the gust and the wind speed. Instead of running two separate commands, we can modify my original command to accommodate this change:

result = df.groupby(['wind_speed_mps', 'gust_speed_mps']).size().reset_index(name='Count')

The resulting data frame looks like this: This command is basically finding the occurrence of gust speed, given that wind speed is a certain value. So, for wind speed 0.0 and gust speed 0.0, we have a count of 6. But when wind is 0 and gust is 0.5, the count is 7. Thus, it creates a hierarchy of sorts.

Pandas Groupby: Aggregate

Groupby also contains an aggregate function, which allows you to group your data and apply an aggregate function of choice. This is a pretty useful tool when you have expansive data and you want to cluster it for brevity. Suppose we want to find the count of all values by grouping the data over wind speed. This can be easily achieved by the following command:

result = df.groupby(['wind_speed_mps']).aggregate('count') result

wind_speed is now the index of our data frame, and the count of all columns is present as well. If we change the aggregate function from count to mean, the output changes too.

result = df.groupby(['wind_speed_mps']).aggregate(‘mean’) result

We now know that for wind speed of 1.0 mps, the average gust speed is 1.81 mps. We can apply different aggregate functions here, such as sum.

Pandas Groupby: Mean

We can also use the Groupby mean command to find the mean of different columns. Suppose I want to find out the mean values for all these attributes of each station. In order to do that, we can execute the following command.

df.groupby(['station_id']).mean() We get the following as an output

For each station id, we get the mean wind_dir_degt, wind_speed, and so on. But what if we only needed to find the mean sea level pressure for each station? We would change my original command to the following:

df.groupby(['station_id'])['sea_level_pressure_hpa'].mean()

And we would get the following data frame as an output: We now know that for station_id 13009, no data is available for sea_level_pressure. This could be an error in the data, which could potentially hint toward a larger issue. Python Groupby has allowed us to analyze and find errors in the data. Hurray!

Pandas Groupby: Sum

Similarly, Groupby has a sum function that allows you to sum over groups. Suppose we want to find the sum of sea level pressure for each station. We’ll execute the following command:

df.groupby('station_name')['sea_level_pressure_hpa'].sum()

Pandas grouby sum outpput We can now see the total sum of sea levels recorded against each station. If we hadn’t specified the specific column that we want to sum over, Python would have given a sum over all attributes in the data frame. Insights such as these can help us discover anomalies in the data and understand if there are any issues we should be looking out for.

For example, in the case above, we see that the sum of sea pressure for York Spit, VA, is 0.0. This can raise some questions. Were these issues in data measurement at the York Spit station? Why has no value ever been recorded? These issues can be raised on a higher level and allow you to uncover anomalies.

Pandas Groupby: Apply

There’s another very useful function you can use in pandas. Groupby apply allows you to apply a custom function to each group in your data. Suppose we want to find the maximum wind speed for each station. Run the following command:

df.groupby('station_name').apply(lambda x: x['wind_speed_mps'].max())

Pandas grouby apply We can also create a custom function using apply. Suppose we want to find the mean of the difference between the wind speed and the gust speed. We could apply the following function.

df.groupby('station_name').apply(lambda x: (x['wind_speed_mps'] - x['gust_speed_mps']).mean())

The resulting data frame is shown below:

Key Takeaways

In this post, we saw how Python pandas Groupby is a useful function that allows us to group data by different categories and apply various functions on it. These functions can also be custom functions, as we saw previously.

Groupby is a powerful tool that allows users to analyze large datasets, detect anomalies, and understand patterns and trends. InfluxDB provides real-time insights for your time series datasets through a single, purpose-built database. Enjoy low latency, unlimited cardinality, native SQL support, and superior data compression. Contact us today to see how we can help with your programming needs.

This post was written by Ali Mannan Tirmizi. Ali is a Senior DevOps manager and specializes in SaaS copywriting. He holds a degree in electrical engineering and physics and has held several leadership positions in the Manufacturing IT, DevOps and social impact domains.