The first thing we're going to do is create a DataFrame GroupBy object. This is the first step of GroupBy, and I'm going to explain how it works as we go along. Let's just close this pane.
I'm going to start off with a variable which I'm going to call categories. Then I'm going to reference my DataFrame df_merged, then a period and groupby. I then open a parenthesis and reference the column that I want to group by. Here, we're going to group by the category column, which has those four values: cardio, functional, gym equipment, and weights and bars.
If I reference category and then press Ctrl + Enter, what we get back is a DataFrame GroupBy object. If I click the card, we don't see any data in here. You can think of this as a group of four DataFrames: one holds all the data where cardio is the category, another where functional is the category, another for gym equipment, and the last one for weights and bars. It's basically just an object with four DataFrames within it.
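Here's a minimal sketch of that first step. The small df_merged below is a made-up stand-in for the real data, and the lowercase column names category and revenue are assumptions based on how the columns are described.

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

# Group the rows by the category column.
categories = df_merged.groupby("category")

# The result is a GroupBy object, not a table of data.
print(type(categories).__name__)

# Looping over it shows one sub-DataFrame per category.
for name, group in categories:
    print(name, len(group))
```

Looping over the object is just a way of seeing the per-category DataFrames that the description above refers to.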
Now, I can inspect this DataFrame GroupBy object to show how many rows of data are in each of the DataFrames. You'll note again that I've just created this variable categories, and I'm going to reference that in the next step. If I reference categories, this variable which is basically just the group by object, and then use the method size, what is returned is a series. It shows each of the categories and how many rows of data are in each of them. This is a useful method to inspect how much data is in each of these DataFrames.
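As a rough sketch of the size step, again using a made-up df_merged with assumed column names:

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

categories = df_merged.groupby("category")

# size() returns a Series: one row count per category.
sizes = categories.size()
print(sizes)
```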
The next thing I'm going to do is sum the revenue by category. Again, I'm going to start with the categories variable, which is my DataFrame GroupBy object. Now I want to reference a column within these DataFrames, and I do that in exactly the same way as I would with any DataFrame: I open a square bracket, reference the revenue column, and close the square bracket. Then I use the sum method at the end, and we'll see what we get.
What we return here is a series. If I inspect this card, you see it says category and revenue. We get each of the categories (cardio, functional, etc.) and then it sums up the revenue. That's great and really useful.
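The sum step might look something like this. The data is invented for illustration and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

categories = df_merged.groupby("category")

# Select the revenue column from the grouped object, then sum within each group.
revenue_by_category = categories["revenue"].sum()
print(revenue_by_category)
```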
One thing I want to note here is that it says it's a series, but it really looks like a DataFrame. What's going on is that the series holds the revenue values, and the category column isn't actually a column of a DataFrame; it's the index. We talked earlier about how there's this hidden numerical index starting at zero and going to X depending on how many rows of data you have. When we use GroupBy, the result gets a new index made up of the group labels. We can still refer to rows by their numerical position, but we also have these category labels to look things up with. If you think about the numerical index 0 to X, those are unique values. The category index is also going to hold unique values, one per group, and I'm going to show you how we can leverage this in a moment.
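To make the index point concrete, here's a small sketch on invented data (column names assumed). It prints the index of the grouped result and shows that reset_index turns it into an ordinary two-column DataFrame if that's what you'd rather have.

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

revenue_by_category = df_merged.groupby("category")["revenue"].sum()

# The category names are the index of the Series, not a column.
print(revenue_by_category.index)
print(revenue_by_category.index.is_unique)

# reset_index() moves the labels back into a regular column.
as_frame = revenue_by_category.reset_index()
print(as_frame)
```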
First, I'm just going to change this to values and show you that instead of summing this, we could have used other methods such as mean, which would find the average revenue amount in each of these categories (for instance, cardio, functional, etc.). This probably wouldn't be that useful, but we could use others such as max. That would give us the highest single revenue amount among the rows in each category. For cardio it might have been, for instance, 10 treadmills being sold or something like that. However, sum is more commonly used, so I'm going to return to that.
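Here's a hedged sketch of swapping in mean and max, on made-up data with assumed column names. Note that max returns only the largest revenue value per category; the idxmax line is an extra, not something shown above, for when you want the whole row behind that value.

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

grouped_revenue = df_merged.groupby("category")["revenue"]

# Average revenue per row within each category.
mean_revenue = grouped_revenue.mean()

# Largest single revenue value within each category.
max_revenue = grouped_revenue.max()

# idxmax() gives the row label of each maximum, so loc can fetch the full rows.
top_rows = df_merged.loc[grouped_revenue.idxmax()]

print(mean_revenue)
print(max_revenue)
print(top_rows)
```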
One other thing I'm going to do here is assign this to a new variable, which I'm going to call category_revenue. I want to give you an example of what I was just talking about regarding this new index. If I go to one of these cells and reference that variable category_revenue, let's say I wanted to return the amount related to functional. I could do this in two ways. The first is to put 1, which is the hidden index number for the second row. We've seen how we can return rows of data from a series earlier in the course, so that is 897,986, which is the second value.
Another way I can do this is by referencing my new index and just writing functional. If we put functional, then we get the same result. This shows that it is a new index that we can use to look up data or subsets of data. We're going to see another example of that in a minute. For now, I'm going to get rid of this.
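The two lookups can be sketched like this, on invented data with assumed column names, so the numbers won't match the real figures. The sketch uses iloc and loc to make the position-versus-label distinction explicit; a bare category_revenue[1] on a series with string labels relies on a positional fallback that recent pandas versions warn about, so iloc is the safer way to write it.

```python
import pandas as pd

# Hypothetical stand-in for df_merged; values and column names are assumed.
df_merged = pd.DataFrame({
    "category": ["cardio", "functional", "gym equipment",
                 "weights and bars", "cardio", "functional"],
    "revenue": [1200.0, 300.0, 450.0, 800.0, 2500.0, 150.0],
})

category_revenue = df_merged.groupby("category")["revenue"].sum()

# Lookup by position: the second row (positions start at 0).
by_position = category_revenue.iloc[1]

# Lookup by label: the new category index.
by_label = category_revenue.loc["functional"]

print(by_position, by_label)
```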