Can someone explain what happens during a reset_index(name='counts') operation after a groupby(...).size() operation on a DataFrame? It does exactly what I want (it creates a DataFrame with a column 'counts' that holds the size of each group), but I don't understand why it works.
import pandas as pd
df = pd.DataFrame({'letter': ['A', 'A', 'B', 'B', 'C'], 'number': [0, 0, 1, 2, 0]})
If I do a groupby + size operation, df.groupby(['letter', 'number']).size(), I get a multi-level index with one 'letter' level and one 'number' level:
sizes = df.groupby(['letter', 'number']).size()
print(sizes.index)
Out: MultiIndex(levels=[[u'A', u'B', u'C'], [0, 1, 2]], labels=[[0, 1, 1, 2], [0, 1, 2, 0]], names=[u'letter', u'number'])
I'm confused about what happens when I add the .reset_index(...) operation:
df = df.groupby(['letter', 'number']).size().reset_index(name='counts')
which produces the following DataFrame with index = RangeIndex(start=0, stop=4, step=1):
  letter  number  counts
0      A       0       2
1      B       1       1
2      B       2       1
3      C       0       1
I'm particularly confused about the following points:
- How does the name keyword argument work?
- The output of reset_index has a column named 'counts', but the reset_index documentation doesn't say anything about causing a column to be named, so how does this happen?

The text in your question is a bit confusing. When you use groupby you need to provide an argument for the grouping, so you may want to edit the question. I think I can still answer it, though.
If you group by one thing, you will typically get a Series back from .size() (or from .count() on a single selected column). You can use .index to check what is going on:
In [18]: df1 = pd.DataFrame({'letter': ['A', 'A', 'B', 'B', 'C'], 'number': [0, 0, 1, 2, 0]})
In [19]: df1
Out[19]:
  letter  number
0      A       0
1      A       0
2      B       1
3      B       2
4      C       0
In [20]: df1.index
Out[20]: RangeIndex(start=0, stop=5, step=1)
In [21]: df1.groupby('letter').size()
Out[21]:
letter
A    2
B    2
C    1
dtype: int64
In [22]: size_groups = _
In [23]: size_groups.index
Out[23]: Index(['A', 'B', 'C'], dtype='object', name='letter')
In [24]: type(size_groups)
Out[24]: pandas.core.series.Series
So this is a Series whose index holds the group labels shown above. If you reset that index, pandas turns the old index into an ordinary column called 'letter', keeps the sizes as a second column, and adds a fresh default integer index. The result is a DataFrame with two columns:
In [25]: size_groups.reset_index()
Out[25]:
  letter  0
0      A  2
1      B  2
2      C  1
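The value column above is labelled 0 because the Series returned by .size() has no name. As a small sketch (the variable names default_frame and named_frame are mine, not from the question), the name argument of Series.reset_index is what supplies a label for that column:

```python
import pandas as pd

df1 = pd.DataFrame({'letter': ['A', 'A', 'B', 'B', 'C'],
                    'number': [0, 0, 1, 2, 0]})

# Unnamed Series of group sizes, indexed by 'letter'
size_groups = df1.groupby('letter').size()

# Without a name, the value column gets the default label 0
default_frame = size_groups.reset_index()

# With name='counts', the value column is labelled 'counts' instead
named_frame = size_groups.reset_index(name='counts')

print(list(default_frame.columns))
print(list(named_frame.columns))
```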
You won't get a multi-level index out of this unless you group by two things. For instance:
In [43]: df1
Out[43]:
  letter  number
0      A       0
1      A       0
2      B       1
3      B       2
4      C       0
In [44]: df2 = df1.groupby(['letter', 'number']).size()
In [45]: df2
Out[45]:
letter  number
A       0         2
B       1         1
        2         1
C       0         1
dtype: int64
In [46]: df2.index
Out[46]:
MultiIndex([('A', 0),
            ('B', 1),
            ('B', 2),
            ('C', 0)],
           names=['letter', 'number'])
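Putting this together for the two-level case: the object being reset is a Series, and the name argument belongs to Series.reset_index. As far as I know DataFrame.reset_index has no name argument, which may be why the documentation consulted in the question doesn't mention it. The sketch below (the names sizes, counts and two_step are my own) shows that both index levels become columns and that the call seems to match naming the Series first and then resetting its index:

```python
import pandas as pd

df1 = pd.DataFrame({'letter': ['A', 'A', 'B', 'B', 'C'],
                    'number': [0, 0, 1, 2, 0]})

# Series of group sizes with a two-level MultiIndex ('letter', 'number')
sizes = df1.groupby(['letter', 'number']).size()

# Both index levels become ordinary columns; the sizes go into 'counts'
counts = sizes.reset_index(name='counts')

# The same thing in two explicit steps: name the Series, then reset its index
two_step = sizes.rename('counts').reset_index()

print(counts)
print(counts.equals(two_step))
```

After the reset, the frame has a default RangeIndex, which is the RangeIndex(start=0, stop=4, step=1) reported in the question.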