Pandas groupby month and year
I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month, i.e., group by Jan 2013, Feb 2013, Mar 2013, etc.
I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum, but I just can't seem to get anything to work. How can I do it?
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column actually holds datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum() # the same
Out[14]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
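For anyone who wants to run the two approaches end to end, here is a minimal sketch built from the question's data. It assumes the "MS" (month start) alias rather than "M", only because newer pandas releases prefer "ME" for month end while older ones only know "M"; the names by_grouper, by_resample and with_gaps are mine, not from the answer.

```python
import pandas as pd

# Data copied from the question; df1 mirrors the answer's DatetimeIndex frame.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
df1 = df.set_index("Date")

# Assumption: "MS" labels each bin by the first day of the month. It is used
# here only because the alias is accepted by both older and newer pandas.
by_grouper = df1.groupby(pd.Grouper(freq="MS")).sum()
by_resample = df1.resample("MS").sum()

# Months without rows are still present in the result. min_count=1 makes
# their sums NaN; without it, recent pandas reports 0 for an empty month.
with_gaps = df1.resample("MS").sum(min_count=1)
print(with_gaps)
```

Both calls produce one row per calendar month between the first and last date, which is the gap-filling behaviour that a plain groupby on year and month does not give you.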
Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter has been deprecated since 0.21.
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure). I'm including this for interest's sake.
If it's a column (it has to be a datetime64 column! As I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
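The reindexing step itself is not shown, so here is a minimal sketch of one way it could be done, assuming a full monthly range built with pd.period_range between the first and last observed period. The names monthly, all_months and filled are illustrative only.

```python
import pandas as pd

# Data copied from the question, with Date kept as a regular column.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

# Group by monthly period; only months that have rows appear here.
per = df["Date"].dt.to_period("M")
monthly = df.groupby(per)[["abc", "xyz"]].sum()

# Build every month between the first and last period, then reindex so the
# missing months show up as NaN rows.
all_months = pd.period_range(
    start=monthly.index.min(), end=monthly.index.max(), freq="M"
)
filled = monthly.reindex(all_months)
print(filled)
```

If zeros are more useful than NaN for the plot, reindex also accepts fill_value=0.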
Why not keep it simple?!
GB=DF.groupby([(DF.index.year),(DF.index.month)]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot as asked using,
GB.plot('abc','xyz',kind='scatter')
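Since DF is not defined in this answer, here is a self-contained sketch of the same grouping, assuming DF is the question's data with Date parsed and set as the index. The level names "year" and "month" are my addition via rename_axis; they make it easier to pick out one month afterwards.

```python
import pandas as pd

# Data copied from the question, with Date as a DatetimeIndex.
DF = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})
DF["Date"] = pd.to_datetime(DF["Date"], format="%d-%b-%y")
DF = DF.set_index("Date")

# Same grouping as above: one integer key for the year, one for the month.
GB = DF.groupby([DF.index.year, DF.index.month]).sum()

# Assumption: naming the two index levels is optional, done here for clarity.
GB = GB.rename_axis(["year", "month"])

# Select the totals for a single month with a (year, month) tuple.
print(GB.loc[(2013, 6)])
```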
There are different ways to do that.
df = pd.DataFrame({'Date':['01-Jun-13','03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
'abc':[100,-20,40,25,60],'xyz':[200,50,-5,15,80] })
def getMonth(s):
    return s.split("-")[1]

def getDay(s):
    return s.split("-")[0]

def getYear(s):
    return s.split("-")[2]

def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
The functions above are used to create the columns year, month, day and YearMonth. In your case, you need one of both: group by the two columns 'year','month' or by the one column YearMonth.
df['year'] = df['Date'].apply(getYear)
df['month'] = df['Date'].apply(getMonth)
df['day'] = df['Date'].apply(getDay)
df['YearMonth'] = df['Date'].apply(getYearMonth)
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by two columns:
for key, g in df.groupby(['year','month']):
    print(key, g)
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
for key, g in df.groupby('YearMonth'):
    print(key, g)
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
To access a specific group, use get_group:
print(df.groupby('YearMonth').get_group('Jun-13'))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
This hack is similar to get_group: it filters the values to get the grouped rows, and gives the same result.
print(df[df['YearMonth']=='Jun-13'])
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select the list of abc or xyz values during Jun-13:
print(df[df['YearMonth']=='Jun-13'].abc.values)
print(df[df['YearMonth']=='Jun-13'].xyz.values)
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates that you have classified as "year-month" and apply criteria to them to get the related data.
for x in set(df.YearMonth):
    print(df[df['YearMonth']==x].abc.values)
    print(df[df['YearMonth']==x].xyz.values)
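Looping over set(df.YearMonth) visits the months in arbitrary order and rescans the whole frame on every pass. As a hedged alternative, the same values could be collected in one pass with groupby and sort=False, which keeps the groups in order of first appearance. The dictionary name values_by_month is hypothetical, and YearMonth is built here with a string split instead of the helper functions above.

```python
import pandas as pd

# Data copied from the question; Date stays a plain string column here.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# "01-Jun-13" -> "Jun-13": split once on the first dash and keep the rest.
df["YearMonth"] = df["Date"].str.split("-", n=1).str[1]

# sort=False keeps groups in the order they first appear in the frame.
values_by_month = {
    key: (g["abc"].to_numpy(), g["xyz"].to_numpy())
    for key, g in df.groupby("YearMonth", sort=False)
}

for key, (abc, xyz) in values_by_month.items():
    print(key, abc, xyz)
```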
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
However, the groups then come out in string-sorted rather than chronological order when you loop over them, e.g.
for name, group in grouped:
    print(name)
will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So then, if you want to preserve the order, you must do as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
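The difference comes from how the keys sort: "2008 10" sorts before "2008 2" as a string, while the integer pair (2008, 2) sorts before (2008, 10). A small sketch of this, using a hypothetical daily index over the same months as the output above, also shows that a zero-padded string key ("%Y-%m") would sort chronologically too. That last part is my addition, not something from the answers.

```python
import pandas as pd

# Hypothetical daily data from November 2007 to October 2008.
idx = pd.date_range("2007-11-01", "2008-10-31", freq="D")
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

# Unpadded "year month" strings, as in the string-column approach.
unpadded = pd.Series([f"{d.year} {d.month}" for d in df.index], index=df.index)
# Zero-padded strings, e.g. "2008-02".
padded = pd.Series(df.index.strftime("%Y-%m"), index=df.index)

unpadded_order = [name for name, _ in df.groupby(unpadded)]
padded_order = [name for name, _ in df.groupby(padded)]
int_order = [name for name, _ in df.groupby([df.index.year, df.index.month])]

print(unpadded_order[:5])  # string sort puts "2008 10" right after "2008 1"
print(padded_order[:5])    # chronological
print(int_order[:5])       # chronological
```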
Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).
However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:
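# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
    arg=df['Date'],
    format='%d-%b-%y'  # Assuming dd-Mon-yy format
)
# Group by year and month; select the numeric columns explicitly so that
# the datetime column itself is not passed to sum()
df.groupby(
    [
        df['Date'].dt.year,
        df['Date'].dt.month
    ]
)[['abc', 'xyz']].sum()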
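Both group keys above are called Date, so the resulting index levels share one name. A possible variation, sketched here under the assumption that a flat table is wanted for plotting, renames the keys to year and month and calls reset_index. The name monthly is mine.

```python
import pandas as pd

# Data copied from the question, with Date stored as a column.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

# Rename the two key Series so the grouped result gets distinct level names,
# then flatten the index into ordinary columns.
monthly = (
    df.groupby(
        [
            df["Date"].dt.year.rename("year"),
            df["Date"].dt.month.rename("month"),
        ]
    )[["abc", "xyz"]]
    .sum()
    .reset_index()
)
print(monthly)
```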