DataFrame Groupby two columns and get counts of another column
Novice programmer here seeking help. I have a DataFrame that looks like this:
Cashtag Date Message
0 $AAPL 2018-01-01 "Blah blah $AAPL"
1 $AAPL 2018-01-05 "Blah blah $AAPL"
2 $AAPL 2019-01-08 "Blah blah $AAPL"
3 $AAPL 2019-02-09 "Blah blah $AAPL"
4 $AAPL 2019-02-10 "Blah blah $AAPL"
5 $AAPL 2019-03-01 "Blah blah $AAPL"
6 $FB 2018-01-03 "Blah blah $FB"
7 $FB 2018-02-10 "Blah blah $FB"
8 $FB 2018-02-11 "Blah blah $FB"
9 $FB 2019-03-22 "Blah blah $FB"
10 $AMZN 2018-04-13 "Blah blah $AMZN"
11 $AMZN 2018-04-29 "Blah blah $AMZN"
12 $AMZN 2019-07-23 "Blah blah $AMZN"
13 $AMZN 2019-07-27 "Blah blah $AMZN"
My desired output is a DataFrame that tells me the number of messages for each month of every year in the sample for each company. In this example it would be:
Cashtag Date #Messages
0 $AAPL 2018-01 02
1 $AAPL 2019-01 01
2 $AAPL 2019-02 02
3 $AAPL 2019-03 01
4 $FB 2018-01 01
5 $FB 2018-02 02
6 $FB 2019-03 01
7 $AMZN 2018-04 02
8 $AMZN 2019-07 02
I've tried many combinations of .groupby() but have not arrived at a solution.
How can I achieve my desired output?
Try:
If Date is a string:
>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: x[:7])]).agg({"Message": "count"}).reset_index()
If Date is a datetime:
>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: "{0}-{1:02}".format(x.year, x.month))]).agg({"Message": "count"}).reset_index()
And the output:
Cashtag Date Message
0 $AAPL 2018-01 2
1 $AAPL 2019-01 1
2 $AAPL 2019-02 2
3 $AAPL 2019-03 1
4 $AMZN 2018-04 2
5 $AMZN 2019-07 2
6 $FB 2018-01 1
7 $FB 2018-02 2
8 $FB 2019-03 1
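The two one-liners above call a Python lambda once per row. As a rough sketch of the same idea, the month key can also be built with pandas' vectorized `.str` and `.dt` accessors. The three-row frame below is a shortened stand-in for the question's data, and the names `month_from_str` and `month_from_dt` are made up for illustration.

```python
import pandas as pd

# Shortened stand-in for the question's data (assumption: Date starts out as strings).
df = pd.DataFrame({
    "Cashtag": ["$AAPL", "$AAPL", "$FB"],
    "Date": ["2018-01-01", "2018-01-05", "2018-01-03"],
    "Message": ["Blah blah $AAPL", "Blah blah $AAPL", "Blah blah $FB"],
})

# String dates: slice the first 7 characters ("YYYY-MM") with the .str accessor.
month_from_str = df["Date"].str[:7]

# Datetime dates: format with the .dt accessor instead of a per-row lambda.
month_from_dt = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m")

# Group by the Cashtag column and the derived month key, then count messages.
counts = (
    df.groupby(["Cashtag", month_from_str])["Message"]
    .count()
    .reset_index()
)
print(counts)
```

Both keys hold the same "YYYY-MM" strings here, so either one can be passed to groupby; which one applies depends only on the dtype of the Date column.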
There are two tricky parts. One is handling dates and the other is the groupby itself.
To group by just year and month, you need to extract them from your dates. You can use string indexing, or convert your "Date" column to datetimes and format them with strftime. I will use the second method because I find it more readable and also more useful as a learning point.
The important point about groupby is that you can pass it a list of column labels. Aggregation is then done on every unique combination of values in those columns.
# convert Date to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# extract year and month from datetime objects with `strftime`
df['year-month'] = df['Date'].apply(lambda x: (x.strftime('%Y-%m')))
# groupby columns 'Cashtag' and 'year-month' and aggregate 'Message' using the `count` function
df.groupby(['Cashtag', 'year-month'])['Message'].count()
If you don't want to create a new column, you can do it in a single line:
df.groupby(['Cashtag', df['Date'].apply(lambda x: (x.strftime('%Y-%m')))])['Message'].count()
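The groupby calls above return a Series with a two-level index, sorted by group key. As a possible follow-up, the sketch below rebuilds the question's sample data and turns the result into a flat DataFrame with a `#Messages` column. Passing `sort=False` is an assumption about what the asker wants: it keeps groups in order of first appearance, which matches the $AAPL, $FB, $AMZN order of the desired output. The `dates` dictionary is only a compact way to recreate the sample.

```python
import pandas as pd

# Recreate the question's sample data in compact form.
dates = {
    "$AAPL": ["2018-01-01", "2018-01-05", "2019-01-08",
              "2019-02-09", "2019-02-10", "2019-03-01"],
    "$FB": ["2018-01-03", "2018-02-10", "2018-02-11", "2019-03-22"],
    "$AMZN": ["2018-04-13", "2018-04-29", "2019-07-23", "2019-07-27"],
}
records = [(tag, d, f"Blah blah {tag}") for tag, ds in dates.items() for d in ds]
df = pd.DataFrame(records, columns=["Cashtag", "Date", "Message"])

# Convert to datetimes and derive a "YYYY-MM" key.
df["Date"] = pd.to_datetime(df["Date"])
df["year-month"] = df["Date"].dt.strftime("%Y-%m")

# sort=False keeps groups in order of first appearance instead of sorting keys;
# reset_index(name=...) flattens the index and names the count column.
result = (
    df.groupby(["Cashtag", "year-month"], sort=False)["Message"]
    .count()
    .reset_index(name="#Messages")
)
print(result)
```

Dropping `sort=False` gives the alphabetical order seen in the other answers, with $AMZN ahead of $FB.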
Solution using resample:
import pandas as pd
data = [
('$AAPL', '2018-01-01', "Blah blah $AAPL"),
('$AAPL', '2018-01-05', "Blah blah $AAPL"),
('$AAPL', '2019-01-08', "Blah blah $AAPL"),
('$AAPL', '2019-02-09', "Blah blah $AAPL"),
('$AAPL', '2019-02-10', "Blah blah $AAPL"),
('$AAPL', '2019-03-01', "Blah blah $AAPL"),
('$FB', '2018-01-03', "Blah blah $FB"),
('$FB', '2018-02-10', "Blah blah $FB"),
]
df = pd.DataFrame.from_records(data=data, columns=['Cashtag', 'Date', 'Message'])
df['Date'] = pd.to_datetime(df['Date'])
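df = (df
      .set_index(pd.DatetimeIndex(df['Date']))
      .groupby('Cashtag')
      # monthly bins; pandas 2.2 and later spell this alias 'ME' and deprecate 'M'
      .resample('M')['Message']
      .count()
      .reset_index()
      # resample also emits the empty months in between, so drop the zero counts
      .query('Message > 0')
      .reset_index(drop=True)
     )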
df['Date'] = df['Date'].dt.to_period('M')
Output:
Cashtag Date Message
0 $AAPL 2018-01 2
1 $AAPL 2019-01 1
2 $AAPL 2019-02 2
3 $AAPL 2019-03 1
4 $FB 2018-01 1
5 $FB 2018-02 1
Or an even simpler solution:
df['Date'] = pd.to_datetime(df['Date']).dt.to_period('M')
df = df.groupby(['Cashtag', 'Date'])['Message'].count().reset_index()
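One detail worth knowing about the count-based answers: `count()` only counts non-null values in the selected column, while `size()` counts rows. The sample data has no missing messages, so the difference does not show up there. The sketch below uses a hypothetical frame with one missing message to illustrate it; the column names `non_null_messages` and `rows` are made up for the example.

```python
import pandas as pd

# Hypothetical data: the second $FB message is missing.
df = pd.DataFrame({
    "Cashtag": ["$FB", "$FB", "$FB"],
    "Date": ["2018-02-10", "2018-02-11", "2019-03-22"],
    "Message": ["Blah blah $FB", None, "Blah blah $FB"],
})

# Same month key as the answer above: a monthly Period per row.
df["Date"] = pd.to_datetime(df["Date"]).dt.to_period("M")

grouped = df.groupby(["Cashtag", "Date"])["Message"]

# count() skips the missing message; size() counts every row in the group.
non_null = grouped.count().reset_index(name="non_null_messages")
all_rows = grouped.size().reset_index(name="rows")

print(non_null)
print(all_rows)
```

If a row with an empty message should still count as a message, `size()` is probably the better fit; otherwise `count()` behaves as in the answers above.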