
DataFrame Groupby two columns and get counts of another column

Novice programmer here seeking help. I have a DataFrame that looks like this:

  Cashtag      Date           Message  
0  $AAPL    2018-01-01   "Blah blah $AAPL"
1  $AAPL    2018-01-05   "Blah blah $AAPL"      
2  $AAPL    2019-01-08   "Blah blah $AAPL"     
3  $AAPL    2019-02-09   "Blah blah $AAPL"
4  $AAPL    2019-02-10   "Blah blah $AAPL"
5  $AAPL    2019-03-01   "Blah blah $AAPL"
6  $FB      2018-01-03   "Blah blah $FB"
7  $FB      2018-02-10   "Blah blah $FB"    
8  $FB      2018-02-11   "Blah blah $FB"   
9  $FB      2019-03-22   "Blah blah $FB" 
10 $AMZN    2018-04-13   "Blah blah $AMZN"
11 $AMZN    2018-04-29   "Blah blah $AMZN"
12 $AMZN    2019-07-23   "Blah blah $AMZN"     
13 $AMZN    2019-07-27   "Blah blah $AMZN"                         

My desired output is a DataFrame that tells me the number of messages for each month of every year in the sample for each company. In this example it would be:

   Cashtag    Date    #Messages       
0  $AAPL    2018-01      02       
1  $AAPL    2019-01      01   
2  $AAPL    2019-02      02     
3  $AAPL    2019-03      01
4  $FB      2018-01      01
5  $FB      2018-02      02        
6  $FB      2019-03      01   
7  $AMZN    2018-04      02  
8  $AMZN    2019-07      02       

I've tried many combinations of .groupby() but have not achieved a solution.

How can I achieve my desired output?

Try:

If Date is a string:

>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: x[:7])]).agg({"Message": "count"}).reset_index()

If Date is a datetime:

>>> df.groupby([df["Cashtag"], df["Date"].apply(lambda x: "{0}-{1:02}".format(x.year, x.month))]).agg({"Message": "count"}).reset_index()

Output:

  Cashtag     Date  Message
0   $AAPL  2018-01        2
1   $AAPL  2019-01        1
2   $AAPL  2019-02        2
3   $AAPL  2019-03        1
4   $AMZN  2018-04        2
5   $AMZN  2019-07        2
6     $FB  2018-01        1
7     $FB  2018-02        2
8     $FB  2019-03        1
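
To run the string version end to end, here is a minimal sketch that rebuilds the sample frame from the question. The names `rows` and `counts` are my own, and I swapped the lambda for `.str[:7]`, which slices every string in the column in the same way.

```python
import pandas as pd

# Sample data from the question: (Cashtag, Date) pairs, dates kept as strings
rows = [
    ("$AAPL", "2018-01-01"), ("$AAPL", "2018-01-05"), ("$AAPL", "2019-01-08"),
    ("$AAPL", "2019-02-09"), ("$AAPL", "2019-02-10"), ("$AAPL", "2019-03-01"),
    ("$FB", "2018-01-03"), ("$FB", "2018-02-10"), ("$FB", "2018-02-11"),
    ("$FB", "2019-03-22"),
    ("$AMZN", "2018-04-13"), ("$AMZN", "2018-04-29"),
    ("$AMZN", "2019-07-23"), ("$AMZN", "2019-07-27"),
]
df = pd.DataFrame(rows, columns=["Cashtag", "Date"])
df["Message"] = "Blah blah " + df["Cashtag"]

# The first 7 characters of 'YYYY-MM-DD' are 'YYYY-MM'
counts = (
    df.groupby(["Cashtag", df["Date"].str[:7]])["Message"]
    .count()
    .reset_index()
)
print(counts)
```

Note that groupby sorts the group keys by default, which is why $AMZN comes before $FB in the result.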

There are two tricky parts. One is handling dates and the other is the groupby itself.

To group by just year and month, you need to extract them from your dates. You can use string indexing, or convert your "Date" column to datetimes and format them with strftime. I will use the second method because I find it more readable and also more useful as a learning point.

The important point about groupby is that you can pass it a list of column labels. Aggregation is then done on every unique combination of values in those columns.
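
As a small illustration of "every unique combination", here is a sketch on a made-up toy frame (the data and the names `toy`, `non_null` and `n_rows` are mine, not from the question). It also shows one thing worth knowing: `count` skips missing values in the selected column, while `size` counts rows regardless.

```python
import pandas as pd

# Hypothetical toy data: one message is missing on purpose
toy = pd.DataFrame({
    "Cashtag": ["$AAPL", "$AAPL", "$AAPL", "$FB"],
    "year-month": ["2018-01", "2018-01", "2019-01", "2018-01"],
    "Message": ["a", None, "c", "d"],
})

# One group per unique (Cashtag, year-month) pair: three groups here
grouped = toy.groupby(["Cashtag", "year-month"])["Message"]

non_null = grouped.count()  # non-missing messages per group
n_rows = grouped.size()     # rows per group, missing or not

print(non_null)
print(n_rows)
```

If your Message column can contain missing values and you want every row counted, `size` is probably the one you want.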

# convert Date to datetimes
df['Date'] = pd.to_datetime(df['Date'])
# extract year and month from datetime objects with `strftime`
df['year-month'] = df['Date'].apply(lambda x: (x.strftime('%Y-%m')))
# groupby columns 'Cashtag' and 'year-month' and aggregate 'Message' using the `count` function
df.groupby(['Cashtag', 'year-month'])['Message'].count()

If you don't want to create a new column, you can do it in a single line:

df.groupby(['Cashtag', df['Date'].apply(lambda x: (x.strftime('%Y-%m')))])['Message'].count()
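
The `apply` with a lambda can also be written with the `.dt` accessor, which formats the whole column at once. A minimal sketch on a three-row frame of my own (the names `with_apply` and `with_dt` are just for comparison), assuming "Date" already holds datetimes:

```python
import pandas as pd

# Small hypothetical frame; 'Date' is already a datetime column
df = pd.DataFrame({
    "Cashtag": ["$AAPL", "$AAPL", "$FB"],
    "Date": pd.to_datetime(["2018-01-01", "2018-01-05", "2018-02-10"]),
    "Message": ["Blah blah $AAPL", "Blah blah $AAPL", "Blah blah $FB"],
})

# Row-by-row formatting, as in the one-liner above
with_apply = df.groupby(
    ["Cashtag", df["Date"].apply(lambda x: x.strftime("%Y-%m"))]
)["Message"].count()

# Same grouping keys via the vectorized .dt.strftime
with_dt = df.groupby(
    ["Cashtag", df["Date"].dt.strftime("%Y-%m")]
)["Message"].count()

print(with_dt)
```

Both versions should give the same groups and counts; the `.dt` form is just shorter.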

Solution using resample:

import pandas as pd


data = [
    ('$AAPL', '2018-01-01', "Blah blah $AAPL"),
    ('$AAPL', '2018-01-05', "Blah blah $AAPL"),      
    ('$AAPL', '2019-01-08', "Blah blah $AAPL"),     
    ('$AAPL', '2019-02-09', "Blah blah $AAPL"),
    ('$AAPL', '2019-02-10', "Blah blah $AAPL"),
    ('$AAPL', '2019-03-01', "Blah blah $AAPL"),
    ('$FB',   '2018-01-03', "Blah blah $FB"),
    ('$FB',   '2018-02-10', "Blah blah $FB"),  
]

df = pd.DataFrame.from_records(data=data, columns=['Cashtag', 'Date', 'Message'])


df['Date'] = pd.to_datetime(df['Date'])

df = (df
    .set_index(pd.DatetimeIndex(df['Date']))
    .groupby('Cashtag')
    .resample('M')['Message']  # monthly bins; pandas 2.2+ deprecates 'M' in favor of 'ME'
    .count()
    .reset_index()
    .query('Message > 0')
    .reset_index(drop=True)
)
df['Date'] = df['Date'].dt.to_period('M')

Output:

  Cashtag     Date  Message
0   $AAPL  2018-01        2
1   $AAPL  2019-01        1
2   $AAPL  2019-02        2
3   $AAPL  2019-03        1
4     $FB  2018-01        1
5     $FB  2018-02        1

Or an even simpler solution:

df['Date'] = pd.to_datetime(df['Date']).dt.to_period('M')
df = df.groupby(['Cashtag', 'Date'])['Message'].count().reset_index()
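
If you want the result to look like the desired output in the question (companies in their original order, count column named #Messages), a sketch along these lines should work. It uses the full sample data; `sort=False` and the `name=` argument to `reset_index` are my additions, and `rows` and `result` are names I made up.

```python
import pandas as pd

# Full sample data from the question
rows = [
    ("$AAPL", "2018-01-01"), ("$AAPL", "2018-01-05"), ("$AAPL", "2019-01-08"),
    ("$AAPL", "2019-02-09"), ("$AAPL", "2019-02-10"), ("$AAPL", "2019-03-01"),
    ("$FB", "2018-01-03"), ("$FB", "2018-02-10"), ("$FB", "2018-02-11"),
    ("$FB", "2019-03-22"),
    ("$AMZN", "2018-04-13"), ("$AMZN", "2018-04-29"),
    ("$AMZN", "2019-07-23"), ("$AMZN", "2019-07-27"),
]
df = pd.DataFrame(rows, columns=["Cashtag", "Date"])
df["Message"] = "Blah blah " + df["Cashtag"]

# Collapse each date to its month (a Period such as 2018-01)
df["Date"] = pd.to_datetime(df["Date"]).dt.to_period("M")

result = (
    df.groupby(["Cashtag", "Date"], sort=False)["Message"]  # keep first-seen order
    .count()
    .reset_index(name="#Messages")  # name the count column
)
print(result)
```

With `sort=False` the groups come out in order of first appearance, so $FB stays ahead of $AMZN as in the question.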
