Get total by groups for all rows, selected rows and percent of total in pandas
Let us say I have a pandas DataFrame called mydf, i.e.:
import pandas as pd
mydf = pd.DataFrame({
'type':['A','A','A', 'B','B','B', 'C'],
'state':['NY','CA','NY', 'NY','CA','CA', 'WY'],
'date':['2018-01-02','2018-01-04','2018-02-06',
'2018-01-01','2018-01-24','2018-02-10','2018-01-24']
})
Out[28]:
date state type
0 2018-01-02 NY A
1 2018-01-04 CA A
2 2018-02-06 NY A
3 2018-01-01 NY B
4 2018-01-24 CA B
5 2018-02-10 CA B
6 2018-01-24 WY C
I want a table that counts the number of records per state and date (year-month only, not the daily date), once for records of type A only and once for all records (types A, B and C), and then gives the percentage of type A records within each group's total.
That is, the final output would be another pandas DataFrame with the following columns and values:
date_ym state total_count total_type_A percentage
20181 CA 2 1 50
20181 NY 2 1 50
20181 WY 1 0 0
20182 CA 1 0 0
20182 NY 1 1 100
I could create two tables, merge them and then count, but I am looking for a simpler one-liner.
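First convert the dates to year-month strings. The column holds plain strings, so it has to be parsed with pd.to_datetime before the .dt accessor can be used:
mydf["date"] = pd.to_datetime(mydf["date"]).dt.strftime("%Y%m")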
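As a small illustration of the conversion step, the sketch below parses string dates and builds both a zero-padded and a non-padded year-month key. The sample values and the variable names (dates, parsed, padded, unpadded) are made up for this example. Building the non-padded form from the year and month components is just one possible way to get keys like 20181 without relying on platform-specific strftime directives; note that non-padded keys do not sort chronologically as strings once two-digit months appear.

```python
import pandas as pd

# Illustrative sample values, not taken from the question's data.
dates = pd.Series(["2018-01-02", "2018-02-06", "2018-11-24"])

# The .dt accessor needs a datetime dtype, so parse the strings first.
parsed = pd.to_datetime(dates)

# Zero-padded year-month; "%Y%m" behaves the same on every platform.
padded = parsed.dt.strftime("%Y%m")

# Non-padded month assembled from the date components.
unpadded = parsed.dt.year.astype(str) + parsed.dt.month.astype(str)

print(padded.tolist())
print(unpadded.tolist())
```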
Then use groupby.agg:
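def total_type_A(x):
    # number of rows in the group whose type is "A"
    return sum(x == "A")
def percentage(x):
    # share of type "A" rows in the group, expressed in percent
    return sum(x == "A") / len(x) * 100
mydf.groupby(["date", "state"]).agg([len, total_type_A, percentage])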
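For comparison, here is a minimal sketch of the same idea using pandas named aggregation, which gives flat column names matching the requested output. The intermediate names date_ym and result are chosen for this example only, and the percentage is computed in a separate step after the aggregation; treat it as one possible arrangement rather than the answer's own method.

```python
import pandas as pd

mydf = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B", "B", "C"],
    "state": ["NY", "CA", "NY", "NY", "CA", "CA", "WY"],
    "date": ["2018-01-02", "2018-01-04", "2018-02-06",
             "2018-01-01", "2018-01-24", "2018-02-10", "2018-01-24"],
})

# Year-month key (zero-padded, so it also sorts chronologically).
mydf["date_ym"] = pd.to_datetime(mydf["date"]).dt.strftime("%Y%m")

result = (
    mydf.groupby(["date_ym", "state"])
    .agg(
        total_count=("type", "size"),                         # all rows in the group
        total_type_A=("type", lambda s: (s == "A").sum()),    # rows of type A only
    )
    .reset_index()
)

# Share of type A within each group, in percent.
result["percentage"] = result["total_type_A"] / result["total_count"] * 100
print(result)
```

Because the counts come from separate aggregations rather than a single Series, they stay integers here.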
Another alternative would be to create a function that returns a Series with your desired columns.
Full example:
import pandas as pd
df = pd.DataFrame({
'type':['A','A','A', 'B','B','B', 'C'],
'state':['NY','CA','NY', 'NY','CA','CA', 'WY'],
'date':['2018-01-02','2018-01-04','2018-02-06',
'2018-01-01','2018-01-24','2018-02-10','2018-01-24']
})
# '%#m' (month without zero padding) is Windows-only; use '%-m' on Linux/macOS
df['date_ym'] = pd.to_datetime(df['date']).dt.strftime('%Y%#m')
def func(x):
    cnt = len(x)            # all records in the group
    cnt_A = sum(x == 'A')   # records of type A in the group
    return pd.Series({
        'total_count': cnt,
        'total_type_A': cnt_A,
        'percentage': cnt_A/cnt*100
    })
df = df.groupby(['date_ym','state'])['type'].apply(func).unstack().reset_index()
print(df)
Returns:
date_ym state total_count total_type_A percentage
0 20181 CA 2.0 1.0 50.0
1 20181 NY 2.0 1.0 50.0
2 20181 WY 1.0 0.0 0.0
3 20182 CA 1.0 0.0 0.0
4 20182 NY 1.0 1.0 100.0
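The counts print as floats because func returns all three values in one Series, so they share a single float dtype before unstack. If integer counts are preferred, one option is to cast those two columns afterwards. The sketch below uses a hand-built frame named out as a stand-in for the result above; the name is hypothetical and the values are copied from the printed output.

```python
import pandas as pd

# Stand-in for the unstacked result printed above (values copied from it).
out = pd.DataFrame({
    "date_ym": ["20181", "20181", "20181", "20182", "20182"],
    "state": ["CA", "NY", "WY", "CA", "NY"],
    "total_count": [2.0, 2.0, 1.0, 1.0, 1.0],
    "total_type_A": [1.0, 1.0, 0.0, 0.0, 1.0],
    "percentage": [50.0, 50.0, 0.0, 0.0, 100.0],
})

# Cast only the count columns; the percentage stays a float.
out = out.astype({"total_count": "int64", "total_type_A": "int64"})
print(out)
```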