pandas: Get a daily description of Dataframe
I have a dataframe that looks like this:
         provider  timestamp                  vehicle_id
id
103107   a         2019-09-11 20:05:47+02:00  x
1192195  b         2019-09-11 00:02:46+02:00  y
434508   c         2019-09-11 00:32:39+02:00  z
530388   c         2019-09-11 08:12:56+02:00  z
1773721  b         2019-09-11 20:02:55+02:00  w
...
I would like to get some statistics on the number of distinct vehicle_ids per day. I have this, which lets me do a describe manually:
df.groupby(['provider', df['timestamp'].dt.strftime('%Y-%m-%d')])[['vehicle_id']].nunique()
                     vehicle_id
provider timestamp
a        2019-09-11        1224
         2019-09-12        1054
b        2019-09-11        2859
         2019-09-12        2761
         2019-09-17         700
How do I wrangle the data so I can get a min / max / average for each day? I'm kind of lost, so any help is much appreciated.
Try this, where grouped_df is the result of your groupby(...).nunique() call:
aggregations = ['mean', 'min', 'max', 'std']
result = grouped_df.groupby('timestamp')['vehicle_id'].agg(aggregations)
Note: You might need to flatten your column index first, but only if grouped_df.columns is a MultiIndex (with a single flat vehicle_id column, skip this step):
grouped_df.columns = [col[1] if col[1] != '' else col[0] for col in grouped_df.columns]
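To make the note about flattening concrete, here is a minimal sketch. The rows are hypothetical (they are not the asker's data), and the names day, flat and nested are introduced only for illustration. It shows that selecting the column with single brackets already gives flat column names, while double brackets produce a MultiIndex that the list comprehension then flattens.

```python
import pandas as pd

# Hypothetical sample: three providers over two days (not from the question).
df = pd.DataFrame({
    "provider": ["a", "a", "b", "b", "b", "c", "a", "b"],
    "timestamp": pd.to_datetime([
        "2019-09-11 08:00:00+02:00", "2019-09-11 09:00:00+02:00",
        "2019-09-11 10:00:00+02:00", "2019-09-11 11:00:00+02:00",
        "2019-09-11 12:00:00+02:00", "2019-09-11 13:00:00+02:00",
        "2019-09-12 08:00:00+02:00", "2019-09-12 09:00:00+02:00",
    ]),
    "vehicle_id": ["x", "y", "x", "y", "z", "x", "x", "y"],
})

# Step 1: distinct vehicles per provider and day (same call as in the question).
day = df["timestamp"].dt.strftime("%Y-%m-%d")
grouped_df = df.groupby(["provider", day])[["vehicle_id"]].nunique()

aggregations = ["mean", "min", "max", "std"]

# Single brackets select a Series, so the aggregated columns are already flat.
# level="timestamp" groups on the index level created in step 1.
flat = grouped_df.groupby(level="timestamp")["vehicle_id"].agg(aggregations)

# Double brackets keep a DataFrame, so the columns become a MultiIndex
# such as ('vehicle_id', 'mean'); this is the case where flattening helps.
nested = grouped_df.groupby(level="timestamp")[["vehicle_id"]].agg(aggregations)
nested.columns = [col[1] if col[1] != "" else col[0] for col in nested.columns]

print(flat)
print(nested)
```

Both frames end up with the same four columns, so in the simple case the single-bracket selection avoids the flattening step altogether.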
Try groupby().agg() on the dataframe your groupby returns (called new_df here):
new_df.groupby('timestamp').vehicle_id.agg(['min', 'max', 'mean'])
Note: Since you only care about one column in your original data, you can select a series in the first groupby instead of a dataframe, i.e.,
# note the single pair of [] around 'vehicle_id'
new_df = (df.groupby(['provider',
                      df['timestamp'].dt.strftime('%Y-%m-%d')])
            ['vehicle_id'].nunique()
         )
Then new_df is a series named vehicle_id, and the next command is just
# note the difference before .agg: no column selection is needed on a series
new_df.groupby('timestamp').agg(['min', 'max', 'mean'])
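As a possible variation on the series-based approach, pandas' named aggregation lets you choose the output column names directly. The sketch below uses hypothetical rows, and the names per_provider_day, daily, min_vehicles, max_vehicles and mean_vehicles are made up for illustration rather than taken from the answer.

```python
import pandas as pd

# Hypothetical rows, not the asker's data.
df = pd.DataFrame({
    "provider": ["a", "b", "b", "c", "a", "b"],
    "timestamp": pd.to_datetime([
        "2019-09-11 20:05:47+02:00", "2019-09-11 00:02:46+02:00",
        "2019-09-11 20:02:55+02:00", "2019-09-11 08:12:56+02:00",
        "2019-09-12 09:00:00+02:00", "2019-09-12 10:00:00+02:00",
    ]),
    "vehicle_id": ["x", "y", "w", "z", "x", "y"],
})

# Series of distinct vehicle counts, indexed by (provider, timestamp).
per_provider_day = (
    df.groupby(["provider", df["timestamp"].dt.strftime("%Y-%m-%d")])["vehicle_id"]
      .nunique()
)

# Named aggregation: each keyword becomes a column in the result.
daily = per_provider_day.groupby(level="timestamp").agg(
    min_vehicles="min",
    max_vehicles="max",
    mean_vehicles="mean",
)

print(daily)
```

Named columns keep the order you wrote them in, which is one reason to prefer them (or a list) over a set of function names.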
If I understand your problem correctly, all you need to do is this:
df.groupby(['provider', df['timestamp'].dt.strftime('%Y-%m-%d')])[['vehicle_id']].nunique()\
.groupby('timestamp')['vehicle_id'].describe()
In the first groupby you'll get a dataframe with the number of unique vehicle_id values by provider and day. For the provided data sample it is:
                     vehicle_id
provider timestamp
a        2019-09-11           1
b        2019-09-11           2
c        2019-09-11           1
The second groupby then gives the statistics per day, so the result will be
            count      mean      std  min  25%  50%  75%  max
timestamp
2019-09-11    3.0  1.333333  0.57735  1.0  1.0  1.0  1.5  2.0
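For anyone who wants to check these numbers, the five sample rows from the question are enough to reproduce them. This is a minimal sketch; the variable names day, counts and stats are introduced here for illustration, and the timestamps are assumed to parse as timezone-aware datetimes.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame(
    {
        "provider": ["a", "b", "c", "c", "b"],
        "timestamp": pd.to_datetime([
            "2019-09-11 20:05:47+02:00",
            "2019-09-11 00:02:46+02:00",
            "2019-09-11 00:32:39+02:00",
            "2019-09-11 08:12:56+02:00",
            "2019-09-11 20:02:55+02:00",
        ]),
        "vehicle_id": ["x", "y", "z", "z", "w"],
    },
    index=pd.Index([103107, 1192195, 434508, 530388, 1773721], name="id"),
)

# First groupby: distinct vehicles per provider and day.
day = df["timestamp"].dt.strftime("%Y-%m-%d")
counts = df.groupby(["provider", day])[["vehicle_id"]].nunique()

# Second groupby: summary statistics across providers for each day.
stats = counts.groupby(level="timestamp")["vehicle_id"].describe()

print(counts)
print(stats)
```

Note that count here is the number of providers active on that day, not the number of vehicles.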