
Python Pandas groupby or rolling multi-year average summary statistics

I have a pandas time series DataFrame with approximately 20 rows for each year from 2014 to 2017, and I'm trying to calculate the mean value for each two-year period. For example: 01/1/2014 ... 31/12/2015, 01/1/2015 ... 31/12/2016, 01/1/2016 ... 31/12/2017.

Here is the code I'm using to import the DataFrame:

import pandas as pd

infile = 'https://environment.data.gov.uk/bwq/downloadAPI/requestDownload?report=samples&bw=ukj2100-14950&to=2018-02-05&from=2014-05-01'
df = pd.read_csv(infile, compression='zip',
                 usecols=['intestinalEnterococciCount', 'sampleTime'],
                 parse_dates=['sampleTime'], infer_datetime_format=True,
                 index_col=['sampleTime'], na_values=True)

Here is a sample of the DataFrame:

                     intestinalEnterococciCount
sampleTime                                     
2014-05-12 13:00:00                          10
2014-05-21 12:27:00                          10
2014-05-27 10:55:00                          10
2014-06-06 12:19:00                          10
2014-06-09 13:26:00                          10

I would like to calculate the mean value for each two-year period. The expected answers would be:

Period                Mean
Jan 2014 - Dec 2015:  33.575
Jan 2015 - Dec 2016:  22.85
Jan 2016 - Dec 2017:  25.5

What I tried:

  • I know I could loop over a list of the two-year periods and calculate each mean that way, but I'm sure there must be a nicer way to achieve this using pandas.
  • I tried .rolling, but that appears to give a rolling mean that advances row by row rather than over two-year periods.
  • I can use groupby(df.index.year).mean() to get the mean for each year, but how would I calculate it for each two-year period?

You can combine groupby and rolling. Record the per-year sum and count rather than the per-year mean, so that the two-year mean can be computed afterwards as the pooled sum divided by the pooled count. To relabel the result, assign the index you need with s.index=[your index list].

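# Per-year sum and count (the year is a string label such as '2014')
s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.agg(['sum','count'])

# Pool each year with the previous one
s=s.rolling(window=2).sum()

# Two-year mean = pooled sum / pooled count
s['mean']=s['sum']/s['count']

s.dropna()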

Out[564]: 
         sum  count    mean
2015  1343.0   40.0  33.575
2016   914.0   40.0  22.850
2017   765.0   30.0  25.500
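
To see why the sum and count are carried rather than the yearly means, here is a minimal sketch on made-up data (the dates and values below are illustrative assumptions, not the real bathing-water samples). Averaging yearly means weights each year equally, whereas dividing the pooled sum by the pooled count weights each sample equally, which is what taking the mean over all rows in the two-year window does.

```python
import pandas as pd

# Made-up stand-in data: 3 samples in 2014, 2 in 2015, 3 in 2016.
idx = pd.to_datetime([
    "2014-05-12", "2014-06-09", "2014-07-01",
    "2015-05-20", "2015-08-03",
    "2016-06-15", "2016-07-22", "2016-09-01",
])
col = "intestinalEnterococciCount"
df = pd.DataFrame({col: [10, 20, 30, 40, 60, 5, 15, 25]}, index=idx)

# Per-year sum and count, keyed by integer year.
per_year = df.groupby(df.index.year)[col].agg(["sum", "count"])

# Summing over two consecutive yearly rows pools the two years.
pooled = per_year.rolling(window=2).sum().dropna()
pooled["mean"] = pooled["sum"] / pooled["count"]

# Cross-check: mean taken directly over all rows of 2014 and 2015.
direct = df.loc["2014":"2015", col].mean()

# For contrast: the average of the two yearly means, which differs
# because the years have different sample counts.
yearly_means = df.groupby(df.index.year)[col].mean()
mean_of_means = yearly_means.loc[2014:2015].mean()

print(pooled)
print(direct, mean_of_means)
```

Note that rolling(window=2) pairs adjacent rows of the yearly table, so this sketch assumes every calendar year in the range has at least one sample; a missing year would silently pool two non-adjacent years.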

Update:

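s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.apply(list)
# Series.std(level=0) was removed in pandas 2.0; groupby(level=0).std() is the equivalent
(s+s.shift()).dropna().apply(pd.Series).stack().groupby(level=0).std()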
Out[601]: 
2015    76.472179
2016    33.701974
2017    34.845224
dtype: float64
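
The same pooling idea can be applied to any statistic by selecting the rows of each window and passing them to a function. The following is one possible sketch rather than the answer's own method: window_stat is a hypothetical helper name, and the data are made up for illustration. It selects rows with a boolean mask on the index year, so it does not depend on list concatenation.

```python
import pandas as pd

def window_stat(series, func, span=2):
    """Apply func to the pooled rows of each run of `span` consecutive calendar years.

    Hypothetical helper for illustration; assumes a DatetimeIndex.
    """
    years = series.index.year
    first, last = int(years.min()), int(years.max())
    out = {}
    for start in range(first, last - span + 2):
        # Rows whose year falls in [start, start + span - 1].
        mask = (years >= start) & (years < start + span)
        out[f"{start}-{start + span - 1}"] = func(series[mask])
    return pd.Series(out)

# Made-up stand-in data, not the real samples.
idx = pd.to_datetime([
    "2014-05-12", "2014-06-09", "2014-07-01",
    "2015-05-20", "2015-08-03",
    "2016-06-15", "2016-07-22", "2016-09-01",
])
series = pd.Series([10, 20, 30, 40, 60, 5, 15, 25], index=idx,
                   name="intestinalEnterococciCount")

# Sample standard deviation (ddof=1, the pandas default) per two-year window.
result = window_stat(series, lambda x: x.std())
print(result)
```

This still loops over the windows, but only once per window rather than once per row, which is usually acceptable when there are only a handful of years.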

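To get other aggregate statistics like standard deviation and geometric mean, here's a somewhat hackish way. Note that the label slice str(y):str(y+2) is inclusive at both ends, so each window below covers calendar years y through y+2 where data exist; use str(y+1) as the upper bound for strict two-year windows.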

df_std = pd.DataFrame([df[str(y):str(y+2)].std() for y in df.index.year.unique()])
df_std.index = df.index.year.unique().sort_values()

df_std
            intestinalEnterococciCount
sampleTime
2014                         63.825528
2015                         37.596271
2016                         34.845224
2017                         51.384066

from scipy.stats.mstats import gmean
df_gm = pd.DataFrame([df[str(y):str(y+2)].agg(gmean) for y in df.index.year.unique()])
df_gm.index = df.index.year.unique().sort_values()

df_gm
            intestinalEnterococciCount
sampleTime
2014                         16.230186
2015                         16.136248
2016                         16.377124
2017                         19.529690
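
Because string slices on a DatetimeIndex include both endpoints, the upper bound determines how many calendar years each window spans. The sketch below, on made-up data, uses str(y + 1) as the upper bound to get strict two-year windows and computes the geometric mean with NumPy as exp(mean(log(x))). The geometric_mean function is an illustrative stand-in for scipy's gmean and assumes strictly positive values.

```python
import numpy as np
import pandas as pd

def geometric_mean(values):
    # exp of the mean of the logs; assumes all values are > 0.
    return float(np.exp(np.log(values).mean()))

# Made-up stand-in data, sorted by time (label slicing relies on a sorted index).
idx = pd.to_datetime([
    "2014-05-12", "2014-06-09",
    "2015-05-20", "2015-08-03",
    "2016-06-15",
])
col = "intestinalEnterococciCount"
df = pd.DataFrame({col: [1.0, 100.0, 10.0, 10.0, 1000.0]}, index=idx)

# The slice is inclusive: '2014':'2015' covers two years, '2014':'2016' three.
two_year_rows = len(df.loc["2014":"2015"])
three_year_rows = len(df.loc["2014":"2016"])

# One window per starting year, skipping the last year (it has no successor).
years = sorted(int(y) for y in df.index.year.unique())
gm = pd.Series({
    y: geometric_mean(df.loc[str(y):str(y + 1), col])
    for y in years[:-1]
})
print(gm)
```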
