Python Pandas groupby or rolling multi year average summary statistics
I have a pandas time series data frame with approximately 20 rows for each year, from 2014 to 2017, and I'm trying to calculate the mean value for each two-year period. For example: 01/1/2014 ... 31/12/2015, 01/1/2015 ... 31/12/2016, 01/1/2016 ... 31/12/2017
Here is the code I'm using to import the DataFrame:
import pandas as pd
infile = 'https://environment.data.gov.uk/bwq/downloadAPI/requestDownload?report=samples&bw=ukj2100-14950&to=2018-02-05&from=2014-05-01'
df = pd.read_csv(infile,compression='zip',usecols=['intestinalEnterococciCount','sampleTime'], parse_dates=['sampleTime'],infer_datetime_format=True,index_col=['sampleTime'],na_values=True)
and an example of the DataFrame:
intestinalEnterococciCount
sampleTime
2014-05-12 13:00:00 10
2014-05-21 12:27:00 10
2014-05-27 10:55:00 10
2014-06-06 12:19:00 10
2014-06-09 13:26:00 10
I would like to calculate the mean value for each two-year period. The expected answers would be:
Period Mean
Jan 2014 - Dec 2015: 33.575
Jan 2015 - Dec 2016: 22.85
Jan 2016 - Dec 2017: 25.5
What I tried:
.rolling, but that appears to give a rolling mean that advances row by row rather than over two-year periods.
groupby(df.index.year).mean to get the mean for each year, but how would I go about calculating it for each two-year period?
You can use groupby and rolling. Make sure you record the count and sum for each year so that the mean can be calculated afterwards. To relabel the rows, assign the index you want with s.index=[your index list].
s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.agg(['sum','count'])
s=s.rolling(window=2).sum()
s['mean']=s['sum']/s['count']
s.dropna()
Out[564]:
sum count mean
2015 1343.0 40.0 33.575
2016 914.0 40.0 22.850
2017 765.0 30.0 25.500
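The sum-and-count approach above can be tried on a small made-up sample. This is only a sketch: the frame `toy`, the column `count_value` and all the values are hypothetical stand-ins for the real bathing-water data, and the period labels are one possible way to relabel the index.

```python
import pandas as pd

# Synthetic stand-in for the real data (all values are made up).
idx = pd.to_datetime([
    "2014-05-12", "2014-06-09",
    "2015-05-20", "2015-07-01",
    "2016-05-18", "2016-08-03",
    "2017-06-07",
])
toy = pd.DataFrame({"count_value": [10, 30, 20, 40, 50, 10, 70]}, index=idx)

# Per-year sum and number of samples.
per_year = toy.groupby(toy.index.year)["count_value"].agg(["sum", "count"])

# Add each year to the previous one, then divide to get the pooled mean.
pooled = per_year.rolling(window=2).sum().dropna()
pooled["mean"] = pooled["sum"] / pooled["count"]

# Relabel each row from its closing year to an explicit period label.
pooled.index = [f"{year - 1}-{year}" for year in pooled.index]
print(pooled)
```

Carrying the sum and count separately matters when the years have different numbers of samples: averaging the two yearly means would weight a sparse year as heavily as a dense one.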
Update:
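# Collect each year's samples into a list, one list per year
s=df.groupby(df.index.strftime('%Y')).intestinalEnterococciCount.apply(list)
# Concatenate each year's list with the previous year's, then take the std per window
(s+s.shift()).dropna().apply(pd.Series).stack().groupby(level=0).std()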
Out[601]:
2015 76.472179
2016 33.701974
2017 34.845224
dtype: float64
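For statistics that cannot be rebuilt from a per-year sum and count, another option is to pool the raw samples of each pair of consecutive years and apply the statistic directly. The helper below is a sketch under that assumption; `two_year_stat` and the sample values are hypothetical and not part of pandas.

```python
import pandas as pd

def two_year_stat(series, func):
    """Apply func to the pooled samples of each pair of consecutive calendar years."""
    years = series.index.year
    present = sorted(years.unique())
    results = {}
    for first in present:
        second = first + 1
        if second not in present:
            continue  # no following year to pair with
        window = series[(years >= first) & (years <= second)]
        results[f"{first}-{second}"] = func(window)
    return pd.Series(results)

# Made-up sample: two readings per year for 2014-2016.
idx = pd.to_datetime([
    "2014-05-12", "2014-06-09",
    "2015-05-20", "2015-07-01",
    "2016-05-18", "2016-08-03",
])
toy = pd.Series([10.0, 30.0, 20.0, 40.0, 50.0, 10.0], index=idx)

# Sample standard deviation (pandas default, ddof=1) per two-year window.
print(two_year_stat(toy, pd.Series.std))
```

Any callable that accepts a Series can be passed as `func`, for example `pd.Series.median`.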
To get other aggregate statistics like standard deviation and geometric mean, here's a somewhat hackish way:
df_std = pd.DataFrame([df[str(y):str(y+2)].std() for y in df.index.year.unique()])
df_std.index = df.index.year.unique().sort_values()
df_std
intestinalEnterococciCount
sampleTime
2014 63.825528
2015 37.596271
2016 34.845224
2017 51.384066
from scipy.stats.mstats import gmean
df_gm = pd.DataFrame([df[str(y):str(y+2)].agg(gmean) for y in df.index.year.unique()])
df_gm.index = df.index.year.unique().sort_values()
df_gm
intestinalEnterococciCount
sampleTime
2014 16.230186
2015 16.136248
2016 16.377124
2017 19.529690
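One thing worth checking with the slicing approach: label slices on a DatetimeIndex include both endpoints, so a slice from str(y) to str(y+2) spans three calendar years, and str(y+1) would be the upper bound for a two-year window. The sketch below illustrates this on made-up data; `toy` and `geometric_mean` are hypothetical names, and the NumPy-based geometric mean assumes strictly positive values.

```python
import numpy as np
import pandas as pd

# Made-up sample: one reading per year, sorted by date.
idx = pd.to_datetime(["2014-05-12", "2015-05-20", "2016-05-18", "2017-06-07"])
toy = pd.Series([10.0, 40.0, 160.0, 80.0], index=idx)

def geometric_mean(values):
    """Geometric mean via logs; only valid for strictly positive values."""
    return float(np.exp(np.log(values).mean()))

# Both endpoints are included, so "2014":"2016" covers three years.
three_years = toy.loc["2014":"2016"]
two_years = toy.loc["2014":"2015"]

print(len(three_years), geometric_mean(three_years))
print(len(two_years), geometric_mean(two_years))
```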