[英]Calculate Statistical Metrics On Time-Series Data
I have a dataframe that looks like this:我有一个看起来像这样的 dataframe:
ID DATE QTD
0 71896517 2020-07-25 1
1 71896517 2020-09-14 2
2 72837949 2020-09-21 1
3 72848188 2020-11-03 1
4 73307986 2020-11-04 1
5 72848188 2020-11-16 1
6 71896517 2020-11-22 1
7 73307986 2020-11-25 1
8 73307986 2021-01-04 1
9 73307986 2021-02-04 1
10 72848188 2021-02-07 1
11 72837949 2021-02-11 1
I want to add a column to the dataframe for the monthly mean of the QTD
column.我想在 dataframe 中添加一列作为
QTD
列的月平均值。 This metrics should be calculated for each ID
individually (so every ID
has its own values) and should incremement as the dataframe grows in DATE
(which is the year-month-day).该指标应针对每个
ID
单独计算(因此每个ID
都有自己的值),并且应随着 dataframe 在DATE
(即年-月-日)中的增长而递增。 If the ID
doesn't have activity in a certain month, I want that month to still be included on the calculated metrics.如果
ID
在某个月份没有活动,我希望该月仍包含在计算的指标中。
I want the end dataframe to look like this:我希望结束 dataframe 看起来像这样:
ID DATE QTD MEAN
0 71896517 2020-07-25 1 1.0
1 71896517 2020-09-14 2 1.0
2 72837949 2020-09-21 1 1.0
3 72848188 2020-11-03 1 1.0
4 73307986 2020-11-04 1 1.0
5 72848188 2020-11-16 1 2.0
6 71896517 2020-11-22 1 0.8
7 73307986 2020-11-25 1 2.0
8 73307986 2021-01-04 1 1.0
9 73307986 2021-02-04 1 1.0
10 72848188 2021-02-07 1 0.75
11 72837949 2021-02-11 1 0.33
How can I go about to implement this?我该如何 go 来实现这个?
There's usually an iterative solution.通常有一个迭代的解决方案。
import pandas as pd
import numpy as np
def monthdelta(a,b):
a1,a2,a3 = (int(k) for k in a.split('-'))
b1,b2,b3 = (int(k) for k in b.split('-'))
return (a1*12+a2) - (b1*12+b2)
data = [
[ 71896517, "2020-07-25", 1 ],
[ 71896517, "2020-09-14", 2 ],
[ 72837949, "2020-09-21", 1 ],
[ 72848188, "2020-11-03", 1 ],
[ 73307986, "2020-11-04", 1 ],
[ 72848188, "2020-11-16", 1 ],
[ 71896517, "2020-11-22", 1 ],
[ 73307986, "2020-11-25", 1 ],
[ 73307986, "2021-01-04", 1 ],
[ 73307986, "2021-02-04", 1 ],
[ 72848188, "2021-02-07", 1 ],
[ 72837949, "2021-02-11", 1 ],
]
df = pd.DataFrame( data, columns=["ID", "DATE", "QTD"] )
startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []
for row in df.T.iteritems():
id = row[1]['ID']
if id not in startdate:
num[id] = 1
startdate[id] = row[1]['DATE']
sums[id] = row[1]['QTD']
sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
means.append( row[1]['QTD'] )
stdev.append( 0 )
else:
num[id] += 1
sums[id] += row[1]['QTD']
sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
means.append( sums[id] / delta )
if delta == 1:
stdev.append( 0 )
else:
stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))
print( means )
df['MEAN'] = pd.Series(means)
print( stdev )
df['STDEV'] = pd.Series(stdev)
print( df )
Output: Output:
[1, 1.0, 1, 1, 1, 2.0, 0.8, 2.0, 1.0, 1.0, 0.75, 0.3333333333333333]
[0, 1.4142135623730951, 0, 0, 0, 0, 1.6733200530681511, 0, 0.0, 0.0, 0.8660254037844386, 1.1547005383792515]
ID DATE QTD MEAN STDEV
0 71896517 2020-07-25 1 1.000000 0.000000
1 71896517 2020-09-14 2 1.000000 1.414214
2 72837949 2020-09-21 1 1.000000 0.000000
3 72848188 2020-11-03 1 1.000000 0.000000
4 73307986 2020-11-04 1 1.000000 0.000000
5 72848188 2020-11-16 1 2.000000 0.000000
6 71896517 2020-11-22 1 0.800000 1.673320
7 73307986 2020-11-25 1 2.000000 0.000000
8 73307986 2021-01-04 1 1.000000 0.000000
9 73307986 2021-02-04 1 1.000000 0.000000
10 72848188 2021-02-07 1 0.750000 0.866025
11 72837949 2021-02-11 1 0.333333 1.154701
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.