[英]Calculate Statistical Metrics On Time-Series Data
我有一個看起來像這樣的 dataframe:
ID DATE QTD
0 71896517 2020-07-25 1
1 71896517 2020-09-14 2
2 72837949 2020-09-21 1
3 72848188 2020-11-03 1
4 73307986 2020-11-04 1
5 72848188 2020-11-16 1
6 71896517 2020-11-22 1
7 73307986 2020-11-25 1
8 73307986 2021-01-04 1
9 73307986 2021-02-04 1
10 72848188 2021-02-07 1
11 72837949 2021-02-11 1
我想在 dataframe 中添加一列作為QTD
列的月平均值。 該指標應針對每個ID
單獨計算(因此每個ID
都有自己的值),並且應隨着 dataframe 在DATE
(即年-月-日)中的增長而遞增。 如果ID
在某個月份沒有活動,我希望該月仍包含在計算的指標中。
我希望結束 dataframe 看起來像這樣:
ID DATE QTD MEAN
0 71896517 2020-07-25 1 1.0
1 71896517 2020-09-14 2 1.0
2 72837949 2020-09-21 1 1.0
3 72848188 2020-11-03 1 1.0
4 73307986 2020-11-04 1 1.0
5 72848188 2020-11-16 1 2.0
6 71896517 2020-11-22 1 0.8
7 73307986 2020-11-25 1 2.0
8 73307986 2021-01-04 1 1.0
9 73307986 2021-02-04 1 1.0
10 72848188 2021-02-07 1 0.75
11 72837949 2021-02-11 1 0.33
我該如何 go 來實現這個?
通常有一個迭代的解決方案。
import pandas as pd
import numpy as np
def monthdelta(a,b):
a1,a2,a3 = (int(k) for k in a.split('-'))
b1,b2,b3 = (int(k) for k in b.split('-'))
return (a1*12+a2) - (b1*12+b2)
data = [
[ 71896517, "2020-07-25", 1 ],
[ 71896517, "2020-09-14", 2 ],
[ 72837949, "2020-09-21", 1 ],
[ 72848188, "2020-11-03", 1 ],
[ 73307986, "2020-11-04", 1 ],
[ 72848188, "2020-11-16", 1 ],
[ 71896517, "2020-11-22", 1 ],
[ 73307986, "2020-11-25", 1 ],
[ 73307986, "2021-01-04", 1 ],
[ 73307986, "2021-02-04", 1 ],
[ 72848188, "2021-02-07", 1 ],
[ 72837949, "2021-02-11", 1 ],
]
df = pd.DataFrame( data, columns=["ID", "DATE", "QTD"] )
startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []
for row in df.T.iteritems():
id = row[1]['ID']
if id not in startdate:
num[id] = 1
startdate[id] = row[1]['DATE']
sums[id] = row[1]['QTD']
sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
means.append( row[1]['QTD'] )
stdev.append( 0 )
else:
num[id] += 1
sums[id] += row[1]['QTD']
sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
means.append( sums[id] / delta )
if delta == 1:
stdev.append( 0 )
else:
stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))
print( means )
df['MEAN'] = pd.Series(means)
print( stdev )
df['STDEV'] = pd.Series(stdev)
print( df )
Output:
[1, 1.0, 1, 1, 1, 2.0, 0.8, 2.0, 1.0, 1.0, 0.75, 0.3333333333333333]
[0, 1.4142135623730951, 0, 0, 0, 0, 1.6733200530681511, 0, 0.0, 0.0, 0.8660254037844386, 1.1547005383792515]
ID DATE QTD MEAN STDEV
0 71896517 2020-07-25 1 1.000000 0.000000
1 71896517 2020-09-14 2 1.000000 1.414214
2 72837949 2020-09-21 1 1.000000 0.000000
3 72848188 2020-11-03 1 1.000000 0.000000
4 73307986 2020-11-04 1 1.000000 0.000000
5 72848188 2020-11-16 1 2.000000 0.000000
6 71896517 2020-11-22 1 0.800000 1.673320
7 73307986 2020-11-25 1 2.000000 0.000000
8 73307986 2021-01-04 1 1.000000 0.000000
9 73307986 2021-02-04 1 1.000000 0.000000
10 72848188 2021-02-07 1 0.750000 0.866025
11 72837949 2021-02-11 1 0.333333 1.154701
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.