计算时间序列数据的统计指标

Question

I have a dataframe that looks like this:我有一个看起来像这样的 dataframe：

        ID          DATE       QTD
0       71896517    2020-07-25  1
1       71896517    2020-09-14  2
2       72837949    2020-09-21  1
3       72848188    2020-11-03  1
4       73307986    2020-11-04  1
5       72848188    2020-11-16  1
6       71896517    2020-11-22  1
7       73307986    2020-11-25  1
8       73307986    2021-01-04  1
9       73307986    2021-02-04  1
10      72848188    2021-02-07  1
11      72837949    2021-02-11  1

I want to add a column to the dataframe for the monthly mean of the QTD column.我想在 dataframe 中添加一列作为QTD列的月平均值。 This metrics should be calculated for each ID individually (so every ID has its own values) and should incremement as the dataframe grows in DATE (which is the year-month-day).该指标应针对每个ID单独计算（因此每个ID都有自己的值），并且应随着 dataframe 在DATE （即年-月-日）中的增长而递增。 If the ID doesn't have activity in a certain month, I want that month to still be included on the calculated metrics.如果ID在某个月份没有活动，我希望该月仍包含在计算的指标中。

I want the end dataframe to look like this:我希望结束 dataframe 看起来像这样：

        ID          DATE       QTD    MEAN
0       71896517    2020-07-25  1     1.0
1       71896517    2020-09-14  2     1.0
2       72837949    2020-09-21  1     1.0
3       72848188    2020-11-03  1     1.0
4       73307986    2020-11-04  1     1.0
5       72848188    2020-11-16  1     2.0
6       71896517    2020-11-22  1     0.8
7       73307986    2020-11-25  1     2.0
8       73307986    2021-01-04  1     1.0
9       73307986    2021-02-04  1     1.0
10      72848188    2021-02-07  1     0.75
11      72837949    2021-02-11  1     0.33

How can I go about to implement this?我该如何 go 来实现这个？

Answer 1

There's usually an iterative solution.通常有一个迭代的解决方案。

import pandas as pd
import numpy as np

def monthdelta(a,b):
    a1,a2,a3 = (int(k) for k in a.split('-'))
    b1,b2,b3 = (int(k) for k in b.split('-'))
    return (a1*12+a2) - (b1*12+b2)

data = [
[ 71896517, "2020-07-25", 1 ],
[ 71896517, "2020-09-14", 2 ],
[ 72837949, "2020-09-21", 1 ],
[ 72848188, "2020-11-03", 1 ],
[ 73307986, "2020-11-04", 1 ],
[ 72848188, "2020-11-16", 1 ],
[ 71896517, "2020-11-22", 1 ],
[ 73307986, "2020-11-25", 1 ],
[ 73307986, "2021-01-04", 1 ],
[ 73307986, "2021-02-04", 1 ],
[ 72848188, "2021-02-07", 1 ],
[ 72837949, "2021-02-11", 1 ],
]

df = pd.DataFrame( data, columns=["ID", "DATE", "QTD"] )

startdate = {}
sums = {}
sumsqs = {}
num = {}
stdev = []
means = []

for row in df.T.iteritems():
    id = row[1]['ID']
    if id not in startdate:
        num[id] = 1
        startdate[id] = row[1]['DATE']
        sums[id] = row[1]['QTD']
        sumsqs[id] = row[1]['QTD'] * row[1]['QTD']
        means.append( row[1]['QTD'] )
        stdev.append( 0 )
    else:
        num[id] += 1
        sums[id] += row[1]['QTD']
        sumsqs[id] += row[1]['QTD'] * row[1]['QTD']
        delta = monthdelta(row[1]['DATE'],startdate[id]) + 1
        means.append( sums[id] / delta )
        if delta == 1:
            stdev.append( 0 )
        else:
            stdev.append( np.sqrt((delta*sumsqs[id] - sums[id]*sums[id])/delta))

print( means )
df['MEAN'] = pd.Series(means)
print( stdev )
df['STDEV'] = pd.Series(stdev)
print( df )

Output: Output：

[1, 1.0, 1, 1, 1, 2.0, 0.8, 2.0, 1.0, 1.0, 0.75, 0.3333333333333333]
[0, 1.4142135623730951, 0, 0, 0, 0, 1.6733200530681511, 0, 0.0, 0.0, 0.8660254037844386, 1.1547005383792515]
          ID        DATE  QTD      MEAN     STDEV
0   71896517  2020-07-25    1  1.000000  0.000000
1   71896517  2020-09-14    2  1.000000  1.414214
2   72837949  2020-09-21    1  1.000000  0.000000
3   72848188  2020-11-03    1  1.000000  0.000000
4   73307986  2020-11-04    1  1.000000  0.000000
5   72848188  2020-11-16    1  2.000000  0.000000
6   71896517  2020-11-22    1  0.800000  1.673320
7   73307986  2020-11-25    1  2.000000  0.000000
8   73307986  2021-01-04    1  1.000000  0.000000
9   73307986  2021-02-04    1  1.000000  0.000000
10  72848188  2021-02-07    1  0.750000  0.866025
11  72837949  2021-02-11    1  0.333333  1.154701

计算时间序列数据的统计指标

问题描述

1 个解决方案

解决方案1
0 已采纳 2021-04-21 19:49:34

计算时间序列数据的统计指标

问题描述

1 个解决方案

解决方案1 0 已采纳 2021-04-21 19:49:34

解决方案1
0 已采纳 2021-04-21 19:49:34