Pandas: resampling data by a non-time-series column (e.g. price)

Question

Renko chart (Wikipedia): https://en.wikipedia.org/wiki/Renko_chart

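I am trying to generate a Renko chart from trade tick data. The tick data contains a timestamp, a price, and a volume. The timestamps are Unix epoch milliseconds, e.g. 1649289600174.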
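As a small aside, millisecond Unix timestamps like the one above can be turned into pandas datetimes with pd.to_datetime and unit="ms". A minimal sketch, assuming the column is called Timestamp as in the raw data; the ticks frame and the Time column are illustrative names:

```python
import pandas as pd

# Two sample ticks; the integers are Unix epoch milliseconds.
ticks = pd.DataFrame({"Timestamp": [1649289600174, 1649289600176]})

# unit="ms" tells pandas the integers count milliseconds since 1970-01-01 UTC.
ticks["Time"] = pd.to_datetime(ticks["Timestamp"], unit="ms")
print(ticks)
```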

Pandas already supports OHLC resampling by time via df.resample('10Min').agg({'Price': 'ohlc', 'Volume': 'sum'}). However, I want to resample the trade data by price, not by timestamp.
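For reference, the time-based resampling mentioned above could be sketched as follows. The sample ticks and the name bars are made up for illustration; the column names follow the raw data in the question, and the frame needs a DatetimeIndex before resample can be used.

```python
import pandas as pd

# Made-up ticks: three fall in the first 10-minute window, one in the second.
ticks = pd.DataFrame(
    {
        "Timestamp": [1649289600174, 1649289600176, 1649289900000, 1649290200000],
        "Price": [100, 105, 110, 104],
        "Volume": [100, 100, 100, 100],
    }
)
# resample() needs a datetime-like index.
ticks.index = pd.to_datetime(ticks["Timestamp"], unit="ms")

# One OHLC bar per 10-minute window, with the traded volume summed.
bars = ticks.resample("10min").agg({"Price": "ohlc", "Volume": "sum"})
print(bars)
```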

A Renko chart uses a fixed brick size. For example, with a brick_size of 10, a brick is generated each time the price moves up 10 points or down 10 points.
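To make the brick rule concrete, here is a minimal pure-Python sketch of it as described in this question: a brick closes whenever the price has moved a full brick_size away from the close of the previous brick. The function name renko_closes is hypothetical, and other Renko conventions (for example different reversal rules) are not covered.

```python
def renko_closes(prices, brick_size):
    """Return the close price of each completed brick (simplified rule)."""
    anchor = prices[0]  # close of the last completed brick
    closes = []
    for price in prices[1:]:
        # Emit one brick per full brick_size step away from the anchor.
        while abs(price - anchor) >= brick_size:
            anchor += brick_size if price > anchor else -brick_size
            closes.append(anchor)
    return closes


# Prices from the raw data in the question.
prices = [100, 105, 110, 104, 101, 100, 103, 107, 102,
          99, 93, 90, 95, 100, 105, 110, 115, 120]
print(renko_closes(prices, 10))  # [110, 100, 90, 100, 110, 120]
```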

A pandas contributor told me this can be done with a groupby with a binned grouper. However, I don't quite understand what he meant.

This is what my raw data looks like.

Timestamp           Price               Volume

1649289600174       100                 100
1649289600176       105                 100
1649289600178       110                 100
1649289600179       104                 100
1649289600181       101                 100
1649289600182       100                 100
1649289600183       103                 100
1649289600184       107                 100
1649289600185       102                 100
1649289600186        99                 100
1649289600188        93                 100
1649289600189        90                 100
1649289600192        95                 100
1649289600193       100                 100
1649289600194       105                 100
1649289600195       110                 100
1649289600196       115                 100
1649289600197       120                 100

I am looking for something like df.resample('10Numeric').agg({'Price': 'ohlc', 'Volume': 'sum'}), where 10Numeric means a brick_size of 10: whenever the price moves up 10 points or down 10 points, I want to aggregate the data in that period.

The output should look like this:

Timestamp           Open    High    Low    Close               Volume
    
1649289600178       100     110     100     110                 300
1649289600182       110     110     100     100                 300
1649289600189       100     107      90      90                 600
1649289600193        90     100      90     100                 200
1649289600195       100     110     100     110                 200
1649289600197       110     120     110     120                 200

I believe the pandas contributor was talking about pd.cut followed by a groupby. Something like this:

import pandas as pd
import numpy as np

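# 100 random prices, matching the 100-row output shown below.
df = pd.DataFrame({'price': np.random.randint(1, 100, 100)})
# Assign each price to a 10-wide interval such as (10, 20].
df['bins'] = pd.cut(x=df['price'], bins=[0, 10, 20, 30, 40, 50, 60,
                                          70, 80, 90, 100])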

The output looks like this:

      price       bins
0       92  (90, 100]
1       15   (10, 20]
2       54   (50, 60]
3       55   (50, 60]
4       72   (70, 80]
..     ...        ...
95      88   (80, 90]
96      21   (20, 30]
97      91  (90, 100]
98      51   (50, 60]
99      18   (10, 20]

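Please note that price values are not unique. Say Bitcoin was priced at $45,555 a year ago and is at the same price again this year. With a bin size of 100, both prices fall into the (45500, 45600] bin.

A groupby would put the data from a year ago and the current data into the same bin. I am looking for a solution that follows the price movement. For example, the close prices should look like 45500, 45600, 45700, 45600, 45500, 45400, 45300, 45200, 45100, 45000.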
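The problem described above can be reproduced in a few lines. This is only an illustration: 45555 comes from the example above, while 47210 and the bin edges are made-up values standing in for whatever happened in between.

```python
import pandas as pd

# Same price at the start and end, with a different price in between.
prices = pd.Series([45555, 47210, 45555], name="Price")

# 100-wide bins: (45500, 45600], (45600, 45700], ..., (47200, 47300].
bins = pd.cut(prices, bins=list(range(45500, 47400, 100)))

# Grouping by bin ignores time order, so rows 0 and 2 share one group.
counts = prices.groupby(bins, observed=True).count()
print(counts)
```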

Can someone explain what the pandas contributor meant by a groupby with a binned grouper?

Answer 1

Is this what you are looking for?

df['bins'] = pd.cut(x=df['Price'], bins=range(df['Price'].min(), df['Price'].max(), 10))
df.groupby('bins').agg({'Price': 'ohlc', 'Volume': 'sum'})

Output (with these bin edges, the rows priced 90, 115, and 120 fall outside every bin and are dropped):

           Price                 Volume
            open high  low close Volume
bins                                   
(90, 100]    100  100   93   100    600
(100, 110]   105  110  101   110    900
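One caveat with the bin edges above: range(min, max, 10) stops before the maximum price, and the default pd.cut intervals are open on the left, so rows priced at the minimum or above the last edge land in no bin. A hedged variant, assuming every row should be binned, rounds the edges outward and uses right=False. The names step, lo, hi, edges and out are illustrative, and the sample frame is a shortened stand-in for the question's data. This still groups by price level and ignores time order.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the question's data.
df = pd.DataFrame({"Price": [100, 105, 110, 90, 120], "Volume": [100] * 5})

step = 10
lo = df["Price"].min() // step * step          # round the minimum down -> 90
hi = df["Price"].max() // step * step + step   # round the maximum up   -> 130
edges = np.arange(lo, hi + step, step)         # 90, 100, 110, 120, 130

# right=False gives half-open bins [90, 100), [100, 110), ...
df["bins"] = pd.cut(df["Price"], bins=edges, right=False)

# observed=True keeps only bins that actually contain rows.
out = df.groupby("bins", observed=True).agg({"Price": "ohlc", "Volume": "sum"})
print(out)
```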

Answer 2

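You can bin the prices with pd.cut, mark each row where the bin changes, take the cumsum of those marks to label consecutive runs, and group by that label.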

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        {"Timestamp": 1649289600174, "Price": 100, "Volume": 100},
        {"Timestamp": 1649289600176, "Price": 105, "Volume": 100},
        {"Timestamp": 1649289600178, "Price": 110, "Volume": 100},
        {"Timestamp": 1649289600179, "Price": 104, "Volume": 100},
        {"Timestamp": 1649289600181, "Price": 101, "Volume": 100},
        {"Timestamp": 1649289600182, "Price": 100, "Volume": 100},
        {"Timestamp": 1649289600183, "Price": 103, "Volume": 100},
        {"Timestamp": 1649289600184, "Price": 107, "Volume": 100},
        {"Timestamp": 1649289600185, "Price": 102, "Volume": 100},
        {"Timestamp": 1649289600186, "Price": 99, "Volume": 100},
        {"Timestamp": 1649289600188, "Price": 93, "Volume": 100},
        {"Timestamp": 1649289600189, "Price": 90, "Volume": 100},
        {"Timestamp": 1649289600192, "Price": 95, "Volume": 100},
        {"Timestamp": 1649289600193, "Price": 100, "Volume": 100},
        {"Timestamp": 1649289600194, "Price": 105, "Volume": 100},
        {"Timestamp": 1649289600195, "Price": 110, "Volume": 100},
        {"Timestamp": 1649289600196, "Price": 115, "Volume": 100},
        {"Timestamp": 1649289600197, "Price": 120, "Volume": 100},
    ]
)
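# Bin prices into half-open intervals [0, 10), [10, 20), ... and keep the integer bin code.
codes = pd.cut(df["Price"], bins=np.arange(0, 200, 10), right=False).cat.codes
# A new group starts whenever the bin code differs from the previous row's code.
df.groupby((codes != codes.shift(1)).cumsum()).agg(
    {"Price": "ohlc", "Volume": "sum", "Timestamp": "min"}
)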

This gives you:

  Price                 Volume      Timestamp
   open high  low close Volume      Timestamp
1   100  105  100   105    200  1649289600174
2   110  110  110   110    100  1649289600178
3   104  107  100   102    600  1649289600179
4    99   99   90    95    400  1649289600186
5   100  105  100   105    200  1649289600193
6   110  115  110   115    200  1649289600195
7   120  120  120   120    100  1649289600197
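The groups above follow fixed price bins, so they do not line up with the six bricks in the question's expected output. As a rough sketch of a different technique (not the method from this answer), the grouping label can instead be anchored on the close of the previous brick. The function renko_bars is hypothetical. It assumes each brick closes at the price of the tick that completes it, which holds for this sample data but would need refining when a single tick jumps several bricks. Any trailing ticks that do not complete a brick are kept as a final partial row.

```python
import pandas as pd


def renko_bars(df, brick_size):
    """Aggregate ticks into bricks anchored on the previous brick's close."""
    first_price = df["Price"].iloc[0]
    anchor = first_price        # close of the last completed brick
    brick_id, ids = 0, []
    for price in df["Price"]:
        ids.append(brick_id)    # the completing tick belongs to its brick
        if abs(price - anchor) >= brick_size:
            anchor = price      # simplification: close at the tick price
            brick_id += 1
    g = df.groupby(pd.Series(ids, index=df.index))
    bars = pd.DataFrame({
        "Timestamp": g["Timestamp"].last(),
        "Close": g["Price"].last(),
        "Volume": g["Volume"].sum(),
    })
    # Each brick opens at the previous brick's close.
    bars["Open"] = bars["Close"].shift(1, fill_value=first_price)
    # High and Low include the open, as in the question's expected output.
    bars["High"] = pd.concat([g["Price"].max(), bars["Open"]], axis=1).max(axis=1)
    bars["Low"] = pd.concat([g["Price"].min(), bars["Open"]], axis=1).min(axis=1)
    return bars[["Timestamp", "Open", "High", "Low", "Close", "Volume"]]


# Raw data from the question (timestamps written as base + millisecond offset).
offsets = [174, 176, 178, 179, 181, 182, 183, 184, 185,
           186, 188, 189, 192, 193, 194, 195, 196, 197]
df = pd.DataFrame({
    "Timestamp": [1649289600000 + ms for ms in offsets],
    "Price": [100, 105, 110, 104, 101, 100, 103, 107, 102,
              99, 93, 90, 95, 100, 105, 110, 115, 120],
    "Volume": [100] * 18,
})
bars = renko_bars(df, brick_size=10)
print(bars)
```

On the question's sample data this yields six rows whose values match the expected output table in the question.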


Answer 1
1 2022-04-20 01:58:07

Answer 2
1 Accepted 2022-04-20 07:13:50
