Python: Counting runs of consecutive repeated values in a DataFrame column, grouped by another column

Question

I would like to perform a data analysis task on a DataFrame in Python. This is the DataFrame I have:

df = pd.DataFrame({"Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"], 
                   "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
                   "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
                   })

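I want to

find, for each person, the number of groups in which activity "A" is repeated consecutively more than 2 times, and
calculate the average duration of those consecutive "A" groups, i.e. the sum over the groups of (end time minus start time), divided by the number of groups.

The target result DataFrame should therefore look like this (AVGTime for P1 is calculated as ((1 - 0) + (6 - 1)) / 2):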

solution = pd.DataFrame({"Person": ["P1", "P2"],
                    "Activity": ["A", "A"],
                    "Count": [2, 1], 
                    "AVGTime": [3, 0]})

I know there is a closely related solution here: https://datascience-stackexchange-com.translate.goog/questions/41428/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe?_x_tr_sl=en&_x_tr_tl=de&_x_tr_hl=de&_x_tr_pto=sc

However, that solution does not aggregate over a column such as "Person" in my example. It also seems to perform poorly, given that my DataFrame has around 7 million rows.

I would really appreciate any hints!

Answer 1

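You can process the data as a stream without creating a DataFrame; it should fit in memory. I suggest trying the convtools library (I must admit, I am the author).

Since you already have a DataFrame, let's use it as the input: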

import pandas as pd

from convtools import conversion as c
from convtools.contrib.tables import Table


# fmt: off
df = pd.DataFrame({
    "Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"], 
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
})
# fmt: on

# transforming DataFrame into an iterable of dicts not to allocate all rows at
# once by df.to_dict("records")
iter_rows = Table.from_rows(
    df.itertuples(index=False), header=list(df.columns)
).into_iter_rows(dict)


result = (
    # chunk by consecutive "person"+"activity" pairs
    c.chunk_by(c.item("Person"), c.item("Activity"))
    .aggregate(
        # each chunk gets transformed into a dict like this:
        {
            "Person": c.ReduceFuncs.First(c.item("Person")),
            "Activity": c.ReduceFuncs.First(c.item("Activity")),
            "length": c.ReduceFuncs.Count(),
            "time": (
                c.ReduceFuncs.Last(c.item("Time")).as_type(float)
                - c.ReduceFuncs.First(c.item("Time")).as_type(float)
            ),
        }
    )
    # remove short groups
    .filter(c.item("length") > 2)
    .pipe(
        # now group by "person"+"activity" pair to calculate avg time
        c.group_by(c.item("Person"), c.item("Activity")).aggregate(
            {
                "Person": c.item("Person"),
                "Activity": c.item("Activity"),
                "avg_time": c.ReduceFuncs.Average(c.item("time")),
                "number_of_groups": c.ReduceFuncs.Count(),
            }
        )
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(iter_rows)
)

Result:

In [37]: result
Out[37]:
[{'Person': 'P1', 'Activity': 'A', 'avg_time': 3.0, 'number_of_groups': 2},
 {'Person': 'P2', 'Activity': 'A', 'avg_time': 0.0, 'number_of_groups': 1}]
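
For readers without convtools installed, the same streaming idea can be sketched with the standard library alone. This is not the convtools implementation, only an illustration of chunking consecutive rows and aggregating per key; the function name `summarize_runs` and the `min_length` parameter are made up for this example, and the rows are assumed to arrive as (person, activity, time) tuples in the order in which runs should be detected.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def summarize_runs(rows, min_length=3):
    """Summarize long runs of identical (person, activity) pairs.

    `rows` is any iterable of (person, activity, time) tuples; it is
    consumed lazily, one run at a time.
    """
    # (person, activity) -> [number of long runs, summed run duration]
    totals = defaultdict(lambda: [0, 0.0])
    # itertools.groupby yields one group per run of consecutive equal keys.
    for key, run in groupby(rows, key=itemgetter(0, 1)):
        first = last = next(run)
        length = 1
        for last in run:
            length += 1
        if length >= min_length:  # "more than 2" consecutive rows
            totals[key][0] += 1
            totals[key][1] += float(last[2]) - float(first[2])
    return [
        {"Person": p, "Activity": a, "number_of_groups": n, "avg_time": total / n}
        for (p, a), (n, total) in totals.items()
    ]


# The sample data from the question, as three parallel sequences.
persons = ["P1"] * 10 + ["P2"] * 10
activities = list("AAABAAAAAA" + "AAABAABABA")
times = "0 0 1 1 1 3 5 5 6 6 6 6 6 6 6 6 6 6 6 6".split()

result = summarize_runs(zip(persons, activities, times))
print(result)
```

Like the conversion above, this keeps every (person, activity) pair that has a long enough run, not only activity "A"; add a filter on the key if only "A" is wanted.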

Answer 2

Try:

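def group_func(x):
    # Collect runs of consecutive "A" rows that are longer than 2 rows.
    groups = []
    # A new run id starts whenever "Activity" differs from the previous row.
    for _, g in x.groupby((x["Activity"] != x["Activity"].shift()).cumsum()):
        if len(g) > 2 and g["Activity"].iat[0] == "A":
            groups.append(g)

    # Average of (max - min) Time per run, which equals end - start when Time
    # never decreases within a run. NaN if a person has no qualifying run,
    # which avoids a ZeroDivisionError.
    if groups:
        avgs = sum(g["Time"].max() - g["Time"].min() for g in groups) / len(groups)
    else:
        avgs = float("nan")

    return pd.Series(
        ["A", len(groups), avgs], index=["Activity", "Count", "AVGTime"]
    )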


df["Time"] = df["Time"].astype(int)
x = df.groupby("Person", as_index=False).apply(group_func)
print(x)

Prints:

  Person Activity  Count  AVGTime
0     P1        A      2      3.0
1     P2        A      1      0.0
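
Since the question mentions around 7 million rows, a variant without a Python-level loop per run may be worth trying. The following is a minimal sketch, not a benchmarked solution: it assumes "Time" is already numeric and that rows are ordered so that each person's rows are contiguous. The names `run_id`, `runs`, `long_a` and `summary` are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["P1"] * 10 + ["P2"] * 10,
    "Activity": list("AAABAAAAAA" + "AAABAABABA"),
    "Time": [0, 0, 1, 1, 1, 3, 5, 5, 6, 6] + [6] * 10,
})

# A new run starts whenever Person or Activity differs from the previous row.
new_run = (df["Person"] != df["Person"].shift()) | (
    df["Activity"] != df["Activity"].shift()
)
run_id = new_run.cumsum().rename("run_id")

# One row per run: who, what, how many rows, and first/last Time.
runs = df.groupby(run_id).agg(
    Person=("Person", "first"),
    Activity=("Activity", "first"),
    length=("Time", "size"),
    start=("Time", "first"),
    end=("Time", "last"),
)
runs["duration"] = runs["end"] - runs["start"]

# Keep runs of "A" longer than 2 rows, then summarize per person.
long_a = runs[(runs["Activity"] == "A") & (runs["length"] > 2)]
summary = long_a.groupby(["Person", "Activity"], as_index=False).agg(
    Count=("duration", "size"),
    AVGTime=("duration", "mean"),
)
print(summary)
```

One difference from `group_func`: a person with no qualifying run does not appear in `summary` at all, so reindex against the full list of persons if such rows are needed.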


Answer 1
0 2022-11-25 22:35:54

Answer 2
0 Accepted 2022-11-25 23:06:30
