
Python: Group and count number of consecutive repetitive values in a column in a dataframe

I would like to perform a data analysis task on a dataframe in Python. This is the dataframe I have:

df = pd.DataFrame({"Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"], 
                   "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
                   "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
                   })

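I want to

  • find, per person, the number of groups in which activity "A" is repeated consecutively more than 2 times, and
  • calculate the average duration of those consecutive "A" groups, i.e. the sum of (end time minus start time) over all groups, divided by the number of groups

That is, the target dataframe should look like this (AVGTime for P1 is calculated as (1-0 + 6-1)/2):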

solution = pd.DataFrame({"Person": ["P1", "P2"],
                    "Activity": ["A", "A"],
                    "Count": [2, 1], 
                    "AVGTime": [3, 0]})

I know there is a close solution here: https://datascience-stackexchange-com.translate.goog/questions/41428/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe?_x_tr_sl=en&_x_tr_tl=de&_x_tr_hl=de&_x_tr_pto=sc

However, that solution does not aggregate over a column such as "Person" in my example. It also does not seem to perform well, given that my dataframe has about 7 million rows.

I would really appreciate any hints!

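You can process the data as a stream without creating a dataframe, so it should fit in memory. I would suggest trying the convtools library (I must admit, I am the author).

Since you already have a dataframe, let's use it as the input: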

import pandas as pd

from convtools import conversion as c
from convtools.contrib.tables import Table


# fmt: off
df = pd.DataFrame({
    "Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"], 
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
})
# fmt: on

# transforming DataFrame into an iterable of dicts not to allocate all rows at
# once by df.to_dict("records")
iter_rows = Table.from_rows(
    df.itertuples(index=False), header=list(df.columns)
).into_iter_rows(dict)


result = (
    # chunk by consecutive "person"+"activity" pairs
    c.chunk_by(c.item("Person"), c.item("Activity"))
    .aggregate(
        # each chunk gets transformed into a dict like this:
        {
            "Person": c.ReduceFuncs.First(c.item("Person")),
            "Activity": c.ReduceFuncs.First(c.item("Activity")),
            "length": c.ReduceFuncs.Count(),
            "time": (
                c.ReduceFuncs.Last(c.item("Time")).as_type(float)
                - c.ReduceFuncs.First(c.item("Time")).as_type(float)
            ),
        }
    )
    # remove short groups
    .filter(c.item("length") > 2)
    .pipe(
        # now group by "person"+"activity" pair to calculate avg time
        c.group_by(c.item("Person"), c.item("Activity")).aggregate(
            {
                "Person": c.item("Person"),
                "Activity": c.item("Activity"),
                "avg_time": c.ReduceFuncs.Average(c.item("time")),
                "number_of_groups": c.ReduceFuncs.Count(),
            }
        )
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(iter_rows)
)

Result:

In [37]: result
Out[37]:
[{'Person': 'P1', 'Activity': 'A', 'avg_time': 3.0, 'number_of_groups': 2},
 {'Person': 'P2', 'Activity': 'A', 'avg_time': 0.0, 'number_of_groups': 1}]
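
For readers without convtools, the same chunk-then-group idea can be sketched with only the standard library. This is an illustration of the streaming logic, not the library's implementation; the function name `summarize_runs` and its `min_len` parameter are made up for this example, and the rows are assumed to arrive already ordered as in the dataframe.

```python
from collections import defaultdict
from itertools import groupby


def summarize_runs(rows, min_len=3):
    """Stream over (person, activity, time) tuples that are already in order.

    Returns {(person, activity): (number_of_runs, average_duration)} counting
    only consecutive runs of at least `min_len` rows ("more than 2 times").
    """
    totals = defaultdict(lambda: [0, 0.0])  # key -> [run count, summed duration]
    # groupby only merges *adjacent* rows with an equal key, i.e. consecutive runs
    for key, run in groupby(rows, key=lambda r: (r[0], r[1])):
        first = last = next(run)
        length = 1
        for last in run:  # consume the run lazily, remembering its last row
            length += 1
        if length >= min_len:
            totals[key][0] += 1
            totals[key][1] += float(last[2]) - float(first[2])
    return {k: (n, total / n) for k, (n, total) in totals.items()}


persons = ["P1"] * 10 + ["P2"] * 10
activities = list("AAABAAAAAA" + "AAABAABABA")
times = ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6"] + ["6"] * 10

result = summarize_runs(zip(persons, activities, times))
print(result)
# {('P1', 'A'): (2, 3.0), ('P2', 'A'): (1, 0.0)}
```

Only one run is held at a time, so memory use does not grow with the number of rows. Like the conversion above, this reports every activity that has a long enough run, not only "A".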

Try:

def group_func(x):
    groups = []
    for _, g in x.groupby((x["Activity"] != x["Activity"].shift()).cumsum()):
        if len(g) > 2 and g["Activity"].iat[0] == "A":
            groups.append(g)

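    # Guard against a person with no qualifying run of "A" (avoids ZeroDivisionError)
    avgs = (
        sum(g["Time"].max() - g["Time"].min() for g in groups) / len(groups)
        if groups
        else float("nan")
    )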

    return pd.Series(
        ["A", len(groups), avgs], index=["Activity", "Count", "AVGTime"]
    )


df["Time"] = df["Time"].astype(int)
x = df.groupby("Person", as_index=False).apply(group_func)
print(x)

Prints:

  Person Activity  Count  AVGTime
0     P1        A      2      3.0
1     P2        A      1      0.0
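
Since the question mentions about 7 million rows, a variant that avoids calling a Python function per person may be worth trying. The following is a sketch, not a benchmarked solution: it assumes "Time" is already numeric and the rows are in chronological order per person, and the names `run_id`, `runs`, `long_a` and `summary` are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["P1"] * 10 + ["P2"] * 10,
    "Activity": list("AAABAAAAAA" + "AAABAABABA"),
    "Time": [0, 0, 1, 1, 1, 3, 5, 5, 6, 6] + [6] * 10,
})

# A new run starts whenever Person or Activity differs from the previous row.
new_run = (df["Person"] != df["Person"].shift()) | (
    df["Activity"] != df["Activity"].shift()
)
run_id = new_run.cumsum().rename("run_id")

# One row per consecutive run: its length and its first/last time.
runs = (
    df.groupby(["Person", "Activity", run_id], sort=False)
    .agg(length=("Time", "size"), start=("Time", "first"), end=("Time", "last"))
    .reset_index()
)

# Keep runs of "A" longer than 2 rows, then aggregate per person.
long_a = runs[(runs["Activity"] == "A") & (runs["length"] > 2)]
long_a = long_a.assign(duration=long_a["end"] - long_a["start"])
summary = long_a.groupby(["Person", "Activity"], as_index=False).agg(
    Count=("duration", "count"),
    AVGTime=("duration", "mean"),
)
print(summary)
```

On the sample data this gives the same Count and AVGTime as above. A person with no qualifying run simply does not appear in `summary`.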

