Python: Group and count number of consecutive repetitive values in a column in a dataframe
I am trying to perform a data analysis task on a dataframe in Python. This is the dataframe I have:
df = pd.DataFrame({
    "Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"],
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
})
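For each person, I want to count how often activity "A" occurs more than twice in a row, and to compute the average time span of those runs.
That is, the target dataframe should look like this (AVGTime for P1 is calculated as ((1-0) + (6-1))/2 = 3):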
solution = pd.DataFrame({
    "Person": ["P1", "P2"],
    "Activity": ["A", "A"],
    "Count": [2, 1],
    "AVGTime": [3, 0]
})
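However, that solution does not aggregate over a column such as "Person" in my example. It also seems to perform poorly, given that my dataframe has about 7 million rows.
I would really appreciate any hints!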
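You can process the data as a stream, without creating a dataframe, and it should fit in memory. I suggest trying the convtools library (I must admit, I am the author).
Since you already have a dataframe, let's use it as the input: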
import pandas as pd
from convtools import conversion as c
from convtools.contrib.tables import Table
# fmt: off
df = pd.DataFrame({
    "Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"],
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
})
# fmt: on
# transforming DataFrame into an iterable of dicts not to allocate all rows at
# once by df.to_dict("records")
iter_rows = Table.from_rows(
    df.itertuples(index=False), header=list(df.columns)
).into_iter_rows(dict)
result = (
    # chunk by consecutive "person"+"activity" pairs
    c.chunk_by(c.item("Person"), c.item("Activity"))
    .aggregate(
        # each chunk gets transformed into a dict like this:
        {
            "Person": c.ReduceFuncs.First(c.item("Person")),
            "Activity": c.ReduceFuncs.First(c.item("Activity")),
            "length": c.ReduceFuncs.Count(),
            "time": (
                c.ReduceFuncs.Last(c.item("Time")).as_type(float)
                - c.ReduceFuncs.First(c.item("Time")).as_type(float)
            ),
        }
    )
    # remove short groups
    .filter(c.item("length") > 2)
    .pipe(
        # now group by "person"+"activity" pair to calculate avg time
        c.group_by(c.item("Person"), c.item("Activity")).aggregate(
            {
                "Person": c.item("Person"),
                "Activity": c.item("Activity"),
                "avg_time": c.ReduceFuncs.Average(c.item("time")),
                "number_of_groups": c.ReduceFuncs.Count(),
            }
        )
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(iter_rows)
)
Result:
In [37]: result
Out[37]:
[{'Person': 'P1', 'Activity': 'A', 'avg_time': 3.0, 'number_of_groups': 2},
{'Person': 'P2', 'Activity': 'A', 'avg_time': 0.0, 'number_of_groups': 1}]
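For readers who would rather not add a dependency, the same chunk-then-group idea can be sketched with the standard library alone. This is an illustrative sketch, not the convtools implementation: the function name `summarize_runs`, the `min_length` parameter, and the output keys `Count` and `AVGTime` are assumptions chosen here to match the question's target table. It relies on `itertools.groupby` merging only adjacent rows with equal keys.

```python
from collections import defaultdict
from itertools import groupby

# The question's example data, rebuilt as an iterable of dicts
# (strings are zipped character by character).
rows = [
    {"Person": p, "Activity": a, "Time": t}
    for p, a, t in zip(
        ["P1"] * 10 + ["P2"] * 10,
        "AAABAAAAAA" + "AAABAABABA",
        "0011135566" + "6666666666",
    )
]


def summarize_runs(rows, min_length=3):
    """Count runs of at least `min_length` rows and average their time span.

    Hypothetical helper: name, parameter and output keys are illustrative.
    """
    durations = defaultdict(list)
    # groupby() only merges *adjacent* rows with the same key, which gives
    # the "consecutive run" semantics. Only one run is held in memory at a time.
    for key, run in groupby(rows, key=lambda r: (r["Person"], r["Activity"])):
        times = [float(r["Time"]) for r in run]
        if len(times) >= min_length:
            # Span of the run: last timestamp minus first timestamp.
            durations[key].append(times[-1] - times[0])
    return [
        {
            "Person": person,
            "Activity": activity,
            "Count": len(spans),
            "AVGTime": sum(spans) / len(spans),
        }
        for (person, activity), spans in durations.items()
    ]


result = summarize_runs(rows)
print(result)
```

Like the pipeline above, this keeps every activity that has a long enough run; in the example data only "A" qualifies. The rows are assumed to arrive already ordered by person and time.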
Try:
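def group_func(x):
    # Label runs of consecutive equal "Activity" values within one person.
    run_id = (x["Activity"] != x["Activity"].shift()).cumsum()
    groups = []
    for _, g in x.groupby(run_id):
        # Keep only runs of activity "A" that are longer than two rows.
        if len(g) > 2 and g["Activity"].iat[0] == "A":
            groups.append(g)
    # Guard against a person with no qualifying run (avoids ZeroDivisionError).
    if groups:
        avgs = sum(g["Time"].max() - g["Time"].min() for g in groups) / len(groups)
    else:
        avgs = float("nan")
    return pd.Series(
        ["A", len(groups), avgs], index=["Activity", "Count", "AVGTime"]
    )

# "Time" is stored as strings in the example, so convert before subtracting.
df["Time"] = df["Time"].astype(int)
x = df.groupby("Person", as_index=False).apply(group_func)
print(x)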
Prints:
  Person Activity  Count  AVGTime
0     P1        A      2      3.0
1     P2        A      1      0.0
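Since the question mentions about 7 million rows, a per-group Python loop inside `apply` may be slow. The run-labelling trick (`shift` plus `cumsum`) can also be applied to the whole frame at once and followed by two built-in aggregations. The following is a sketch under that assumption, not a benchmarked solution; the intermediate names `run_id`, `runs`, `long_a` and `summary` are made up for illustration, and "Time" is entered as integers directly.

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["P1"] * 10 + ["P2"] * 10,
    "Activity": list("AAABAAAAAA" + "AAABAABABA"),
    "Time": [0, 0, 1, 1, 1, 3, 5, 5, 6, 6] + [6] * 10,
})

# A new run starts whenever Person or Activity differs from the previous row.
changed = (df["Person"] != df["Person"].shift()) | (
    df["Activity"] != df["Activity"].shift()
)
run_id = changed.cumsum().rename("run_id")

# One row per run: who, what, how many rows, and the time range covered.
runs = df.groupby(run_id).agg(
    Person=("Person", "first"),
    Activity=("Activity", "first"),
    length=("Time", "size"),
    t_min=("Time", "min"),
    t_max=("Time", "max"),
)

# Keep runs of activity "A" longer than two rows and compute their span.
long_a = runs[(runs["Activity"] == "A") & (runs["length"] > 2)]
long_a = long_a.assign(duration=long_a["t_max"] - long_a["t_min"])

# Aggregate the surviving runs per person.
summary = long_a.groupby(["Person", "Activity"], as_index=False).agg(
    Count=("duration", "size"),
    AVGTime=("duration", "mean"),
)
print(summary)
```

One difference in behaviour: a person with no qualifying run simply does not appear in `summary`, rather than showing up with a count of 0.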