What is the most efficient way to check several conditions in columns in a pandas dataframe?
Efficient way to apply several functions to Pandas DataFrame returning several columns
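I have a large data source that I am trying to enrich by creating some calculated columns.
The data source has close to 4 million rows, and I am pulling the data in chunks of 100,000.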
for field in fields:
    operation_start = time.time()
    print(f"Operation {y+1}")
    chunk[field] = chunk.apply(operations[y], axis = 1)
    print("Operation completed in " + str(round(time.time()-operation_start,2)) + " seconds")
    operation_start = time.time()
    y = y +1
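The for loop runs over the 9 functions I defined, each of which returns the value for one new column.
Is there a more efficient way to capture all the new fields at once, applying everything in a single pass instead of one by one? One important note: a couple of the functions need values created by other functions.
I have tried pandarallel, but it made the process slower.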
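Edit:
I managed to write a function and use pandarallel, following the information in "Apply multiple functions to multiple groupby columns", but it is not appending the new columns.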
def Operations(row):
    import pandas as pd
    d = {}
    d["Operation A "] = Operation_A(row)
    d["Operation B"] = Operation_B(row, d["Operation A "])
    d["Operation C"] = Operation_C(row)
    d["Operation D"] = Operation_D(row)
    d["Operation E"] = Operation_E(row)
    d["Operation F"] = Operation_F(row)
    d["Operation G"] = Operation_G(row, d["Operation F"], d["Operation D"], d["Operation B"])
    d["Operation H"] = Operation_H(row)
    d["Operation I"] = Operation_I(row, d["Operation H"])
    return pd.Series(d, index= ["Operation A ", "Operation B", "Operation C", "Operation D", "Operation E", "Operation F", "Operation G", "Operation H", "Operation I"])
chunk.parallel_apply(Operations)
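All the operations do string comparisons and return strings. I can't provide an example, because the functions for all these operations add up to more than 400 lines of code :S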
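You can process the file as a stream, without building a DataFrame. The convtools library has a Table helper (table docs | github). Keep in mind that it reads the file as is and does not infer types; it works the same way as reading the file yourself with csv.reader.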
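To illustrate the "no type inference" point, here is a small standard-library sketch of what reading with csv.reader gives you. The in-memory sample data is a hypothetical stand-in for tmp/in.csv, and this only shows csv.reader itself, not convtools.

```python
import csv
import io

# Hypothetical in-memory stand-in for tmp/in.csv.
raw = io.StringIO("a,b\n1,abc\n3,xyz\n")

reader = csv.reader(raw)
header = next(reader)   # first line holds the column names
rows = list(reader)     # remaining lines, one list of fields per row

# Every field comes back as a str; "1" is not converted to the integer 1.
all_strings = all(isinstance(value, str) for row in rows for value in row)

# So membership checks must compare against strings, not numbers.
a_values = {"1", "2"}
first_a_matches = rows[0][0] in a_values            # compares "1" with "1"
first_a_matches_as_int = int(rows[0][0]) in a_values  # compares 1 with "1"
```

This is why the lookup set in the example that follows holds the strings "1" and "2" rather than integers.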
from convtools.contrib.tables import Table
from convtools import conversion as c
def Operation_A(x):
    return x == "abc"

a_values = {"1", "2"}

def Operation_B(x, y):
    return x in a_values and y

table = Table.from_csv("tmp/in.csv", header=True)
# # or if from pd.DataFrame -- unnecessary RAM usage, because Table works with
# # the stream of data
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
# table = Table.from_rows(df.itertuples(index=False), header=list(df.columns))
table.update(
    **{
        "Operation A": c.call_func(Operation_A, c.col("b")),
        "Operation B": c.call_func(
            Operation_B, c.col("a"), c.col("Operation A")
        ),
    }
).into_csv("tmp/out.csv")
# .into_iter_rows(dict)
# .into_iter_rows(tuple, include_header=True)
I'd recommend using convtools conversions instead:
(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        **{
            "Operation A": c.col("b") == "abc",
            "Operation B": c.and_(
                c.col("a").in_(c.naive(a_values)), c.col("Operation A")
            ),
        }
    )
    .into_csv("tmp/out.csv")
)
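If you would rather stay in pandas, here is a minimal sketch of one way to get the columns from the question's edit appended. It assumes the likely cause is twofold: apply is called without axis=1, so the function receives columns rather than rows, and the result is never assigned, since apply returns a new object instead of modifying chunk. The operation_a and operation_b functions and the sample data are hypothetical stand-ins, not the asker's code.

```python
import pandas as pd

# Hypothetical stand-ins for the question's Operation_* functions.
def operation_a(row):
    return "match" if row["b"] == "abc" else "no match"

def operation_b(row, a_result):
    # Depends on the value already computed by operation_a.
    if row["a"] in {"1", "2"} and a_result == "match":
        return "both"
    return "neither"

def operations(row):
    # Compute values in dependency order so later ones can reuse earlier ones.
    a = operation_a(row)
    b = operation_b(row, a)
    return pd.Series({"Operation A": a, "Operation B": b})

# Hypothetical sample chunk.
chunk = pd.DataFrame({"a": ["1", "3", "2"], "b": ["abc", "abc", "xyz"]})

# axis=1 passes each row to the function; because the function returns a
# Series, the result is a new DataFrame with one column per Series key.
new_columns = chunk.apply(operations, axis=1)

# apply leaves chunk untouched, so the new columns are joined back on the index.
enriched = chunk.join(new_columns)
```

As far as I know, pandarallel's parallel_apply mirrors apply, so the same axis=1 and join steps should carry over, but I have not verified that here. This still calls Python once per row, so it mainly fixes the missing columns rather than the speed.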