Efficient way to apply several functions to Pandas DataFrame returning several columns
I have a large data source that I am trying to enrich by creating some calculated columns.
The data source is close to 4 million rows, and I pull the data in chunks of 100,000:
for y, field in enumerate(fields):
    operation_start = time.time()
    print(f"Operation {y + 1}")
    chunk[field] = chunk.apply(operations[y], axis=1)
    print(f"Operation completed in {round(time.time() - operation_start, 2)} seconds")
The for loop runs the 9 functions I have defined, each one returning a value for a new column.
Is there a more efficient way to capture all the new fields at once, applying everything in a single pass instead of one by one? One important remark is that a couple of the functions need values created by other functions.
I have tried pandarallel, but it makes the process even slower.
EDIT:
I managed to write a single function and use pandarallel, following the info here: Apply multiple functions to multiple groupby columns. However, it does not append the new columns:
import pandas as pd

def Operations(row):
    d = {}
    d["Operation A"] = Operation_A(row)
    d["Operation B"] = Operation_B(row, d["Operation A"])
    d["Operation C"] = Operation_C(row)
    d["Operation D"] = Operation_D(row)
    d["Operation E"] = Operation_E(row)
    d["Operation F"] = Operation_F(row)
    d["Operation G"] = Operation_G(row, d["Operation F"], d["Operation D"], d["Operation B"])
    d["Operation H"] = Operation_H(row)
    d["Operation I"] = Operation_I(row, d["Operation H"])
    return pd.Series(d, index=["Operation A", "Operation B", "Operation C",
                               "Operation D", "Operation E", "Operation F",
                               "Operation G", "Operation H", "Operation I"])

chunk.parallel_apply(Operations, axis=1)
All the operations perform string comparisons and return strings. I cannot provide an example because these operation functions add up to more than 400 lines of code.
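As a side note on the pandas attempt above: apply (and parallel_apply) return a new object, so the new columns only appear if the result is assigned back to the chunk. A minimal sketch with plain apply, where Operation_A and Operation_B are hypothetical stand-ins for the real 400-line functions:

```python
import pandas as pd

def Operation_A(row):
    # hypothetical stand-in: a string comparison
    return "yes" if row["b"] == "abc" else "no"

def Operation_B(row, a_result):
    # depends on the value produced by Operation_A
    return a_result if row["a"] in {"1", "2"} else "no"

def Operations(row):
    d = {}
    d["Operation A"] = Operation_A(row)
    d["Operation B"] = Operation_B(row, d["Operation A"])
    return pd.Series(d)

chunk = pd.DataFrame({"a": ["1", "3"], "b": ["abc", "xyz"]})
result = chunk.apply(Operations, axis=1)  # axis=1 -> one call per row
chunk[result.columns] = result            # assign back, or the columns are lost
```

The same assignment works with parallel_apply in place of apply.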
You can process the file as a stream without building a DataFrame. There's a Table helper in the convtools library (table docs | github). Keep in mind it reads the file as-is and doesn't infer types; it works the same as if you read the file with csv.reader yourself.
from convtools.contrib.tables import Table
from convtools import conversion as c

def Operation_A(x):
    return x == "abc"

a_values = {"1", "2"}

def Operation_B(x, y):
    return x in a_values and y

table = Table.from_csv("tmp/in.csv", header=True)
# # or if from pd.DataFrame -- unnecessary RAM usage, because Table works with
# # the stream of data
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
# table = Table.from_rows(df.itertuples(index=False), header=list(df.columns))

table.update(
    **{
        "Operation A": c.call_func(Operation_A, c.col("b")),
        "Operation B": c.call_func(
            Operation_B, c.col("a"), c.col("Operation A")
        ),
    }
).into_csv("tmp/out.csv")
# .into_iter_rows(dict)
# .into_iter_rows(tuple, include_header=True)
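The same streaming idea can also be reproduced with nothing but the standard library, which is handy for checking the approach before adding a dependency. In this sketch the file names and the two operations are illustrative, and StringIO stands in for the real input and output files:

```python
import csv
import io

def Operation_A(x):
    return x == "abc"

a_values = {"1", "2"}

def Operation_B(x, y):
    return x in a_values and y

in_file = io.StringIO("a,b\n1,abc\n3,xyz\n")  # stands in for tmp/in.csv
out_file = io.StringIO()                      # stands in for tmp/out.csv

reader = csv.DictReader(in_file)
writer = csv.DictWriter(
    out_file, fieldnames=reader.fieldnames + ["Operation A", "Operation B"]
)
writer.writeheader()
for row in reader:  # only one row is held in memory at a time
    row["Operation A"] = Operation_A(row["b"])
    row["Operation B"] = Operation_B(row["a"], row["Operation A"])
    writer.writerow(row)
```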
I'd suggest using convtools conversions directly, though:
(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        **{
            "Operation A": c.col("b") == "abc",
            "Operation B": c.and_(
                c.col("a").in_(c.naive(a_values)), c.col("Operation A")
            ),
        }
    )
    .into_csv("tmp/out.csv")
)
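Since the operations are string comparisons, it may also be worth benchmarking fully vectorized pandas expressions, which avoid the per-row Python call of apply entirely. A sketch using the same two illustrative operations and hypothetical sample data:

```python
import pandas as pd

a_values = {"1", "2"}
chunk = pd.DataFrame({"a": ["1", "3"], "b": ["abc", "xyz"]})

# Vectorized equivalents of Operation_A / Operation_B: whole-column
# comparisons instead of one Python function call per row. Dependent
# columns just reference the already-computed ones.
chunk["Operation A"] = chunk["b"].eq("abc")
chunk["Operation B"] = chunk["a"].isin(a_values) & chunk["Operation A"]
```

Whether this pays off for the real 9 operations depends on how much of their logic can be expressed as column-wise string operations.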