Efficient way to apply several functions to Pandas DataFrame returning several columns

I have a large data source that I am trying to enrich by creating some calculated columns.

The data source has close to 4 million rows, and I am pulling the data in chunks of 100,000.

import time

# fields: names of the new columns; operations: the row-wise functions, in the same order
for y, (field, operation) in enumerate(zip(fields, operations), start=1):
    operation_start = time.time()
    print(f"Operation {y}")
    chunk[field] = chunk.apply(operation, axis=1)
    print(f"Operation completed in {time.time() - operation_start:.2f} seconds")

The for loop runs the 9 functions I have defined, each one returning a value for a new column.

Is there a more efficient way to compute all the new fields at once, instead of applying the functions one by one? One important remark: a couple of the functions need values created by other functions.

I have tried pandarallel, but it makes the process even slower.

EDIT:

I managed to write a single function and use pandarallel, following the info in "Apply multiple functions to multiple groupby columns"; however, it does not append the new columns.

def Operations(row):
    import pandas as pd
    d = {}
    d["Operation A "] = Operation_A(row)    
    d["Operation B"] = Operation_B(row, d["Operation A "])
    d["Operation C"] = Operation_C(row)
    d["Operation D"] = Operation_D(row)
    d["Operation E"] = Operation_E(row)
    d["Operation F"] = Operation_F(row)
    d["Operation G"] = Operation_G(row, d["Operation F"], d["Operation D"], d["Operation B"])
    d["Operation H"] = Operation_H(row)
    d["Operation I"] = Operation_I(row, d["Operation H"])

    return pd.Series(d, index= ["Operation A ", "Operation B", "Operation C", "Operation D", "Operation E", "Operation F", "Operation G", "Operation H", "Operation I"])

chunk.parallel_apply(Operations)

All the operations make string comparisons and return strings. I cannot provide an example because the functions behind these operations add up to more than 400 lines of code.
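A note on the edit above: with plain pandas, a row-wise apply whose function returns a Series produces a new DataFrame rather than modifying chunk in place, and without axis=1 the function is called once per column instead of once per row. The sketch below shows one way to attach the result, under stated assumptions: operation_a and operation_b are hypothetical stand-ins for the real operations, the sample data is invented, and plain apply is used. pandarallel's parallel_apply is meant to mirror apply, so the same pattern should carry over, but that is not verified here.

```python
import pandas as pd


def operation_a(row):
    # Hypothetical stand-in for the real Operation_A: a string comparison.
    return "match" if row["b"] == "abc" else "no match"


def operation_b(row, a_result):
    # Hypothetical stand-in for Operation_B, which depends on A's result.
    return f"{row['a']}:{a_result}"


def operations(row):
    # Compute dependent values in order and return them all at once.
    a = operation_a(row)
    b = operation_b(row, a)
    return pd.Series({"Operation A": a, "Operation B": b})


chunk = pd.DataFrame({"a": ["1", "2", "3"], "b": ["abc", "xyz", "abc"]})

# axis=1 passes each row to the function; the result is a new DataFrame
# with one column per key of the returned Series.
new_columns = chunk.apply(operations, axis=1)

# The result has to be joined back explicitly; apply does not alter chunk.
chunk = chunk.join(new_columns)
print(chunk)
```

This still calls Python once per row, so it mainly fixes the missing columns; it is not expected to be much faster than the original loop.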

You can process the file as a stream without building a DataFrame. There's a Table helper in the convtools library (table docs | github). Keep in mind that it reads the file as-is and doesn't infer types; it works the same as if you read the file with csv.reader yourself.

from convtools.contrib.tables import Table
from convtools import conversion as c

def Operation_A(x):
    return x == "abc"

a_values = {"1", "2"}
def Operation_B(x, y):
    return x in a_values and y

table = Table.from_csv("tmp/in.csv", header=True)

# # or if from pd.DataFrame -- unnecessary RAM usage, because Table works with
# # the stream of data
# import pandas as pd
# df = pd.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
# table = Table.from_rows(df.itertuples(index=False), header=list(df.columns))


table.update(
    **{
        "Operation A": c.call_func(Operation_A, c.col("b")),
        "Operation B": c.call_func(
            Operation_B, c.col("a"), c.col("Operation A")
        ),
    }
).into_csv("tmp/out.csv")
# .into_iter_rows(dict)
# .into_iter_rows(tuple, include_header=True)

I'd suggest using convtools conversions though:

(
    Table.from_csv("tmp/in.csv", header=True)
    .update(
        **{
            "Operation A": c.col("b") == "abc",
            "Operation B": c.and_(
                c.col("a").in_(c.naive(a_values)), c.col("Operation A")
            ),
        }
    )
    .into_csv("tmp/out.csv")
)
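For comparison, the remark that Table behaves like reading the file with csv.reader yourself can be illustrated with the standard library alone. This is a minimal sketch of the same streaming idea, not convtools' implementation; the lowercase function names, the enrich helper, and the in-memory sample data are assumptions made for illustration.

```python
import csv
import io

A_VALUES = {"1", "2"}


def operation_a(b):
    # Same comparison as Operation_A above; values are strings, no type inference.
    return b == "abc"


def operation_b(a, op_a):
    # Same logic as Operation_B above: depends on the result of operation_a.
    return a in A_VALUES and op_a


def enrich(src, dst):
    """Stream rows from src to dst, adding two derived columns per row."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + ["Operation A", "Operation B"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only one row is held in memory at a time.
        row["Operation A"] = operation_a(row["b"])
        row["Operation B"] = operation_b(row["a"], row["Operation A"])
        writer.writerow(row)


# In-memory stand-ins for tmp/in.csv and tmp/out.csv.
src = io.StringIO("a,b\n1,abc\n3,abc\n2,xyz\n")
dst = io.StringIO()
enrich(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be handles opened with newline="" as the csv module documentation recommends.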
