apply function takes a long time to run

Question

I'm working with a dataset of about 32,000,000 rows:

RangeIndex: 32084542 entries, 0 to 32084541

df.head()


        time                        device      kpi                                 value
0   2020-10-22 00:04:03+00:00       1-xxxx  chassis.routing-engine.0.cpu-idle   100
1   2020-10-22 00:04:06+00:00       2-yyyy  chassis.routing-engine.0.cpu-idle   97
2   2020-10-22 00:04:07+00:00       3-zzzz  chassis.routing-engine.0.cpu-idle   100
3   2020-10-22 00:04:10+00:00       4-dddd  chassis.routing-engine.0.cpu-idle   93
4   2020-10-22 00:04:10+00:00       5-rrrr  chassis.routing-engine.0.cpu-idle   99

My goal is to create one additional column named role, filled in according to a regex.

This is my approach:

def router_role(row):
    if row["device"].startswith("1"):
        row["role"] = '1'
    if row["device"].startswith("2"):
        row["role"] = '2'
    if row["device"].startswith("3"):
        row["role"] = '3'
    if row["device"].startswith("4"):
        row["role"] = '4'
    return row

Then:

df = df.apply(router_role,axis=1)

However, it's taking a lot of time. Any ideas about other possible approaches?

Thanks

Answer 1

apply with axis=1 is very slow and rarely the best choice. Try something like this instead:

df['role'] = df['device'].str[0]
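
One difference worth noting: the function in the question only assigns a role for prefixes 1 to 4, while plain indexing gives every row a role. A minimal sketch of keeping that restriction, assuming a small made-up sample frame in place of the real 32-million-row data:

```python
import pandas as pd

# Hypothetical sample mirroring the question's "device" column.
df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "3-zzzz", "4-dddd", "5-rrrr"]})

# First character of every device name, computed on the whole column at once.
first_char = df["device"].str[0]

# Keep only the prefixes the original function handled; other rows become missing.
df["role"] = first_char.where(first_char.isin(["1", "2", "3", "4"]))

print(df)
```

If every device is known to start with one of those four digits, the `where` step is unnecessary and the one-liner above is enough.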

Answer 2

Using apply with axis=1 is notoriously slow: it calls a Python function once per row instead of operating on whole columns, and it doesn't take advantage of multiple cores (see, for example, pandas multiprocessing apply). Instead, use built-ins:

>>> import pandas as pd
>>> df = pd.DataFrame([["some-data", "1-xxxx"], ["more-data", "1-yyyy"], ["other-data", "2-xxxx"]])
>>> df
            0       1
0   some-data  1-xxxx
1   more-data  1-yyyy
2  other-data  2-xxxx
>>> df["Derived Column"] = df[1].str.split("-", expand=True)[0]
>>> df
            0       1 Derived Column
0   some-data  1-xxxx              1
1   more-data  1-yyyy              1
2  other-data  2-xxxx              2

Here, I'm assuming that you might have multiple digits before the hyphen (e.g. 42-aaaa), hence the extra work to split the column and take the first value of the split. If you're just getting the first character, do what @teepee did in their answer and index into the string.
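
Since the question mentions a regex, the same multi-digit case could also be handled with `Series.str.extract`. This is only a sketch: the column name `device` comes from the question, while the sample values and the pattern are assumptions about what the prefixes look like.

```python
import pandas as pd

# Hypothetical sample; "no-digits" shows what happens when the pattern does not match.
df = pd.DataFrame({"device": ["1-xxxx", "42-aaaa", "2-yyyy", "no-digits"]})

# Capture one or more leading digits that are followed by a hyphen.
# expand=False returns a Series rather than a one-column DataFrame.
df["role"] = df["device"].str.extract(r"^(\d+)-", expand=False)

print(df)
```

Rows that do not match the pattern end up with a missing value rather than raising an error, which makes malformed device names easy to spot afterwards.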

Answer 3

You can trivially convert your code to use np.vectorize().

See here: Performance of Pandas apply vs np.vectorize to create new column from existing columns
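
A rough sketch of what that conversion might look like, assuming the function is rewritten to take a single device string rather than a whole row (that rewrite and the sample frame are illustrative, not from the question):

```python
import numpy as np
import pandas as pd


def router_role(device):
    """Return the role for one device name, or None if no prefix matches."""
    for prefix in ("1", "2", "3", "4"):
        if device.startswith(prefix):
            return prefix
    return None


# Hypothetical sample mirroring the question's "device" column.
df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "3-zzzz", "4-dddd", "5-rrrr"]})

# otypes=[object] stops NumPy from inferring the output type from the first result.
role_of = np.vectorize(router_role, otypes=[object])
df["role"] = role_of(df["device"].to_numpy())

print(df)
```

Keep in mind that the NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop. It avoids building a Series for every row, so it tends to beat `apply(axis=1)`, but the string accessor methods are likely to be faster still for a task this simple.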
