apply function takes a long time to run
I'm working with a dataset of about 32,000,000 rows:
RangeIndex: 32084542 entries, 0 to 32084541
df.head()
time device kpi value
0 2020-10-22 00:04:03+00:00 1-xxxx chassis.routing-engine.0.cpu-idle 100
1 2020-10-22 00:04:06+00:00 2-yyyy chassis.routing-engine.0.cpu-idle 97
2 2020-10-22 00:04:07+00:00 3-zzzz chassis.routing-engine.0.cpu-idle 100
3 2020-10-22 00:04:10+00:00 4-dddd chassis.routing-engine.0.cpu-idle 93
4 2020-10-22 00:04:10+00:00 5-rrrr chassis.routing-engine.0.cpu-idle 99
My goal is to create one additional column named role, filled according to a regex.
This is my approach:
def router_role(row):
    if row["device"].startswith("1"):
        row["role"] = '1'
    if row["device"].startswith("2"):
        row["role"] = '2'
    if row["device"].startswith("3"):
        row["role"] = '3'
    if row["device"].startswith("4"):
        row["role"] = '4'
    return row
then,
df = df.apply(router_role,axis=1)
However, it's taking a lot of time. Any ideas about other possible approaches?
Thanks
apply with axis=1 is very slow, and there is usually a better option. Try something like this instead:
df['role'] = df['device'].str[0]
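Note that indexing with .str[0] assigns a role to every row, including devices such as 5-rrrr that the function in the question leaves untouched. If that matters, the following is a minimal sketch, assuming you only want the prefixes 1 to 4 and are happy to leave the other rows missing (the small frame here is made up for illustration):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's "device" column.
df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "3-zzzz", "4-dddd", "5-rrrr"]})

# First character of every device name, computed in one vectorized call.
first = df["device"].str[0]

# Keep the character only where it is one of the expected prefixes;
# every other row becomes missing (NaN).
df["role"] = first.where(first.isin(["1", "2", "3", "4"]))
```

Both steps work on the whole column at once, so no Python function is called per row.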
Using apply is notoriously slow because it calls a Python function once per row and doesn't take advantage of multithreading (see, for example, pandas multiprocessing apply). Instead, use built-ins:
>>> import pandas as pd
>>> df = pd.DataFrame([["some-data", "1-xxxx"], ["more-data", "1-yyyy"], ["other-data", "2-xxxx"]])
>>> df
0 1
0 some-data 1-xxxx
1 more-data 1-yyyy
2 other-data 2-xxxx
>>> df["Derived Column"] = df[1].str.split("-", expand=True)[0]
>>> df
0 1 Derived Column
0 some-data 1-xxxx 1
1 more-data 1-yyyy 1
2 other-data 2-xxxx 2
Here, I'm assuming that you might have multiple digits before the hyphen (e.g. 42-aaaa), hence the extra work to split the column and take the first value of the split. If you only need the first character, do what @teepee did in their answer and simply index into the string.
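Since the question mentions a regex, the same result can probably be had with str.extract instead of splitting. This is only a sketch, assuming the role is the run of digits before the first hyphen; the sample values are invented:

```python
import pandas as pd

# Hypothetical sample data, including a multi-digit prefix.
df = pd.DataFrame({"device": ["1-xxxx", "42-aaaa", "2-yyyy"]})

# Capture the leading digits that precede the first hyphen.
# expand=False returns a Series rather than a one-column DataFrame.
df["role"] = df["device"].str.extract(r"^(\d+)-", expand=False)
```

Rows that do not match the pattern end up as missing values rather than raising an error.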
You can trivially convert your code to use np.vectorize().
See here: Performance of Pandas apply vs np.vectorize to create new column from existing columns
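A rough sketch of what that conversion could look like, assuming the function is rewritten to take a single device string instead of a whole row. The empty-string fallback for unmatched devices is my own assumption, not something from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question's "device" column.
df = pd.DataFrame({"device": ["1-xxxx", "2-yyyy", "5-rrrr"]})

def router_role(device):
    # Scalar version of the question's function: one string in, one value out.
    for prefix in ("1", "2", "3", "4"):
        if device.startswith(prefix):
            return prefix
    return ""  # assumption: empty string when no prefix matches

# otypes=[object] fixes the output type up front instead of inferring it
# from the first result.
vectorized_role = np.vectorize(router_role, otypes=[object])
df["role"] = vectorized_role(df["device"].to_numpy())
```

Keep in mind that the NumPy documentation describes np.vectorize as a convenience that is essentially a for loop, so it avoids the per-row overhead of apply but is unlikely to beat the .str methods.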