Iterating over a large pandas DataFrame is too slow

Question

I have a large dataframe in which I would like to create a new column based on existing columns.

import numpy as np
import pandas as pd

test = pd.DataFrame({'Test1':["100","4242","3454","2","54"]})
test['Test2'] = ""
for i in range(0,len(test)):
    if len(test.iloc[i,0]) == 4:
        test.iloc[i,-1] = test.iloc[i,0][0:1]
    elif len(test.iloc[i,0]) == 3:
        test.iloc[i,-1] = test.iloc[i,0][0]
    elif len(test.iloc[i,0]) < 3:
        test.iloc[i,-1] = 0
    else:
        test.iloc[i,-1] = np.nan

This works for a small dataframe, but on a large data set (10+ million rows) it takes far too long. How can I make this process faster?

Answer 1

Use the str.len method to find the lengths of the strings in the 'Test1' column, then use np.select with those lengths to assign either the relevant part of each string in 'Test1' or a default value to 'Test2'.

import numpy as np
lengths = test['Test1'].str.len()
test['Test2'] = np.select([lengths == 4, lengths == 3, lengths < 3], [test['Test1'].str[0:1], test['Test1'].str[0], 0], np.nan)

Output:

  Test1 Test2
0   100     1
1  4242     4
2  3454     3
3     2     0
4    54     0

Note that [0:1] returns only the first character (the same as [0]), so maybe you meant [0:2] (or something else); otherwise you can drop one condition there.
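
If [0:1] really was intended, the length-4 and length-3 branches give the same result and can be merged. The following is only a minimal sketch of that simplification, assuming every value in 'Test1' is a string; the name first_char is introduced here for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'Test1': ["100", "4242", "3454", "2", "54"]})

lengths = test['Test1'].str.len()
first_char = test['Test1'].str[0]  # hypothetical helper name

# Length 3 or 4: keep the first character. Length below 3: use 0.
# Anything longer falls through to NaN, as in the original loop.
test['Test2'] = np.select([lengths.between(3, 4), lengths < 3],
                          [first_char, 0], np.nan)
print(test)
```

Series.between is inclusive on both ends by default, so lengths.between(3, 4) covers exactly the two merged cases.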

Answer 2

So, basically you want to extract the first character of the string if it is at least 3 characters long. (NB: for a string, [0] and [0:1] yield exactly the same thing.)

Just use a regex with a lookahead for that.

test['Test2'] = test['Test1'].str.extract('^(.)(?=..)').fillna(0)

Output:

  Test1 Test2
0   100     1
1  4242     4
2  3454     3
3     2     0
4    54     0

How the regex works:

^       # match beginning of string
(.)     # capture one character
(?=..)  # only if it is followed by at least two characters
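
To see the pattern's behaviour outside pandas, it can be tried on individual strings with Python's re module. This is just an illustrative sketch; the helper first_char_if_long is a hypothetical name, not something from the answer.

```python
import re

# Same pattern as above: the first character, only if two more follow it.
pattern = re.compile(r'^(.)(?=..)')

def first_char_if_long(s):
    """Return the first character of s if it has at least 3 characters, else None."""
    m = pattern.search(s)
    return m.group(1) if m else None

for s in ["100", "4242", "2", "54", "12345"]:
    print(s, first_char_if_long(s))
```

One difference from the loop in the question: a string longer than 4 characters, such as "12345", still matches here and yields its first character, whereas the loop would assign NaN.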
