Is there a faster way than .apply and str.contains to search every column of a DataFrame for a string?

Question

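Basically, I have a bunch of DataFrames with roughly 100 columns and 500-3000 rows each, filled with various string values. I want to search the whole DataFrame for a string, say 'Airbag', and drop every row that does not contain it. I was able to do this with the following code: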

df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]

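This does exactly what I want, but it is far too slow. I tried to find a way to do it with vectorization or a list comprehension, but I couldn't get it to work and couldn't find any example code online. So my question is: is it possible to speed this up?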

Example DataFrame:

df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})

Answer 1

Let's start with this DataFrame, which has random strings and numbers in COLUMN:

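import numpy as np
import pandas as pd

np.random.seed(0)
# 100 random 5-letter strings built from the letters A-D
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
# 10 integers mixed in as non-string "junk"
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})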

>>> df.head()
  COLUMN
0  BBCAA
1      6
2  ADDDA
3  DCABB
4  ADABC

You can simply apply pandas.Series.str.contains. You need fillna to account for the non-string elements, which produce NaN:

>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
    COLUMN
4    ADABC
31   BDABC
40   BABCB
88   AABCA
101  ABCBB
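
As a small variation, str.contains also accepts an na argument, so the fillna call can be folded into it. A minimal sketch on a hypothetical three-element Series (the name s and its values are illustrative, not from the answer above):

```python
import pandas as pd

# Hypothetical mixed column: two strings and one integer.
s = pd.Series(['ADABC', 6, 'BBCAA'])

# Non-string elements yield NaN by default; na=False maps them to False,
# so the result can be used directly as a boolean mask.
mask = s.str.contains('ABC', regex=False, na=False)
matches = s[mask]
print(matches)
```

Passing regex=False also skips regular-expression handling, which is all a plain substring search needs.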

Testing all columns:
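
A minimal sketch of one way to extend the single-column test to every column, using the example DataFrame from the question. It assumes every cell can be cast to str; the names cell_hits and filtered are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
    'col2': ['String1', 'String2', 'String3'],
    'col3': ['Tires', 'Wheel_Airbag', 'Antenna'],
})

# Apply str.contains column by column (the default axis=0), which gives a
# boolean DataFrame, then keep the rows where any column matched.
cell_hits = df.apply(lambda col: col.astype(str).str.contains('Airbag', regex=False))
filtered = df[cell_hits.any(axis=1)]
print(filtered)
```

Compared with the row-wise apply in the question, this calls str.contains once per column rather than once per row, which tends to matter when there are far more rows than columns.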

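Here is an alternative using a good old custom function. One might think it should be slower than apply/transform, but it is actually faster when you have many columns and the search term occurs reasonably often (tested on the example DataFrame, on a 3x3 DataFrame with no match, and on 3x3000 DataFrames with and without matches):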

def has_match(series):
    for s in series:
        if 'Airbag' in s:
            return True
    return False


df[df.apply(has_match, axis=1)]
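
Since the question also asked about list comprehensions, the same short-circuiting idea can be sketched without apply. This is an illustrative variant rather than part of the answer above; it wraps each cell in str() on the assumption that some cells may not be strings:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
    'col2': ['String1', 'String2', 'String3'],
    'col3': ['Tires', 'Wheel_Airbag', 'Antenna'],
})

# One boolean per row: True if any cell in that row contains the substring.
# any() stops at the first hit, and str(v) guards against non-string cells.
mask = [any('Airbag' in str(v) for v in row) for row in df.to_numpy()]
filtered = df[mask]
print(filtered)
```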

Answer 2

Update (exact match)

Since it looks like you actually want exact matches, use eq() instead of str.contains(). Then use boolean indexing with loc:

df.loc[df.eq('Airbag').any(axis=1)]
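
Note that eq() compares whole cell values. A small sketch on the question's example DataFrame shows the difference: no cell equals 'Airbag' exactly, so that filter comes back empty, while a full value such as 'Tires' does match. The variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
    'col2': ['String1', 'String2', 'String3'],
    'col3': ['Tires', 'Wheel_Airbag', 'Antenna'],
})

# 'Airbag' only occurs as a substring here, so the exact-match filter is empty.
exact = df.loc[df.eq('Airbag').any(axis=1)]

# A complete cell value matches as expected.
tires = df.loc[df.eq('Tires').any(axis=1)]
print(len(exact), tires.index.tolist())
```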

Original (substring)

Use applymap() to test each string, then any(axis=1) to turn the result into a row mask:

df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]

#           col1     col2          col3
# 0   Airbag_101  String1         Tires
# 1  Distance_xy  String2  Wheel_Airbag
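
On pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(). A hedged sketch that works with either version is shown below; the elementwise helper name is illustrative, and str(x) is added on the assumption that some cells may not be strings:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
    'col2': ['String1', 'String2', 'String3'],
    'col3': ['Tires', 'Wheel_Airbag', 'Antenna'],
})

# Prefer DataFrame.map where it exists (pandas >= 2.1), else fall back to applymap.
elementwise = df.map if hasattr(df, 'map') else df.applymap

mask = elementwise(lambda x: 'Airbag' in str(x)).any(axis=1)
filtered = df[mask]
print(filtered)
```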

As mozway said, "best" depends on the data. Here are some timing plots for reference.

[Plot: time vs. number of rows (fixed at 3 columns)]
[Plot: time vs. number of columns (fixed at 3,000 rows)]
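
To get numbers for your own data, a rough timing harness along these lines may help. The frame size, the search term 'ABC', and the function names are all assumptions chosen for illustration, and the results will vary with the data and the pandas version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test frame: 300 rows x 10 columns of random 5-letter strings.
letters = rng.choice(list('ABCD'), size=(300, 10, 5))
df = pd.DataFrame([[''.join(cell) for cell in row] for row in letters])


def row_apply(frame):
    # Row-wise apply, as in the question.
    return frame.apply(
        lambda row: row.astype(str).str.contains('ABC', regex=False).any(), axis=1)


def column_apply(frame):
    # One str.contains call per column, then reduce across columns.
    return frame.apply(lambda col: col.str.contains('ABC', regex=False)).any(axis=1)


def list_comp(frame):
    # Plain Python loop over the underlying array.
    return pd.Series([any('ABC' in v for v in row) for row in frame.to_numpy()],
                     index=frame.index)


timings = {f.__name__: timeit.timeit(lambda f=f: f(df), number=3)
           for f in (row_apply, column_apply, list_comp)}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} s')
```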

Answer 3

OK, I was able to speed it up with the help of NumPy arrays, but thanks for the help :D

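import numpy as np

master_index = []
for column in df.columns:
    # Compare the raw NumPy values of each column against the exact string.
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)
# master_index[1][0] holds the matching row positions for the second column only.
print(df.iloc[master_index[1][0]])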
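
The loop above tests for exact equality and prints the hits for one column. For a substring search over all columns at once, a NumPy-only sketch could use np.char.find. It assumes the whole frame can be cast to a string array; the names values, mask and filtered are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
    'col2': ['String1', 'String2', 'String3'],
    'col3': ['Tires', 'Wheel_Airbag', 'Antenna'],
})

# Convert the whole frame to a fixed-width unicode array in one go.
values = df.to_numpy().astype(str)

# np.char.find returns the index of the first occurrence, or -1 if absent.
mask = (np.char.find(values, 'Airbag') >= 0).any(axis=1)
filtered = df[mask]
print(filtered)
```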

Solution 1
1 2021-07-27 13:59:57

Solution 2
1 2021-07-28 08:17:05

Solution 3
0 2021-07-28 09:53:51
