Fastest way to remove rows whose values are substrings of other values in the same column of a pandas DataFrame

Question

I am trying to write efficient code that removes the rows of a pandas DataFrame whose value in a specific column is a substring of at least one other value in the same column. For example, consider column B in the following input DataFrame:

|   | A  | B          |
|---|----|------------|
| 0 | 22 | ab         |
| 1 | 33 | abc        |
| 2 | 44 | abcd       |
| 3 | 55 | a          |
| 4 | 66 | john       |
| 5 | 77 | john Doe   |
| 6 | 88 | jo         |
| 7 | 99 | john hi Doe|

Output DataFrame:

|   | A  | B          |
|---|----|------------|
| 2 | 44 | abcd       |
| 5 | 77 | john Doe   |
| 7 | 99 | john hi Doe|

Rows 0, 1, and 3 have been removed because their values in column B (ab, abc, and a) are substrings of another value in that column (such as abcd). The same applies to rows 4 and 6, whose values (john and jo) are substrings of john Doe and john hi Doe.
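
To make the example reproducible, the input frame can be rebuilt roughly as follows. This is only a sketch: the variable name `df` is an assumption, and the two `in` checks illustrate the substring rule, including why row 5 stays.

```python
import pandas as pd

# Rebuild the example input from the table above.
df = pd.DataFrame({
    "A": [22, 33, 44, 55, 66, 77, 88, 99],
    "B": ["ab", "abc", "abcd", "a", "john", "john Doe", "jo", "john hi Doe"],
})

# Python's `in` operator performs the substring test on str values.
print("abc" in "abcd")              # True: row 1 is contained in row 2
print("john Doe" in "john hi Doe")  # False: not a contiguous substring, so row 5 is kept
```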

Answer 1

You could use a list comprehension to check whether each row's string is contained in another row of the DataFrame:

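# m is True for rows whose B value is a substring of a different B value
m = df['B'].apply(lambda x: any([x in y for y in df['B'] if x != y]))
# keep only the rows that are not substrings of another value
df = df[~m]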
df
Out[1]: 
    A            B
2  44         abcd
5  77     john Doe
7  99  john hi Doe
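
The lambda above compares every value with every other value. As a sketch of one possible speed-up (not part of the original answer), note that a value can only be a proper substring of a strictly longer value, and that containment is transitive. Sorting the unique values longest first therefore lets each value be checked only against the values already kept. The helper name `drop_substring_rows` is hypothetical, and the sketch assumes the column holds only strings (no NaN). Whether it is actually faster depends on the data, since the worst case is still quadratic.

```python
import pandas as pd

def drop_substring_rows(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Drop rows whose value in `col` is a proper substring of another value in `col`."""
    # Longest first: a string can only be a proper substring of a longer string.
    candidates = sorted(df[col].unique(), key=len, reverse=True)
    kept = []
    for value in candidates:
        # Containment is transitive, so comparing against the values
        # kept so far is enough.
        if not any(value in longer for longer in kept):
            kept.append(value)
    return df[df[col].isin(kept)]

df = pd.DataFrame({
    "A": [22, 33, 44, 55, 66, 77, 88, 99],
    "B": ["ab", "abc", "abcd", "a", "john", "john Doe", "jo", "john hi Doe"],
})
result = drop_substring_rows(df, "B")
print(result)  # rows 2, 5 and 7 remain
```

Working on unique values also means duplicated strings are compared only once; rows that share a kept value are all retained, which matches the `x != y` condition in the answer.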

Answer 1
1 Accepted 2020-11-01 20:34:18