Fastest way to remove rows that contain substrings of values in the same column of a pandas dataframe
I am trying to write efficient code that removes the rows of a pandas DataFrame whose value in a specific column is a substring of at least one other value in the same column. For example, consider column B in the following input DataFrame:
| | A | B |
|---|----|------------|
| 0 | 22 | ab |
| 1 | 33 | abc |
| 2 | 44 | abcd |
| 3 | 55 | a |
| 4 | 66 | john |
| 5 | 77 | john Doe |
| 6 | 88 | jo |
| 7 | 99 | john hi Doe|
Output DataFrame:
| | A | B |
|---|----|------------|
| 2 | 44 | abcd |
| 5 | 77 | john Doe |
| 7 | 99 | john hi Doe|
Rows 0, 1, and 3 have been removed because their values in column B (ab, abc, and a) are substrings of another value in that column (i.e. abcd). The same applies to rows 4 and 6, whose values (john and jo) are substrings of john Doe.
You could use a generator expression inside apply to check whether each row's string is contained in a different string from the same column:
# True for each row whose B value is a substring of a different B value
m = df['B'].apply(lambda x: any(x != y and x in y for y in df['B']))
df = df[~m]
df
Out[1]:
A B
2 44 abcd
5 77 john Doe
7 99 john hi Doe
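
Since the question asks about speed, here is a rough sketch of one way to cut down the number of comparisons. It is not benchmarked, and it assumes every value in the column is a string. The idea is that a value can only be a proper substring of a longer value, so the unique values are processed longest first and each one is compared only against the values already kept. The helper name drop_substring_rows is made up for this example.

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "A": [22, 33, 44, 55, 66, 77, 88, 99],
    "B": ["ab", "abc", "abcd", "a", "john", "john Doe", "jo", "john hi Doe"],
})


def drop_substring_rows(frame, column):
    """Drop rows whose `column` value is a proper substring of another value.

    Hypothetical helper; assumes all values in `column` are strings.
    """
    # Deduplicate, then visit the longest strings first.
    unique_values = sorted(frame[column].unique(), key=len, reverse=True)
    kept = []  # values not contained in any value kept so far
    for value in unique_values:
        # Checking against kept values is enough: a dropped value is itself
        # contained in some kept value, so containment carries over.
        if not any(value in longer for longer in kept):
            kept.append(value)
    return frame[frame[column].isin(kept)]


result = drop_substring_rows(df, "B")
print(result)
```

On the example data this keeps rows 2, 5, and 7, the same as the apply version above. Like that version, it keeps exact duplicates, because a value is only dropped when it sits inside a different value. The worst case is still quadratic in the number of unique values, so it is worth timing both on your real data.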