Using str.contains across multiple rows
I have a dataframe with five columns that looks like this:
index col1 col2 col3 col4 col5
1 word1 None word1 None None
2 None word1 word2 None None
3 None None None word2 word2
4 word1 word2 None None None
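To reproduce the table above, the frame can be built along these lines. This is only a sketch: it assumes the None cells are real missing values (Python None) rather than the string 'None', and that index is the DataFrame index rather than an ordinary column.

```python
import pandas as pd

# Assumption: "None" in the table means a missing value, not the text 'None'.
df = pd.DataFrame(
    {
        'col1': ['word1', None, None, 'word1'],
        'col2': [None, 'word1', None, 'word2'],
        'col3': ['word1', 'word2', None, None],
        'col4': [None, None, 'word2', None],
        'col5': [None, None, 'word2', None],
    },
    index=pd.Index([1, 2, 3, 4], name='index'),
)
print(df)
```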
I'm trying to find all rows that contain both strings in any combination of columns (in this case, rows 2 and 4). Normally I would use the str.contains method to filter by string:
df[df['col1'].str.contains('word1|word2', case=False, na=False)]
But this only gives me A) results for one column, and B) a True even if the column has just one of the words. I intuitively tried
df[df[['col1', 'col2', 'col3', 'col4', 'col5']].str.contains('word1' & 'word2', case=False)]
but .str.contains doesn't work on DataFrame objects.
Is there a way to do this without resorting to a for loop?
Using any:
s1 = df.apply(lambda x: x.str.contains(r'word1', na=False)).any(axis=1)
s2 = df.apply(lambda x: x.str.contains(r'word2', na=False)).any(axis=1)
df[s1 & s2]
Out[452]:
col1 col2 col3 col4 col5
index
2 None word1 word2 None None
4 word1 word2 None None None
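The same idea extends to any number of words. The sketch below is self-contained. The helper name rows_with_all_words is made up for illustration, and passing na=False (so that missing cells count as "no match") is my assumption about what is wanted. As in the snippet above, each word is interpreted as a regular expression, and every column is assumed to hold strings or missing values.

```python
import numpy as np
import pandas as pd

def rows_with_all_words(df, words):
    """Return the rows of df in which every word occurs in at least one column."""
    masks = [
        # One boolean Series per word: True where any column contains it.
        df.apply(lambda col: col.str.contains(w, case=False, na=False)).any(axis=1)
        for w in words
    ]
    # Combine the per-word masks with a logical AND.
    return df[np.logical_and.reduce(masks)]

df = pd.DataFrame(
    {
        'col1': ['word1', None, None, 'word1'],
        'col2': [None, 'word1', None, 'word2'],
        'col3': ['word1', 'word2', None, None],
        'col4': [None, None, 'word2', None],
        'col5': [None, None, 'word2', None],
    },
    index=[1, 2, 3, 4],
)
result = rows_with_all_words(df, ['word1', 'word2'])
print(result.index.tolist())  # [2, 4]
```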
If you are looking for only two words, you could use np.isin and any to check whether each row of the underlying NumPy array contains both elements, using a separate isin for each word:
df[np.isin(df.values, 'word1').any(1) & np.isin(df.values, 'word2').any(1)]
index col1 col2 col3 col4 col5
1 2 None word1 word2 None None
3 4 word1 word2 None None None
Or, following the same logic but borrowing a bit from @coldspeed's answer:
words = ['word1','word2']
df[np.logical_and.reduce([np.isin(df.values, w).any(1) for w in words])]
index col1 col2 col3 col4 col5
1 2 None word1 word2 None None
3 4 word1 word2 None None None
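One caveat worth keeping in mind: np.isin compares whole cell values for equality, whereas str.contains looks for substrings (and can ignore case). The made-up frame below illustrates the difference. The data are hypothetical and not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row holds strings that merely contain the words.
df = pd.DataFrame({
    'col1': ['word1', 'has word1 inside'],
    'col2': ['word2', 'WORD2'],
})
words = ['word1', 'word2']

# Exact, case-sensitive cell matching via np.isin.
exact = np.logical_and.reduce(
    [np.isin(df.values, w).any(axis=1) for w in words]
)

# Case-insensitive substring matching via str.contains.
substring = np.logical_and.reduce([
    df.apply(lambda col: col.str.contains(w, case=False, na=False)).any(axis=1)
    for w in words
])

print(exact.tolist())      # [True, False]
print(substring.tolist())  # [True, True]
```

So the np.isin variants fit when each cell holds exactly one of the words, as in the question's table.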
Assuming you want only the rows with both word1 and word2 somewhere, you will need to stack, groupby the index, and search inside an apply.
words = ['word1', 'word2']
# Assign the filtered frame back so that print(df) shows the result.
df = df[df.stack().groupby(level=0).apply(
    lambda x: all(x.str.contains(w, case=False).any() for w in words))]
print(df)
col1 col2 col3 col4 col5
index
2 None word1 word2 None None # word1=>col2, word2=>col3
4 word1 word2 None None None # word1=>col1, word2=>col2
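One thing to watch with the stack approach: depending on the pandas version, stack may drop missing cells, so a row made up entirely of missing values can vanish from the grouped result, and the boolean mask then no longer lines up with df. A possible way around this is to reindex each per-word result back onto df.index. The sketch below shows this with an extra all-missing row (index 5) that I added for illustration. The loop runs over the words, not over the rows.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'col1': ['word1', None, None, 'word1', None],
        'col2': [None, 'word1', None, 'word2', None],
        'col3': ['word1', 'word2', None, None, None],
        'col4': [None, None, 'word2', None, None],
        'col5': [None, None, 'word2', None, None],
    },
    index=[1, 2, 3, 4, 5],  # row 5 is entirely missing (added for illustration)
)
words = ['word1', 'word2']

stacked = df.stack()  # long format: one entry per (row, column) cell
mask = pd.Series(True, index=df.index)
for w in words:
    # True for each row in which at least one cell contains w.
    hit = stacked.str.contains(w, case=False, na=False).groupby(level=0).any()
    # Rows absent from the stacked data are treated as "no match".
    mask &= hit.reindex(df.index, fill_value=False).astype(bool)

result = df[mask]
print(result.index.tolist())  # [2, 4]
```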
Another alternative would be using np.logical_and.reduce:
v = df.stack()
# For each word, flag every cell of a row if any cell in that row contains the word.
m = pd.Series(
    np.logical_and.reduce([
        v.str.contains(w, case=False).groupby(level=0).transform('any')
        for w in words]),
    index=v.index)
df = df[m.unstack().all(axis=1)]
print(df)
col1 col2 col3 col4 col5
index
2 None word1 word2 None None
4 word1 word2 None None None