Using str.contains across multiple columns

Question

I have a dataframe with five columns that looks like this:

index  col1   col2  col3   col4   col5
1      word1  None  word1  None   None
2      None   word1 word2  None   None
3      None   None  None   word2  word2
4      word1  word2 None   None   None

I'm trying to find all rows that contain both strings in any combination of columns (in this case, rows 2 and 4). Normally I would use the str.contains method to filter by string:

df[df['col1'].str.contains('word1|word2', case=False, na=False)]

But this only gives me (a) results for one column, and (b) True even if the column contains only one of the words. I intuitively tried df[df[['col1', 'col2', 'col3', 'col4', 'col5']].str.contains('word1' & 'word2', case=False)], but .str.contains doesn't work on DataFrame objects.

Is there a way to do this without resorting to a for loop?

Answer 1

Using any:

s1 = df.apply(lambda x: x.str.contains(r'word1', na=False)).any(axis=1)
s2 = df.apply(lambda x: x.str.contains(r'word2', na=False)).any(axis=1)
df[s1 & s2]
Out[452]: 
        col1   col2   col3  col4  col5
index                                 
2       None  word1  word2  None  None
4      word1  word2   None  None  None
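
For anyone who wants to run this end to end, here is a minimal sketch of the same idea generalized to a list of words. The frame is rebuilt from the table in the question (assuming the empty cells are real None values and that "index" is the index rather than a column), and rows_with_all is a hypothetical helper name, not a pandas function. It loops over the words, not over the rows.

```python
import pandas as pd

# Example frame reconstructed from the question; None marks an empty cell.
df = pd.DataFrame(
    {
        "col1": ["word1", None, None, "word1"],
        "col2": [None, "word1", None, "word2"],
        "col3": ["word1", "word2", None, None],
        "col4": [None, None, "word2", None],
        "col5": [None, None, "word2", None],
    },
    index=pd.Index([1, 2, 3, 4], name="index"),
)


def rows_with_all(frame, words):
    """Return the rows in which every word appears in at least one column."""
    mask = pd.Series(True, index=frame.index)
    for w in words:
        # Per-cell substring test; na=False makes empty cells count as "no match".
        hits = frame.apply(lambda col: col.str.contains(w, case=False, na=False))
        # A row qualifies for this word if any of its cells matched.
        mask &= hits.any(axis=1)
    return frame[mask]


result = rows_with_all(df, ["word1", "word2"])
print(result)
```

Passing na=False keeps the mask strictly boolean, which avoids problems when filtering on cells that are missing.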

Answer 2

If there are only two words you are looking for, you could use np.isin and any to check whether each row in the underlying numpy array contains both elements, using a separate isin for each word:

df[np.isin(df.values, 'word1').any(1) & np.isin(df.values, 'word2').any(1)]

   index   col1   col2   col3  col4  col5
1      2   None  word1  word2  None  None
3      4  word1  word2   None  None  None

Or, following the same logic but borrowing a bit from @coldspeed's answer:

words = ['word1','word2']

df[np.logical_and.reduce([np.isin(df.values, w).any(1) for w in words])]

   index   col1   col2   col3  col4  col5
1      2   None  word1  word2  None  None
3      4  word1  word2   None  None  None
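
One thing worth keeping in mind with this approach: np.isin tests whole cells for exact, case-sensitive equality, whereas str.contains does substring matching. The sketch below contrasts the two on a small made-up frame (the data here is an assumption for illustration, not taken from the question).

```python
import numpy as np
import pandas as pd

# Made-up data: one cell holds both words, one differs only in case.
df = pd.DataFrame(
    {
        "col1": ["word1", "word1 and word2", None],
        "col2": ["word2", None, "WORD1"],
    }
)
words = ["word1", "word2"]

# Exact, case-sensitive cell equality (the np.isin approach).
exact = np.logical_and.reduce(
    [np.isin(df.values, w).any(axis=1) for w in words]
)

# Substring, case-insensitive matching (the str.contains approach).
substr = np.logical_and.reduce(
    [
        df.apply(lambda col: col.str.contains(w, case=False, na=False)).any(axis=1)
        for w in words
    ]
)

print(exact)   # only the first row has both words as whole cells
print(substr)  # the second row also matches, via the combined cell
```

If the cells always hold exactly one word, as in the question, both masks agree and the np.isin version skips the per-cell string machinery.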

Answer 3

Assuming you want only the rows with both word1 and word2 somewhere, you will need to stack, groupby the index, and search inside an apply.

words = ['word1', 'word2']
# One boolean per row: True only if every word is found in some cell of that row
mask = df.stack().groupby(level=0).apply(
    lambda x: all(x.str.contains(w, case=False).any() for w in words))

print(df[mask])
        col1   col2   col3  col4  col5
index                                 
2       None  word1  word2  None  None  # word1=>col2, word2=>col3
4      word1  word2   None  None  None  # word1=>col1, word2=>col2

Another alternative would be using np.logical_and.reduce:

v = df.stack()
m = pd.Series(
        np.logical_and.reduce([
           v.str.contains(w, case=False).groupby(level=0).transform('any') 
           for w in words]),
        index=v.index)
df = df[m.unstack().all(axis=1)]

print(df)
        col1   col2   col3  col4  col5
index                                 
2       None  word1  word2  None  None
4      word1  word2   None  None  None
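
A caveat on the stack-based approach in this answer: depending on the pandas version, stack() may drop missing cells, so a row with no values at all can be absent from the mask and no longer line up with df. The sketch below is one way to guard against that, under the assumption that such rows should simply count as non-matching. The data is made up for illustration.

```python
import pandas as pd

# Made-up data: the row labelled 2 has no values at all.
df = pd.DataFrame(
    {
        "col1": ["word1", None, None],
        "col2": ["word2", None, "word2"],
    },
    index=[1, 2, 3],
)
words = ["word1", "word2"]

stacked = df.stack()
mask = stacked.groupby(level=0).apply(
    # na=False so that any surviving missing cells count as "no match".
    lambda cells: all(
        cells.str.contains(w, case=False, na=False).any() for w in words
    )
)
# Rows missing from the stacked result (if any) are treated as non-matching.
mask = mask.reindex(df.index, fill_value=False).astype(bool)

result = df[mask]
print(result)
```

The reindex step realigns the mask with the original index, so the filter behaves the same whether or not the empty row survived the stack.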
