Select rows from a pandas DataFrame that contain certain values

Question

I have a pandas DataFrame whose entries are all strings:

   A     B      C
1 apple  banana pear
2 pear   pear   apple
3 banana pear   pear
4 apple  apple  pear

and so on. I want to select all the rows that contain a certain string, say 'banana'. I don't know in advance which column it will appear in. Of course, I can write a for loop and iterate over all rows, but is there an easier or faster way to do this?

Answer 1

Introduction

At the heart of selecting rows is a 1D boolean mask (a NumPy array or a pandas Series of booleans) with the same length as df; let's call it mask. With df[mask], boolean indexing then returns the selected rows of df.

Here's our starting df:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
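The mask idea from the introduction can be sketched on its own before any matching is done. This is a minimal illustration, assuming the frame is rebuilt by hand from the printout above; the mask here is written out manually rather than computed.

```python
import pandas as pd

# Rebuild the example frame from the question (index 1..4, all strings).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# A hand-written mask: one boolean per row, aligned on df's index.
mask = pd.Series([True, False, True, False], index=df.index)

# Boolean indexing keeps exactly the rows where the mask is True.
selected = df[mask]
print(selected)
```

The sections that follow differ only in how the mask is computed; the final df[mask] step stays the same.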

I. Match one string

If we need to match just one string, it's straightforward with elementwise equality:

In [42]: df == 'banana'
Out[42]: 
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False

If we need to find ANY match in each row, use the .any method along axis=1:

In [43]: (df == 'banana').any(axis=1)
Out[43]: 
1     True
2    False
3     True
4    False
dtype: bool

To select the corresponding rows:

In [44]: df[(df == 'banana').any(axis=1)]
Out[44]: 
        A       B     C
1   apple  banana  pear
3  banana    pear  pear
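The three steps above can be collected into one small function. This is only a sketch: the name rows_with_value is made up for illustration and is not part of pandas, and the frame is rebuilt by hand from the printout.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_with_value(frame, value):
    """Return the rows of `frame` in which at least one cell equals `value`.

    Hypothetical helper name, used here for illustration only.
    """
    # Elementwise comparison gives a boolean frame; any(axis=1) reduces it
    # to one boolean per row.
    mask = (frame == value).any(axis=1)
    return frame[mask]


result = rows_with_value(df, "banana")
print(result)
```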

II. Match multiple strings

1. Search for ANY match

Here's our starting df:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

NumPy's np.isin works here (or use DataFrame.isin, as listed in other answers) to flag every cell of df that matches any string in the search list. Say we are looking for 'pear' or 'apple' in df:

In [51]: np.isin(df, ['pear','apple'])
Out[51]: 
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])

# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])

# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
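The same ANY search can also be written with DataFrame.isin, which returns a boolean frame instead of a NumPy array. A minimal sketch, assuming the frame is rebuilt by hand from the printout above; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

wanted = ["pear", "apple"]

# isin flags every cell that equals one of the wanted strings;
# any(axis=1) turns that into one boolean per row.
mask = df.isin(wanted).any(axis=1)
any_match = df[mask]

# With a narrower search list, fewer rows survive.
banana_only = df[df.isin(["banana"]).any(axis=1)]

print(any_match)
print(banana_only)
```

Because the mask is a Series that carries the index of df, it stays aligned with the frame even if the rows are later reordered.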

2. Search for ALL matches

Here's our starting df again:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

Now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting:

In [66]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1)
Out[66]: 
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

We have a search list of 2 items, so the result is a 2D mask with number of rows = len(df) and number of cols = number of search items. np.equal.outer compares every cell of df against every search item, giving an array of shape (number of rows, number of columns in df, number of search items); .any(axis=1) then collapses the axis of the df columns. In the above result, the first col is for 'pear' and the second one for 'apple'.

To make things concrete, let's get a mask for three items ['apple','banana', 'pear']:

In [62]: np.equal.outer(df.to_numpy(copy=False),  ['apple','banana', 'pear']).any(axis=1)
Out[62]: 
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])

The columns of this mask are for 'apple', 'banana' and 'pear' respectively.
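To see where that mask comes from, the intermediate shapes can be inspected step by step. The sketch below spells out the broadcasting by hand and checks it against np.equal.outer; it assumes the frame is rebuilt from the printout, and the names cube and per_item are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

items = np.array(["apple", "banana", "pear"], dtype=object)
values = df.to_numpy(dtype=object)   # shape (4, 3): rows x df columns

# Add a trailing axis so that each cell is compared with every search item:
# (4, 3, 1) against (3,) broadcasts to (4, 3, 3).
cube = values[:, :, None] == items

# Collapse the df-column axis: "does this row contain this item anywhere?"
per_item = cube.any(axis=1)          # shape (4, 3): rows x search items

# np.equal.outer builds the same 3D comparison in one call.
outer = np.equal.outer(values, items)

print(cube.shape, per_item.shape)
print(per_item)
```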

Back to the case of 2 search items, we had earlier:

In [66]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1)
Out[66]: 
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

Since we are looking for ALL matches in each row, reduce along the search-item axis with .all:

In [67]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True,  True, False,  True])

Finally, select the rows:

In [70]: df[np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]: 
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
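For comparison, the ALL search can also be expressed with pandas operations only, by building one row mask per search item and combining them with a logical AND. This swaps NumPy broadcasting for a plain Python loop over the search items, so it is a different technique from the one above; the name rows_with_all is hypothetical.

```python
from functools import reduce
import operator

import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)


def rows_with_all(frame, items):
    """Rows of `frame` that contain every value in `items` somewhere.

    Hypothetical helper, for illustration only.
    """
    # One boolean Series per search item: does the row contain it anywhere?
    per_item = [(frame == item).any(axis=1) for item in items]
    # Start from an all-True mask so that an empty `items` keeps every row.
    start = pd.Series(True, index=frame.index)
    mask = reduce(operator.and_, per_item, start)
    return frame[mask]


both = rows_with_all(df, ["pear", "apple"])
print(both)
```

The loop runs once per search item rather than once per row, so it should stay cheap as long as the search list is short.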

Answer 2

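For a single search value, reduce the elementwise comparison to one boolean per row:

df[(df.values == "banana").any(axis=1)]

or, to keep every row and replace the non-matching cells with NaN:

df[df.isin(['banana'])]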

For multiple search terms:

df[((df.values == "banana") | (df.values == "apple")).any(axis=1)]

or, again masking the non-matching cells with NaN:

df[df.isin(['banana', "apple"])]

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN

From Divakar's answer: only the rows that contain both values are returned.

select_rows(df,['apple','banana'])

 #         A       B     C
 #   0  apple  banana  pear

Answer 3

You can create a boolean mask by comparing the entire df against your string, index df with it, and call dropna with how='all' to drop the rows where your string doesn't appear in any column:

In [59]:
df[df == 'banana'].dropna(how='all')

Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

To test for multiple values you can use multiple masks:

In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana

Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

In [91]:    
apple = df[(df=='apple')].dropna(how='all')
apple

Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN

You can use index.intersection to select just the index values common to both:

In [93]:
df.loc[apple.index.intersection(banana.index)]

Out[93]:
       A       B     C
1  apple  banana  pear
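Put together as a script, the dropna and intersection steps look roughly like this. It is a sketch that rebuilds the frame from the question by hand; the variable names common and both are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# Mask non-matching cells with NaN, then drop rows that are entirely NaN.
banana = df[df == "banana"].dropna(how="all")
apple = df[df == "apple"].dropna(how="all")

# Index labels present in both intermediate frames.
common = apple.index.intersection(banana.index)

# Look the labels up in the original frame to get the full rows back.
both = df.loc[common]
print(both)
```

Note that the intermediate frames only serve to collect index labels; the final lookup goes back to df so that the unmasked values are returned.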

Answer 4

If you want all the rows of df that contain any of the values in values, use:

df[df.isin(values).any(axis=1)]

Example:

In [2]: df
Out[2]: 
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1

In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]: 
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
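The example above can be reproduced as a standalone script. This is a minimal sketch: the frame is typed in from the printout, the search values are given as a list, and axis is passed by keyword because recent pandas versions no longer accept it positionally in any().

```python
import pandas as pd

# Integer frame copied from the printout above (default 0..4 index).
df = pd.DataFrame(
    [
        [7, 4, 9],
        [8, 2, 7],
        [1, 9, 7],
        [3, 8, 5],
        [5, 1, 1],
    ]
)

values = [1, 9, 123]

# isin works for any cell type, not only strings; 123 simply never matches.
result = df[df.isin(values).any(axis=1)]
print(result)
```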

4 answers

Answer 1
18, accepted, 2016-07-04 13:41:05

Answer 2
17, 2016-07-04 15:06:21

Answer 3
3, 2016-07-04 13:15:25

Answer 4
2, 2019-12-13 11:24:39
