Select rows containing certain values from pandas dataframe

I have a pandas dataframe whose entries are all strings:

   A     B      C
1 apple  banana pear
2 pear   pear   apple
3 banana pear   pear
4 apple  apple  pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?

Introduction

At the heart of selecting rows, we need a 1D mask, or a pandas Series of booleans, with the same length as df; let's call it mask. Then df[mask] gives us the selected rows of df via boolean indexing.
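
As a minimal sketch of that idea, the snippet below rebuilds the frame from the question and applies a hand-written mask. The way df is constructed here (a dict of columns with index labels 1 to 4) is an assumption made to reproduce the printed frame; the mask values are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Assumed reconstruction of the question's data; labels 1..4 match the printout.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# One boolean per row, aligned on df's index.
mask = pd.Series([True, False, True, False], index=df.index)

# Boolean indexing keeps only the rows where the mask is True.
selected = df[mask]
print(selected)
```

Everything that follows is about computing such a mask from the data instead of writing it by hand.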

Here's our starting df:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

I. Match one string

Now, if we need to match just one string, it's straightforward with elementwise equality:

In [42]: df == 'banana'
Out[42]: 
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False

If we need to look for ANY match in each row, use the .any method:

In [43]: (df == 'banana').any(axis=1)
Out[43]: 
1     True
2    False
3     True
4    False
dtype: bool

To select the corresponding rows:

In [44]: df[(df == 'banana').any(axis=1)]
Out[44]: 
        A       B     C
1   apple  banana  pear
3  banana    pear  pear
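
The three steps above can be wrapped into one small helper. This is only a sketch: the name rows_with_value is hypothetical, and the frame is rebuilt under the assumption that it matches the printed df. It tests for exact cell equality, as the transcript does.

```python
import pandas as pd

def rows_with_value(frame, value):
    """Return the rows of `frame` in which at least one cell equals `value`."""
    # Elementwise comparison gives a boolean frame; any(axis=1) collapses
    # it to one boolean per row.
    mask = (frame == value).any(axis=1)
    return frame[mask]

# Assumed reconstruction of the frame shown above.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

result = rows_with_value(df, "banana")
print(result)
```

A value that never occurs simply yields an empty frame with the same columns.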

II. Match multiple strings

1. Search for ANY match

Here's our starting df:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

NumPy's np.isin works here (or use pandas' DataFrame.isin, as listed in other posts) to flag every cell of df that matches any string in the search list. So, say we are looking for 'pear' or 'apple' in df:

In [51]: np.isin(df, ['pear','apple'])
Out[51]: 
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])

# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])

# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
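
Because every row of this frame contains 'pear' or 'apple', the example above returns the whole frame. The sketch below uses a more selective list and compares the NumPy route with the pandas one. The search list, including the absent value "kiwi", is an assumption chosen for illustration, as is the reconstruction of df.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the frame shown above.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# "kiwi" does not occur in df; it is here to show that absent values are harmless.
wanted = ["banana", "kiwi"]

# pandas route: the result is a Series that keeps df's index.
mask_pd = df.isin(wanted).any(axis=1)

# NumPy route: the result is a plain boolean array.
mask_np = np.isin(df.to_numpy(), wanted).any(axis=1)

# Both routes flag the same rows.
assert (mask_pd.to_numpy() == mask_np).all()

print(df[mask_pd])
```

The pandas version keeps the row labels on the mask, which can be convenient when the mask is inspected or combined with other Series.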

2. Search for ALL matches

Here's our starting df again:

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

So, now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting:

In [66]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1)
Out[66]: 
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

So, we have a search list of 2 items and hence a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, the first col is for 'pear' and the second one is for 'apple'.

To make things concrete, let's get a mask for three items ['apple','banana', 'pear']:

In [62]: np.equal.outer(df.to_numpy(copy=False),  ['apple','banana', 'pear']).any(axis=1)
Out[62]: 
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])

The columns of this mask are for 'apple', 'banana', 'pear' respectively.

Back to the case of 2 search items, we had earlier:

In [66]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1)
Out[66]: 
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

Since we are looking for ALL matches in each row:

In [67]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True,  True, False,  True])

Finally, select rows:

In [70]: df[np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]: 
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
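
For comparison, here is a pandas-only sketch of the same ALL-match selection. It is not the broadcasting approach used above: it builds one row mask per search value and combines them with a logical AND. The helper name rows_with_all is hypothetical, and df is rebuilt under the assumption that it matches the printed frame.

```python
import pandas as pd

def rows_with_all(frame, values):
    """Return the rows of `frame` that contain every item of `values` somewhere."""
    # Start with every row selected, then narrow the mask down value by value.
    mask = pd.Series(True, index=frame.index)
    for value in values:
        # True for rows that hold `value` in at least one column.
        mask &= (frame == value).any(axis=1)
    return frame[mask]

# Assumed reconstruction of the frame shown above.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

both = rows_with_all(df, ["pear", "apple"])
print(both)
```

The loop makes one pass over the frame per search value, so the broadcasting version may be preferable for long search lists. With an empty list the helper returns every row.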

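For a single search value:

df[df.values == "banana"]

or

df[df.isin(['banana'])]

For multiple search terms:

df[(df.values == "banana") | (df.values == "apple")]

or (note that indexing with the boolean frame returned by isin keeps every row and replaces the non-matching cells with NaN, as the output shows):

df[df.isin(['banana', "apple"])]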

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN
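
The output above keeps all four rows because a boolean DataFrame used as an indexer masks cells instead of dropping rows. The sketch below contrasts that with row selection. It uses a single search value so the difference in row count is visible; that choice, and the reconstruction of df, are assumptions made for illustration.

```python
import pandas as pd

# Assumed reconstruction of the question's frame.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

wanted = ["banana"]

# Boolean DataFrame with the same shape as df: True where a cell matches.
cell_mask = df.isin(wanted)

# Indexing with the 2D mask keeps every row; non-matching cells become NaN.
masked = df[cell_mask]

# Reducing the mask to one boolean per row selects whole rows, cells intact.
selected = df[cell_mask.any(axis=1)]

print(masked)
print(selected)
```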

From Divakar: rows containing both values are returned.

select_rows(df,['apple','banana'])

 #         A       B     C
 #   0  apple  banana  pear

You can create a boolean mask by comparing the entire df against your string, then call dropna with how='all' to drop the rows where your string doesn't appear in any column:

In [59]:
df[df == 'banana'].dropna(how='all')

Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

To test for multiple values you can use multiple masks:

In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana

Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

In [91]:    
apple = df[(df=='apple')].dropna(how='all')
apple

Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN

You can use index.intersection to keep just the common index values:

In [93]:
df.loc[apple.index.intersection(banana.index)]

Out[93]:
       A       B     C
1  apple  banana  pear
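
The same intersection idea extends to more than two values. The following is a sketch under assumptions: the list name values and the use of functools.reduce are choices made here, not part of the answer, and df is rebuilt to match the question's frame.

```python
from functools import reduce

import pandas as pd

# Assumed reconstruction of the question's frame.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

values = ["apple", "banana"]

# For each value, the index labels of the rows that contain it at least once.
indexes = [df[df == v].dropna(how="all").index for v in values]

# Intersect the label sets pairwise; what survives contains every value.
common = reduce(lambda left, right: left.intersection(right), indexes)

result = df.loc[common]
print(result)
```

Note that reduce needs at least one value in the list; an empty list raises a TypeError here.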

If you want all the rows of df that contain any of the values in values, use:

df[df.isin(values).any(axis=1)]

Example:

In [2]: df                                                                                                                       
Out[2]: 
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1

In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]: 
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
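
For anyone who wants to run the example, here is a self-contained sketch. The frame is rebuilt from the printed values, which is an assumption about how it was created. The axis is passed by keyword, and the last line inverts the mask to get the remaining rows, which is an addition for illustration.

```python
import pandas as pd

# Assumed reconstruction of the numeric frame printed above.
df = pd.DataFrame([[7, 4, 9], [8, 2, 7], [1, 9, 7], [3, 8, 5], [5, 1, 1]])

# 123 does not occur in df; values that are absent simply never match.
values = {1, 9, 123}

# One boolean per row: does the row hold any of the values?
mask = df.isin(values).any(axis=1)

matching = df[mask]   # rows with at least one of the values
others = df[~mask]    # rows with none of the values

print(matching)
print(others)
```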
