Select rows containing certain values from pandas dataframe
I have a pandas dataframe whose entries are all strings:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?
At the heart of selecting rows is a 1D boolean mask, either a NumPy array or a pandas Series, whose length equals the length of df; let's call it mask. With df[mask], we then get the selected rows of df through boolean indexing.
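To make the idea of a mask concrete, here is a minimal sketch. The frame is rebuilt by hand to mirror the one in the question (the construction itself is an assumption, since the question only shows the printed frame), and the mask is written out explicitly rather than computed.

```python
import pandas as pd

# Sample frame mirroring the one in the question (index starts at 1).
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

# A hand-written 1D boolean mask with one entry per row of df.
mask = pd.Series([True, False, True, False], index=df.index)

# Boolean indexing keeps exactly the rows where the mask is True.
selected = df[mask]
print(selected)
```

Every approach in this answer boils down to computing such a mask from the data instead of typing it in.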
Here's our starting df:
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
Now, if we need to match just one string, it's straightforward with elementwise equality:
In [42]: df == 'banana'
Out[42]:
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False
If we need to look for ANY match in each row, use the .any method:
In [43]: (df == 'banana').any(axis=1)
Out[43]:
1     True
2    False
3     True
4    False
dtype: bool
To select the corresponding rows:
In [44]: df[(df == 'banana').any(axis=1)]
Out[44]:
        A       B     C
1   apple  banana  pear
3  banana    pear  pear
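The single-string steps can be wrapped into a small reusable function. This is only a sketch: the helper name rows_with_value is made up for illustration, and the frame is rebuilt by hand to match the question. DataFrame.eq is the method form of ==.

```python
import pandas as pd

def rows_with_value(frame, value):
    """Return the rows of frame in which at least one cell equals value."""
    # Elementwise comparison, then reduce across columns to one bool per row.
    mask = frame.eq(value).any(axis=1)
    return frame[mask]

# Sample frame mirroring the one in the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

result = rows_with_value(df, "banana")
print(result)
```

A value that never occurs simply produces an all-False mask, so the function returns an empty frame rather than raising.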
1. Search for ANY match
Here's our starting df:
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
NumPy's np.isin would work here (or use pandas' DataFrame.isin, as listed in other posts) to get all matches from the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df:
In [51]: np.isin(df, ['pear','apple'])
Out[51]:
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])
# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])
# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
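For comparison, the same ANY-match selection can be sketched with pandas' own DataFrame.isin, which returns a boolean frame that keeps the index and column labels. The frame construction and the variable names below are assumptions for illustration.

```python
import pandas as pd

# Sample frame mirroring the one in the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

search = ["pear", "apple"]

# True wherever a cell equals one of the search strings.
cell_hits = df.isin(search)

# One bool per row: does the row contain at least one search string?
row_mask = cell_hits.any(axis=1)
any_match = df[row_mask]

# With a single search string the selection narrows to rows 1 and 3.
banana_only = df[df.isin(["banana"]).any(axis=1)]
print(any_match)
print(banana_only)
```

Because every row of this sample contains 'pear' or 'apple', the first selection returns the whole frame, just like the np.isin version.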
2. Search for ALL matches
Here's our starting df again:
In [42]: df
Out[42]:
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
So, now we are looking for rows that contain BOTH of the strings in, say, ['pear','apple']. We will make use of NumPy broadcasting:
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])
So, we have a search list of 2 items and hence a 2D mask with number of rows = len(df) and number of cols = number of search items. Thus, in the above result, the first col is for 'pear' and the second one is for 'apple'.
To make things concrete, let's get a mask for three items ['apple','banana', 'pear']:
In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
Out[62]:
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])
The columns of this mask are for 'apple', 'banana' and 'pear' respectively.
Back to the case of 2 search items, we had earlier:
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])
Since we are looking for ALL matches in each row:
In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True,  True, False,  True])
Finally, select rows:
In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]:
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
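The ALL-match logic can also be sketched without the outer comparison, by building one row mask per search string and AND-ing them together. This swaps in a plain pandas technique in place of the NumPy broadcasting shown above; the helper name rows_with_all and the frame construction are assumptions for illustration.

```python
from functools import reduce
import operator

import pandas as pd

def rows_with_all(frame, values):
    """Return the rows of frame that contain every string in values."""
    # One boolean Series per search value: does the row contain it anywhere?
    per_value = [frame.eq(v).any(axis=1) for v in values]
    # AND the masks together, starting from an all-True mask.
    start = pd.Series(True, index=frame.index)
    mask = reduce(operator.and_, per_value, start)
    return frame[mask]

# Sample frame mirroring the one in the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

pear_and_apple = rows_with_all(df, ["pear", "apple"])
apple_and_banana = rows_with_all(df, ["apple", "banana"])
print(pear_and_apple)
print(apple_and_banana)
```

This loops over the search values in Python, so it trades some speed for readability when the search list is long; with an empty list it returns every row.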
For a single search value:
df[df.values == "banana"]
or
df[df.isin(['banana'])]
For multiple search terms:
df[(df.values == "banana")|(df.values == "apple" ) ]
or
df[df.isin(['banana', "apple"])]
# The isin form keeps every row and replaces non-matching cells with NaN:
#         A       B      C
# 1   apple  banana    NaN
# 2     NaN     NaN  apple
# 3  banana     NaN    NaN
# 4   apple   apple    NaN
From Divakar: rows containing both values are returned.
select_rows(df,['apple','banana'])
#        A       B     C
# 0  apple  banana  pear
You can create a boolean mask by comparing the entire df against your string, then call dropna with how='all' to drop the rows in which your string appears in none of the columns:
In [59]:
df[df == 'banana'].dropna(how='all')
Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN
To test for multiple values you can use multiple masks:
In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana
Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN
In [91]:
apple = df[(df=='apple')].dropna(how='all')
apple
Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN
You can use index.intersection to select just the index values that the two results have in common:
In [93]:
df.loc[apple.index.intersection(banana.index)]
Out[93]:
       A       B     C
1  apple  banana  pear
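The two-mask intersection generalizes to any number of search values. The following is a sketch of one way to do that with functools.reduce and Index.intersection; the variable names and the frame construction are assumptions, not part of the original answer.

```python
from functools import reduce

import pandas as pd

# Sample frame mirroring the one in the question.
df = pd.DataFrame(
    {
        "A": ["apple", "pear", "banana", "apple"],
        "B": ["banana", "pear", "pear", "apple"],
        "C": ["pear", "apple", "pear", "pear"],
    },
    index=[1, 2, 3, 4],
)

values = ["apple", "banana"]

# For each value, the index labels of the rows that contain it somewhere.
indexes = [df[df == v].dropna(how="all").index for v in values]

# Keep only the labels present in every per-value index.
common = reduce(lambda left, right: left.intersection(right), indexes)

# Look the surviving labels up in the original frame to get full rows back.
both = df.loc[common]
print(both)
```

Indexing the original df with the common labels is what restores the cells that the masked frames had turned into NaN.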
If you want all the rows of df that contain any of the values in values, use:
df[df.isin(values).any(axis=1)]
Example:
In [2]: df
Out[2]:
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1
In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]:
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1