How do I select rows from a DataFrame based on column values?
How can I select rows from a DataFrame based on values in some column in Pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses,
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which raises a "The truth value of a Series is ambiguous" error.
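To make the precedence issue concrete, here is a minimal sketch on a small made-up frame (the column name x and the bounds A and B are illustrative assumptions, not taken from the question). It shows the parenthesized form working, the unparenthesized form raising ValueError, and Series.between as an alternative way to write an inclusive range check.

```python
import pandas as pd

# Hypothetical data; 'x', A and B are placeholders for illustration.
df = pd.DataFrame({'x': [1, 5, 10, 15, 20]})
A, B = 5, 15

# Parenthesized comparisons are combined element-wise, as intended.
in_range = df.loc[(df['x'] >= A) & (df['x'] <= B)]

# Without parentheses, A & df['x'] is evaluated first and the comparisons
# are chained, which asks for the truth value of a whole Series.
try:
    df.loc[df['x'] >= A & df['x'] <= B]
    raised = False
except ValueError:
    raised = True

# Series.between is inclusive on both ends by default.
same = df.loc[df['x'].between(A, B)]

print(in_range['x'].tolist())
print(raised)
```

between avoids the parentheses entirely, at the cost of only covering the simple "lower <= value <= upper" case.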
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
A C D
B
one foo 0 0
one bar 1 2
one foo 6 12
or, to include multiple values from the index, use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
A C D
B
one foo 0 0
one bar 1 2
two foo 2 4
two foo 4 8
two bar 5 10
one foo 6 12
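The worked examples above exercise == and isin but not the negated or range forms described earlier. As a minimal sketch, the same sample frame can be rebuilt (with its default integer index) and filtered with !=, ~isin and a two-sided range:

```python
import numpy as np
import pandas as pd

# Same sample data as above, rebuilt so the default RangeIndex is used.
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Rows where A is anything other than 'foo'.
not_foo = df.loc[df['A'] != 'foo']

# Rows where B is neither 'one' nor 'three' (negated isin).
not_one_or_three = df.loc[~df['B'].isin(['one', 'three'])]

# Rows where 2 <= C <= 4; each comparison needs its own parentheses.
c_between = df.loc[(df['C'] >= 2) & (df['C'] <= 4)]

print(not_foo)
print(not_one_or_three)
print(c_between)
```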
There are several ways to select rows from a Pandas dataframe:
1. Boolean indexing (df[df['col'] == value])
2. Positional indexing (df.iloc[...])
3. Label indexing (df.xs(...))
4. df.query(...) API
Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.
(Note on performance: For each base type, we can keep things simple by using the Pandas API or we can venture outside the API, usually into NumPy, and speed things up.)
Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case, column_name == some_value, and include some other common use cases.
Borrowing from @unutbu:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
Boolean indexing requires finding the truth value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this series, an array of truth values, mask. We'll do so here as well.
mask = df['A'] == 'foo'
We can then use this mask to slice or index the data frame:
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
This is one of the simplest ways to accomplish this task, and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Label indexing can be very handy, but in this case, we are again doing more work for no benefit:
df.set_index('A', append=True, drop=False).xs('foo', level=1)
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
df.query() API
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
df.query('A == "foo"')
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
My preference is to use the Boolean mask.
Actual improvements can be made by modifying how we create our Boolean mask.
mask alternative 1
Use the underlying NumPy array and forgo the overhead of creating another pd.Series:
mask = df['A'].values == 'foo'
I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask:
%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'
5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Evaluating the mask with the NumPy array is ~30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly due to avoiding the overhead of building an index and a corresponding pd.Series object.
Next, we'll look at the timing for slicing with one mask versus the other.
mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]
219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
mask alternative 2
We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!
Instead of df[mask] we will do this:
pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. This requires the astype(df.dtypes) call and kills any potential performance gains.
%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)
216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, if the data frame is not of mixed type, this is a very useful way to do it.
Given
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
d1
A B C D E
0 0 2 7 3 8
1 7 0 6 8 6
2 0 2 0 4 9
3 7 3 2 4 3
4 3 6 7 7 4
5 5 3 7 5 9
6 8 7 6 4 7
7 6 2 6 6 5
8 2 8 7 5 8
9 4 7 6 1 5
%%timeit
mask = d1['A'].values == 7
d1[mask]
179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Versus
%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)
87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We cut the time in half.
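As a sanity check on the reconstruction idea, the following sketch (using an arbitrary all-integer frame that is an assumption of this example, not the seeded d1 above) confirms that rebuilding from .values gives the same rows and dtypes as df[mask] when every column shares one dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical single-dtype frame; the values are arbitrary.
d1 = pd.DataFrame(np.arange(50).reshape(10, 5) % 7, columns=list('ABCDE'))

mask = d1['A'].values == 3

# Standard boolean indexing.
standard = d1[mask]

# Reconstruction from the underlying NumPy array, index and columns.
rebuilt = pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

print(rebuilt.equals(standard))
print((rebuilt.dtypes == d1.dtypes).all())
```

With mixed dtypes the second check would fail without the astype(df.dtypes) step described above, which is where the speed advantage disappears.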
mask alternative 3
@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. Turns out, this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.
mask = df['A'].isin(['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d:
mask = np.in1d(df['A'].values, ['foo'])
df[mask]
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
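One caveat for readers on a current install: as far as I know, np.in1d is deprecated in NumPy 2.0 and later in favor of np.isin. A minimal sketch of the same mask built with np.isin, on a cut-down copy of the sample frame, would be:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'C': np.arange(8)})

# np.isin tests each element of the first array for membership in the second.
mask = np.isin(df['A'].values, ['foo'])
selected = df[mask]

print(selected)
```

The timings in this answer were measured with np.in1d; whether np.isin performs identically has not been checked here.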
Timing
I'll include other concepts mentioned in other posts as well for reference.
Code below
Each column in this table represents a different length data frame over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.
res.div(res.min())
10 30 100 300 1000 3000 10000 30000
mask_standard 2.156872 1.850663 2.034149 2.166312 2.164541 3.090372 2.981326 3.131151
mask_standard_loc 1.879035 1.782366 1.988823 2.338112 2.361391 3.036131 2.998112 2.990103
mask_with_values 1.010166 1.000000 1.005113 1.026363 1.028698 1.293741 1.007824 1.016919
mask_with_values_loc 1.196843 1.300228 1.000000 1.000000 1.038989 1.219233 1.037020 1.000000
query 4.997304 4.765554 5.934096 4.500559 2.997924 2.397013 1.680447 1.398190
xs_label 4.124597 4.272363 5.596152 4.295331 4.676591 5.710680 6.032809 8.950255
mask_with_isin 1.674055 1.679935 1.847972 1.724183 1.345111 1.405231 1.253554 1.264760
mask_with_in1d 1.000000 1.083807 1.220493 1.101929 1.000000 1.000000 1.000000 1.144175
You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.
res.T.plot(loglog=True)
Functions
def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]
def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]
def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]
def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]
def query(df):
    return df.query('A == "foo"')
def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)
def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]
def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]
Testing
from timeit import timeit
res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)
for j in res.columns:
    # Repeat the sample frame j times to scale the row count
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)
Special Timing
Looking at the special case when we have a single non-object dtype for the entire data frame.
Code below
spec.div(spec.min())
10 30 100 300 1000 3000 10000 30000
mask_with_values 1.009030 1.000000 1.194276 1.000000 1.236892 1.095343 1.000000 1.000000
mask_with_in1d 1.104638 1.094524 1.156930 1.072094 1.000000 1.000000 1.040043 1.027100
reconstruct 1.000000 1.142838 1.000000 1.355440 1.650270 2.222181 2.294913 3.406735
Turns out, reconstruction isn't worth it past a few hundred rows.
spec.T.plot(loglog=True)
Functions
np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))
def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]
def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]
def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, ['foo'])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)
spec = pd.DataFrame(
index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
dtype=float
)
Testing
for j in spec.columns:
    d = pd.concat([df] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)
The Pandas equivalent to
select * from table where column_name = some_value
is
table[table.column_name == some_value]
Multiple conditions:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == some_value | column_name2 == some_value2')
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
# bar foo
# 0 333 100
# 1 444 111
# 2 555 222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
# bar foo
# 2 555 222
In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
Multiple conditions are also possible:
df[(df.foo == 222) | (df.bar == 444)]
# bar foo
# 1 444 111
# 2 555 222
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
df.query('foo == 222 | bar == 444')
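The claim that both spellings give the same result can be checked directly. This is a small sketch on the same three-row data; the variable names by_mask and by_query are introduced only for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'foo': [100, 111, 222], 'bar': [333, 444, 555]})

# Boolean-mask form: each comparison needs its own parentheses.
by_mask = df[(df.foo == 222) | (df.bar == 444)]

# query form: the expression is a string and needs no parentheses here.
by_query = df.query('foo == 222 | bar == 444')

print(by_query)
print(by_query.equals(by_mask))
```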
I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
a b c
0 0.687704 0.582314 0.281645
1 0.250846 0.610021 0.420121
2 0.624328 0.401816 0.932146
3 0.011763 0.022921 0.244186
4 0.590198 0.325680 0.890392
5 0.598892 0.296424 0.007312
6 0.634625 0.803069 0.123872
7 0.924168 0.325076 0.303746
8 0.116822 0.364564 0.454607
9 0.986142 0.751953 0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
a b c
3 0.011763 0.022921 0.244186
8 0.116822 0.364564 0.454607
You can also access variables in the environment by prepending an @.
exclude = ('red', 'orange')
df.query('color not in @exclude')
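The snippet above assumes a frame with a color column, which is not defined in this answer. A self-contained sketch, with a made-up frame (the color and n columns and their values are assumptions for illustration), might look like this:

```python
import pandas as pd

# Hypothetical frame; the 'color' column and its values are made up.
df = pd.DataFrame({'color': ['red', 'green', 'orange', 'blue'],
                   'n': [1, 2, 3, 4]})
exclude = ('red', 'orange')

# @exclude refers to the Python variable, not to a column named 'exclude'.
kept = df.query('color not in @exclude')

# Equivalent boolean-mask spelling, for comparison.
same = df[~df['color'].isin(exclude)]

print(kept)
```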
.query with pandas >= 0.25.0:
Since pandas >= 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names which have spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`) - see GitHub:
# Example dataframe
df = pd.DataFrame({'Sender email':['ex@example.com', "reply@shop.com", "buy@shop.com"]})
Sender email
0 ex@example.com
1 reply@shop.com
2 buy@shop.com
Using .query with method str.endswith:
df.query('`Sender email`.str.endswith("@shop.com")')
Output
Sender email
1 reply@shop.com
2 buy@shop.com
We can also use local variables by prefixing them with an @ in our query:
domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')
Output
Sender email
1 reply@shop.com
2 buy@shop.com
For selecting only specific columns out of multiple columns for a given value in Pandas:
select col_name1, col_name2 from table where column_name = some_value
df.loc[df['column_name'] == some_value, [col_name1, col_name2]]
df.query('column_name == some_value')[[col_name1, col_name2]]
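As a concrete sketch of the two forms, using the sample frame from the earlier answers (the choice of 'bar' and of columns B and D is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Row filter and column selection in a single .loc call.
via_loc = df.loc[df['A'] == 'bar', ['B', 'D']]

# Row filter with query, then column selection on the result.
via_query = df.query('A == "bar"')[['B', 'D']]

print(via_loc)
```

The .loc form does both steps in one indexing operation, which also makes it the safer choice if the result is later assigned to.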
Faster results can be achieved using numpy.where.
For example, with unutbu's setup -
In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]:
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
Timing comparisons:
In [68]: %timeit df.iloc[np.where(df.A.values=='foo')] # fastest
1000 loops, best of 3: 380 µs per loop
In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop
In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop
In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop
In [74]: %timeit df.query('(A=="foo")') # slowest
1000 loops, best of 3: 1.71 ms per loop
In newer versions of Pandas, inspired by the documentation (Viewing data):
df[df["column_name"] == some_value] # Scalar, True/False..
df[df["column_name"] == "some_value"] # String
Combine multiple conditions by putting each clause in parentheses, (), and combining them with & and | (and/or). Like this:
df[(df["column_name"] == "some_value1") & (df["column_name2"] == "some_value2")]
Other filters
pandas.notna(df["column_name"]) == True # Not NaN
df['column_name'].str.contains("text") # Search for "text"
df['column_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase
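Each of these expressions produces a boolean Series that still has to be passed to df[...] to select rows. A minimal sketch on a made-up name column follows; it also uses the case and na parameters of str.contains, since a mask that contains missing values cannot be used for indexing:

```python
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({'name': ['Some Text', None, 'other', 'TEXT only']})

# Keep rows where 'name' is not missing.
not_missing = df[df['name'].notna()]

# Case-insensitive substring search; na=False treats missing values as no match.
has_text = df[df['name'].str.contains('text', case=False, na=False)]

print(not_missing)
print(has_text)
```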
Here is a simple example
from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
'Cost':[333,444,555]}
df = DataFrame(d)
# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print(mask)
# Result:
# 0 False
# 1 True
# 2 False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
# Cost Revenue
# 1 444 111
To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with the specified column having a particular value. E.g.
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1)
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
Original dataframe:
A B
0 foo one
1 bar one
2 foo two
3 bar three
4 foo two
5 bar two
6 foo one
7 foo three
Sub dataframe where B is two:
A B
0 foo two
1 foo two
2 bar two
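A slightly shorter variant, sketched here as an alternative rather than as the answer's own code, passes drop=True to reset_index so the old index is discarded directly and no extra column has to be dropped afterwards:

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})

# reset_index(drop=True) renumbers the rows without adding an 'index' column.
via_group = df.groupby('B').get_group('two').reset_index(drop=True)

# The same rows selected with a plain boolean mask, for comparison.
via_mask = df[df['B'] == 'two'].reset_index(drop=True)

print(via_group)
```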
You can also use .apply:
df.apply(lambda row: row[df['B'].isin(['one','three'])])
Despite the lambda's parameter name, apply with the default axis=0 passes each column (not each row) to the function; the boolean mask then selects the matching rows within each column.
The output is
A B C D
0 foo one 0 0
1 bar one 1 2
3 bar three 3 6
6 foo one 6 12
7 foo three 7 14
The result is the same as using the approach mentioned by @unutbu:
df[df['B'].isin(['one','three'])]
If you want to query your dataframe repeatedly and speed is important to you, the best thing is to convert your dataframe to a dictionary; by doing this you can make queries thousands of times faster.
my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')
After making the my_dict dictionary you can go through:
if some_value in my_dict.keys():
    my_result = my_dict[some_value]
If you have duplicated values in column_name you can't make a dictionary, but you can use:
my_result = my_df.loc[some_value]
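To show what the dictionary looks like, here is a minimal sketch with a made-up frame whose key column has unique values (the names key, x and y are assumptions for illustration). Each lookup returns a plain dict mapping the remaining column names to that row's values:

```python
import pandas as pd

# Hypothetical frame with a unique 'key' column.
df = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [10, 20, 30]})

my_df = df.set_index('key')
# orient='index' gives {index_value: {column_name: value, ...}, ...}
my_dict = my_df.to_dict('index')

# dict.get returns None instead of raising when the key is absent.
row = my_dict.get('b')
missing = my_dict.get('z')

print(row)
print(missing)
```

The result is a dict rather than a DataFrame, so this suits repeated single-key lookups more than further pandas processing.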
@unutbu, for your provided answer, shouldn't the following code work?
print(df.loc[(df['A'] == 'foo') & (df['B'] == 'one') & (df['D'] == 12)])
yield:
A B C D
6 foo one 6 12
With duckdb we can query pandas DataFrames with SQL statements, in a highly performant way.
Since the question is "How do I select rows from a DataFrame based on column values?", and the example in the question is a SQL query, this answer looks logical in this topic.
Example:
In [1]: import duckdb
In [2]: import pandas as pd
In [3]: con = duckdb.connect()
In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})
In [5]: df
Out[5]:
A B
0 0 11
1 1 12
2 2 13
3 3 14
4 4 15
5 5 16
6 6 17
7 7 18
8 8 19
9 9 20
10 10 21
In [6]: results = con.execute("SELECT * FROM df where A > 2").df()
In [7]: results
Out[7]:
A B
0 3 14
1 4 15
2 5 16
3 6 17
4 7 18
5 8 19
6 9 20
7 10 21
Great answers. Only, when the size of the dataframe approaches a million rows, many of the methods tend to take ages when using df[df['col']==val]. I wanted to have all possible values of "another_column" that correspond to specific values in "some_column" (in this case in a dictionary). This worked and was fast.
import datetime
s = datetime.datetime.now()
my_dict = {}
for i, my_key in enumerate(df['some_column'].values):
    if i % 100 == 0:
        print(i)  # to see the progress
    if my_key not in my_dict.keys():
        # first time this key is seen: start its list of values
        my_dict[my_key] = {}
        my_dict[my_key]['values'] = [df.iloc[i]['another_column']]
    else:
        my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
e = datetime.datetime.now()
print('operation took ' + str(e - s) + ' seconds')
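For comparison, a groupby-based sketch can collect the same information without an explicit Python loop. This is a different technique from the loop above, and it produces a flat {key: [values]} mapping rather than the nested my_dict[key]['values'] layout; the sample data is made up:

```python
import pandas as pd

# Hypothetical data using the column names from the loop above.
df = pd.DataFrame({'some_column': ['a', 'b', 'a', 'c', 'b'],
                   'another_column': [1, 2, 3, 4, 5]})

# Group rows by key, gather each group's values into a list, convert to a dict.
values_by_key = (df.groupby('some_column')['another_column']
                   .apply(list)
                   .to_dict())

print(values_by_key)
```

No timing is claimed here; which approach is faster on a given dataset would need to be measured.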
You can use loc (square brackets) with a function:
# Series
s = pd.Series([1, 2, 3, 4])
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]
Output:
1 2
2 3
3 4
dtype: int64
or
# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]
Output:
A B
1 2 20
2 3 30
The advantage of this method is that you can chain selection with previous operations. For example:
df.mul(2).loc[lambda x: x['A'] > 3, 'B']
# (df * 2).loc[lambda x: x['A'] > 3, 'B']
vs
df_temp = df * 2
df_temp.loc[df_temp['A'] > 3, 'B']
Output:
1 40
2 60
Name: B, dtype: int64
numexpr to speed up query() calls
The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.
For larger dataframes (where performance actually matters), df.query() with the numexpr engine performs much faster than df[mask]. In particular, it performs better for the following cases.
Logical and/or comparison operators on columns of strings
If a column of strings is compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it's 30% faster [1] and for a dataframe with 800k rows, it's 60% faster. [2]
df[df.A == 'foo']
df.query("A == 'foo'") # <--- performs 30%-60% faster
This gap increases as the number of operations increases (if 4 comparisons are chained, df.query() is about 1.8-2.3 times faster than df[mask]) [1,2] and/or the dataframe length increases. [2]
Multiple operations on numeric columns
If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's 20% faster [1] and for a frame with 800k rows, it's 2 times faster. [2]
df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1") # <--- performs 20%-100% faster.
This gap in performance increases as the number of operations increases and/or the dataframe length increases. [2]
The following plot shows how the methods perform as the dataframe length increases. [3]
.values to call pandas methods inside query()
Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).
For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) and accessing the values attribute on the resulting Series makes numexpr evaluate its underlying numpy array and query() works. Setting the engine parameter to 'python' also works.
df.query('B.floordiv(2).values <= 3') # or
df.query('B.floordiv(2).le(3).values') # or
df.query('B.floordiv(2).le(3)', engine='python')
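As a quick check that the method-call form selects the same rows as ordinary integer division, here is a small sketch on a made-up column. It uses engine='python' explicitly so that it does not depend on whether numexpr is installed:

```python
import pandas as pd

# Hypothetical integer column.
df = pd.DataFrame({'B': [1, 4, 7, 8, 10]})

# Plain boolean mask using the // operator.
by_mask = df[df['B'] // 2 <= 3]

# Same condition via method calls inside query(), evaluated by the python engine.
by_query = df.query('B.floordiv(2).le(3)', engine='python')

print(by_query)
```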
The same applies to Erfan's suggested method calls as well. The code in their answer raises a TypeError as is (as of Pandas 1.3.4) for the numexpr engine, but accessing the .values attribute makes it work.
df.query('`Sender email`.str.endswith("@shop.com")') # <--- TypeError
df.query('`Sender email`.str.endswith("@shop.com").values') # OK
[1]: Benchmark code using a frame with 80k rows
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[2]: Benchmark code using a frame with 800k rows
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
[3]: Code used to produce the performance graphs of the two methods for strings and numbers.
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
setup=constructor,
kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Multiple mathematical operations on numbers',
equality_check=pd.DataFrame.equals);
plot(
setup=constructor,
kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
n_range=[2**k for k in range(4, 24)],
xlabel='Rows in DataFrame',
title='Comparison operation on strings',
equality_check=pd.DataFrame.equals);