How do I select rows from a DataFrame based on column values?

How can I select rows from a DataFrame based on values in some column in Pandas?

In SQL, I would use:

SELECT *
FROM table
WHERE column_name = some_value

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses,

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which results in a "Truth value of a Series is ambiguous" error.
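A minimal sketch of the precedence problem, using a small made-up frame (the data and the bounds A and B are assumptions for illustration only). It also shows Series.between, which is inclusive on both ends by default and avoids the parentheses question for range checks:

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"column_name": [1, 5, 10, 15]})
A, B = 3, 12

# With parentheses: each comparison is evaluated first, then combined element-wise
in_range = df.loc[(df["column_name"] >= A) & (df["column_name"] <= B)]

# Without parentheses, & is applied first and the rest becomes a chained
# comparison, which calls bool() on a Series and raises ValueError
try:
    df["column_name"] >= A & df["column_name"] <= B
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Series.between(A, B) expresses the same inclusive range check
same_rows = df.loc[df["column_name"].between(A, B)]

print(in_range)
print(error_message)
```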


To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
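The negated forms described earlier (!= and ~ combined with isin) can be sketched on the same example frame. This is only an illustration of those two expressions, not additional output from the original answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Rows where A is anything other than 'foo'
not_foo = df.loc[df['A'] != 'foo']

# Rows where B is neither 'one' nor 'three'
not_one_or_three = df.loc[~df['B'].isin(['one', 'three'])]

print(not_foo)
print(not_one_or_three)
```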

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index, use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

There are several ways to select rows from a Pandas dataframe:

  1. Boolean indexing (df[df['col'] == value])
  2. Positional indexing (df.iloc[...])
  3. Label indexing (df.xs(...))
  4. df.query(...) API

Below I show you examples of each, with advice on when to use certain techniques. Assume our criterion is column 'A' == 'foo'.

(Note on performance: For each base type, we can keep things simple by using the Pandas API, or we can venture outside the API, usually into NumPy, and speed things up.)


Setup

The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case, column_name == some_value, and include some other common use cases.

Borrowing from @unutbu:

import pandas as pd, numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

1. Boolean indexing

Boolean indexing requires finding the truth value of each row's 'A' column being equal to 'foo', then using those truth values to identify which rows to keep. Typically, we'd name this Series, an array of truth values, mask. We'll do so here as well.

mask = df['A'] == 'foo'

We can then use this mask to slice or index the data frame:

df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

This is one of the simplest ways to accomplish this task, and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.


2. Positional indexing

Positional indexing (df.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.

mask = df['A'] == 'foo'
pos = np.flatnonzero(mask)
df.iloc[pos]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

3. Label indexing

Label indexing can be very handy, but in this case, we are again doing more work for no benefit:

df.set_index('A', append=True, drop=False).xs('foo', level=1)

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

4. df.query() API

pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data the query is very efficient: more so than the standard approach and of similar magnitude to my best suggestion.

df.query('A == "foo"')

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

My preference is to use the Boolean mask.

Actual improvements can be made by modifying how we create our Boolean mask.

mask alternative 1: Use the underlying NumPy array and forgo the overhead of creating another pd.Series

mask = df['A'].values == 'foo'

I'll show more complete time tests at the end, but just take a look at the performance gains we get using the sample data frame. First, we look at the difference in creating the mask:

%timeit mask = df['A'].values == 'foo'
%timeit mask = df['A'] == 'foo'

5.84 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
166 µs ± 4.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Evaluating the mask with the NumPy array is ~30 times faster. This is partly due to NumPy evaluation often being faster. It is also partly because it avoids the overhead of building an index and a corresponding pd.Series object.

Next, we'll look at the timing for slicing with one mask versus the other.

mask = df['A'].values == 'foo'
%timeit df[mask]
mask = df['A'] == 'foo'
%timeit df[mask]

219 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
239 µs ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The performance gains aren't as pronounced. We'll see if this holds up over more robust testing.
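Outside IPython, where %timeit is not available, the same comparison can be approximated with the standard-library timeit module. This is only a sketch: the repetition count of 1000 is an arbitrary choice, and absolute numbers will differ by machine and library version.

```python
from timeit import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Both expressions mark the same rows; only the container type differs
series_mask = df['A'] == 'foo'
array_mask = df['A'].values == 'foo'

# Time mask creation only, 1000 repetitions each (arbitrary count)
t_series = timeit(lambda: df['A'] == 'foo', number=1000)
t_array = timeit(lambda: df['A'].values == 'foo', number=1000)

print(f"Series mask: {t_series:.4f} s, array mask: {t_array:.4f} s")
```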


mask alternative 2: We could have reconstructed the data frame as well. There is a big caveat when reconstructing a dataframe: you must take care of the dtypes when doing so!

Instead of df[mask] we will do this:

pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

If the data frame is of mixed type, which our example is, then when we get df.values the resulting array is of dtype object and consequently, all columns of the new data frame will be of dtype object. This requires the astype(df.dtypes) call and kills any potential performance gains.

%timeit df[mask]
%timeit pd.DataFrame(df.values[mask], df.index[mask], df.columns).astype(df.dtypes)

216 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.43 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, if the data frame is not of mixed type, this is a very useful way to do it.

Given

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

d1

   A  B  C  D  E
0  0  2  7  3  8
1  7  0  6  8  6
2  0  2  0  4  9
3  7  3  2  4  3
4  3  6  7  7  4
5  5  3  7  5  9
6  8  7  6  4  7
7  6  2  6  6  5
8  2  8  7  5  8
9  4  7  6  1  5

%%timeit
mask = d1['A'].values == 7
d1[mask]

179 µs ± 8.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Versus

%%timeit
mask = d1['A'].values == 7
pd.DataFrame(d1.values[mask], d1.index[mask], d1.columns)

87 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We cut the time in half.


mask alternative 3

@unutbu also shows us how to use pd.Series.isin to account for each element of df['A'] being in a set of values. This evaluates to the same thing if our set of values is a set of one value, namely 'foo'. But it also generalizes to include larger sets of values if needed. It turns out this is still pretty fast even though it is a more general solution. The only real loss is in intuitiveness for those not familiar with the concept.

mask = df['A'].isin(['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

However, as before, we can utilize NumPy to improve performance while sacrificing virtually nothing. We'll use np.in1d:

mask = np.in1d(df['A'].values, ['foo'])
df[mask]

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
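As far as I know, np.in1d is deprecated as of NumPy 2.0 in favour of np.isin, which accepts the same two arguments in this case. A minimal sketch of the same mask with the newer name, assuming the example frame from the setup above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# np.isin tests each element of the first array for membership in the second
mask = np.isin(df['A'].to_numpy(), ['foo'])
selected = df[mask]
print(selected)
```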

Timing

I'll include other concepts mentioned in other posts as well for reference.

Code below

Each column in this table represents a data frame of a different length over which we test each function. Each column shows relative time taken, with the fastest function given a base index of 1.0.

res.div(res.min())

                         10        30        100       300       1000      3000      10000     30000
mask_standard         2.156872  1.850663  2.034149  2.166312  2.164541  3.090372  2.981326  3.131151
mask_standard_loc     1.879035  1.782366  1.988823  2.338112  2.361391  3.036131  2.998112  2.990103
mask_with_values      1.010166  1.000000  1.005113  1.026363  1.028698  1.293741  1.007824  1.016919
mask_with_values_loc  1.196843  1.300228  1.000000  1.000000  1.038989  1.219233  1.037020  1.000000
query                 4.997304  4.765554  5.934096  4.500559  2.997924  2.397013  1.680447  1.398190
xs_label              4.124597  4.272363  5.596152  4.295331  4.676591  5.710680  6.032809  8.950255
mask_with_isin        1.674055  1.679935  1.847972  1.724183  1.345111  1.405231  1.253554  1.264760
mask_with_in1d        1.000000  1.083807  1.220493  1.101929  1.000000  1.000000  1.000000  1.144175

You'll notice that the fastest times seem to be shared between mask_with_values and mask_with_in1d.

res.T.plot(loglog=True)

[Figure: log-log plot of the relative timings in the table above, by data frame length]

Functions

def mask_standard(df):
    mask = df['A'] == 'foo'
    return df[mask]

def mask_standard_loc(df):
    mask = df['A'] == 'foo'
    return df.loc[mask]

def mask_with_values(df):
    mask = df['A'].values == 'foo'
    return df[mask]

def mask_with_values_loc(df):
    mask = df['A'].values == 'foo'
    return df.loc[mask]

def query(df):
    return df.query('A == "foo"')

def xs_label(df):
    return df.set_index('A', append=True, drop=False).xs('foo', level=-1)

def mask_with_isin(df):
    mask = df['A'].isin(['foo'])
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, ['foo'])
    return df[mask]

Testing

res = pd.DataFrame(
    index=[
        'mask_standard', 'mask_standard_loc', 'mask_with_values', 'mask_with_values_loc',
        'query', 'xs_label', 'mask_with_isin', 'mask_with_in1d'
    ],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

from timeit import timeit

for j in res.columns:
    # Repeat the 8-row example frame j times
    d = pd.concat([df] * j, ignore_index=True)
    for i in res.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        res.at[i, j] = timeit(stmt, setp, number=50)

Special Timing

Looking at the special case when we have a single non-object dtype for the entire data frame.

Code below

spec.div(spec.min())

                     10        30        100       300       1000      3000      10000     30000
mask_with_values  1.009030  1.000000  1.194276  1.000000  1.236892  1.095343  1.000000  1.000000
mask_with_in1d    1.104638  1.094524  1.156930  1.072094  1.000000  1.000000  1.040043  1.027100
reconstruct       1.000000  1.142838  1.000000  1.355440  1.650270  2.222181  2.294913  3.406735

Turns out, reconstruction isn't worth it past a few hundred rows.

spec.T.plot(loglog=True)

[Figure: log-log plot of the relative timings in the table above, by data frame length]

Functions

np.random.seed([3,1415])
d1 = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('ABCDE'))

# d1 holds integers, so compare against 7 (as in the earlier d1 example) rather than 'foo'
def mask_with_values(df):
    mask = df['A'].values == 7
    return df[mask]

def mask_with_in1d(df):
    mask = np.in1d(df['A'].values, [7])
    return df[mask]

def reconstruct(df):
    v = df.values
    mask = np.in1d(df['A'].values, [7])
    return pd.DataFrame(v[mask], df.index[mask], df.columns)

spec = pd.DataFrame(
    index=['mask_with_values', 'mask_with_in1d', 'reconstruct'],
    columns=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    dtype=float
)

Testing

from timeit import timeit

for j in spec.columns:
    # Repeat d1, the single-dtype frame, j times
    d = pd.concat([d1] * j, ignore_index=True)
    for i in spec.index:
        stmt = '{}(d)'.format(i)
        setp = 'from __main__ import d, {}'.format(i)
        spec.at[i, j] = timeit(stmt, setp, number=50)

tl;dr

The Pandas equivalent to

select * from table where column_name = some_value

is

table[table.column_name == some_value]

Multiple conditions:

table[(table.column_name == some_value) | (table.column_name2 == some_value2)]

or

table.query('column_name == some_value | column_name2 == some_value2')

Code example

import pandas as pd

# Create data set
d = {'foo':[100, 111, 222],
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)

# Full dataframe:
df

# Shows (current pandas keeps the dict's insertion order for columns):
#    foo  bar
# 0  100  333
# 1  111  444
# 2  222  555

# Output only the row(s) in df where foo is 222:
df[df.foo == 222]

# Shows:
#    foo  bar
# 2  222  555

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

df[(df.foo == 222) | (df.bar == 444)]
#    foo  bar
# 1  111  444
# 2  222  555

But at that point I would recommend using the query function, since it's less verbose and yields the same result:

df.query('foo == 222 | bar == 444')

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10

In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512

# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

exclude = ('red', 'orange')
df.query('color not in @exclude')
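The snippet above assumes a frame with a color column, which the random frame from the documentation excerpt does not have. A self-contained sketch with made-up data (the column names color and n are assumptions):

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({'color': ['red', 'blue', 'orange', 'green'],
                   'n': [1, 2, 3, 4]})

exclude = ('red', 'orange')

# @exclude refers to the Python variable, not to a column named exclude
kept = df.query('color not in @exclude')
print(kept)
```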

More flexibility using .query with pandas >= 0.25.0:

Since pandas 0.25.0 we can use the query method to filter dataframes with pandas methods and even column names that contain spaces. Normally the spaces in column names would give an error, but now we can solve that using a backtick (`); see GitHub:

# Example dataframe
df = pd.DataFrame({'Sender email':['ex@example.com', "reply@shop.com", "buy@shop.com"]})

     Sender email
0  ex@example.com
1  reply@shop.com
2    buy@shop.com

Using .query with the str.endswith method:

df.query('`Sender email`.str.endswith("@shop.com")')

Output

     Sender email
1  reply@shop.com
2    buy@shop.com

We can also use local variables by prefixing them with an @ in our query:

domain = 'shop.com'
df.query('`Sender email`.str.endswith(@domain)')

Output

     Sender email
1  reply@shop.com
2    buy@shop.com

For selecting only specific columns out of multiple columns for a given value in Pandas:

select col_name1, col_name2 from table where column_name = some_value.

Using loc:

df.loc[df['column_name'] == some_value, [col_name1, col_name2]]

or query:

df.query('column_name == some_value')[[col_name1, col_name2]]
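A concrete sketch of both options, with made-up data. Note one assumption: when some_value is a Python variable rather than a literal, the query string has to reference it as @some_value, otherwise query looks for a column with that name.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({'column_name': ['x', 'y', 'x'],
                   'col_name1': [1, 2, 3],
                   'col_name2': [4, 5, 6],
                   'col_name3': [7, 8, 9]})
some_value = 'x'

# Row filter and column selection in a single .loc call
via_loc = df.loc[df['column_name'] == some_value, ['col_name1', 'col_name2']]

# Row filter with query, then column selection
via_query = df.query('column_name == @some_value')[['col_name1', 'col_name2']]

print(via_loc)
```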

Faster results can be achieved using numpy.where.

For example, with unutbu's setup:

In [76]: df.iloc[np.where(df.A.values=='foo')]
Out[76]: 
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

Timing comparisons:

In [68]: %timeit df.iloc[np.where(df.A.values=='foo')]  # fastest
1000 loops, best of 3: 380 µs per loop

In [69]: %timeit df.loc[df['A'] == 'foo']
1000 loops, best of 3: 745 µs per loop

In [71]: %timeit df.loc[df['A'].isin(['foo'])]
1000 loops, best of 3: 562 µs per loop

In [72]: %timeit df[df.A=='foo']
1000 loops, best of 3: 796 µs per loop

In [74]: %timeit df.query('(A=="foo")')  # slowest
1000 loops, best of 3: 1.71 ms per loop

In newer versions of Pandas, inspired by the documentation (Viewing data):

df[df["column_name"] == some_value] # Scalar, True/False..

df[df["column_name"] == "some_value"] # String

Combine multiple conditions by putting each clause in parentheses, (), and combining them with & and | (and/or). Like this:

# column_name_2 is a placeholder for a second column
df[(df["column_name"] == "some_value1") & (df["column_name_2"] == "some_value2")]

Other filters

pandas.notna(df["column_name"]) == True # Not NaN
df['column_name'].str.contains("text") # Search for "text"
df['column_name'].str.lower().str.contains("text") # Search for "text", after converting to lowercase
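The filters above return boolean Series, which still need to be passed to df[...] to select rows. A hedged sketch with made-up data: str.contains yields a missing value for missing entries, so na=False is passed here to treat them as non-matches, and case=False is used as an alternative to lowercasing first.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({'column_name': ['plain text', None, 'other', 'TEXT here']})

# Rows where the value is not missing
not_missing = df[df['column_name'].notna()]

# Case-sensitive substring match; missing values count as no match
contains_text = df[df['column_name'].str.contains('text', na=False)]

# Case-insensitive substring match
contains_any_case = df[df['column_name'].str.contains('text', case=False, na=False)]

print(contains_any_case)
```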

Here is a simple example:

from pandas import DataFrame

# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)


# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111

print(mask)

# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool


# Select * FROM df WHERE Revenue = 111
df[mask]

# Result:
#    Revenue  Cost
# 1      111   444

To add to this famous question (though a bit late): you can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame in which the specified column has a particular value. E.g.,

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)

b_is_two_dataframe = df.groupby('B').get_group('two').reset_index(drop=True)
# NOTE: reset_index(drop=True) renumbers the rows from 0 and discards the old index instead of keeping it as a column
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Running this gives:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two

You can also use .apply:

df.apply(lambda row: row[df['B'].isin(['one','three'])])

With the default axis=0, apply passes each column (not each row, despite the lambda's parameter name) to the function, so the boolean mask filters every column down to the matching rows.

The output is

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

The result is the same as the boolean indexing mentioned by @unutbu:

df[df['B'].isin(['one','three'])]

If you need to query your dataframe repeatedly and speed is important to you, one option is to convert the dataframe to a dictionary; doing so can make queries thousands of times faster.

my_df = df.set_index(column_name)
my_dict = my_df.to_dict('index')

After making the my_dict dictionary you can look up a value:

if some_value in my_dict:
   my_result = my_dict[some_value]

If you have duplicated values in column_name you can't make a dictionary, but you can use:

my_result = my_df.loc[some_value]
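A small sketch of the dictionary approach, with a made-up frame (the column names key and val are placeholders). With orient 'index', to_dict maps each index label to a dictionary of that row's column values, and dict.get returns None for a missing key instead of raising:

```python
import pandas as pd

# Hypothetical data with unique keys, only for illustration
df = pd.DataFrame({'key': ['a', 'b', 'c'], 'val': [1, 2, 3]})

# Build the lookup table once
my_dict = df.set_index('key').to_dict('index')

# Repeated lookups are then plain dictionary accesses
found = my_dict.get('b')
missing = my_dict.get('zzz')

print(found)
print(missing)
```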

@unutbu, for your provided answer, shouldn't the following code work?

print(df.loc[(df['A'] == 'foo') & (df['B'] == 'one') & (df['D'] == 12)])

yield:

     A      B  C   D
6  foo    one  6  12 

SQL statements on DataFrames to select rows using DuckDB

With DuckDB we can query pandas DataFrames with SQL statements, in a highly performant way.

Since the question is "How do I select rows from a DataFrame based on column values?", and the example in the question is a SQL query, this answer seems a logical fit for this topic.

Example:

In [1]: import duckdb

In [2]: import pandas as pd

In [3]: con = duckdb.connect()

In [4]: df = pd.DataFrame({"A": range(11), "B": range(11, 22)})

In [5]: df
Out[5]:
     A   B
0    0  11
1    1  12
2    2  13
3    3  14
4    4  15
5    5  16
6    6  17
7    7  18
8    8  19
9    9  20
10  10  21

In [6]: results = con.execute("SELECT * FROM df where A > 2").df()

In [7]: results
Out[7]:
    A   B
0   3  14
1   4  15
2   5  16
3   6  17
4   7  18
5   8  19
6   9  20
7  10  21

Great answers. However, when the size of the dataframe approaches a million rows, many of the methods tend to take ages when using df[df['col']==val]. I wanted to have all possible values of "another_column" that correspond to specific values in "some_column" (in this case in a dictionary). This worked and was fast.

import datetime

s=datetime.datetime.now()

my_dict={}

for i, my_key in enumerate(df['some_column'].values): 
    if i%100==0:
        print(i)  # to see the progress
    if my_key not in my_dict.keys():
        my_dict[my_key]={}
        my_dict[my_key]['values']=[df.iloc[i]['another_column']]
    else:
        my_dict[my_key]['values'].append(df.iloc[i]['another_column'])
        
e=datetime.datetime.now()

print('operation took '+str(e-s)+' seconds')
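For comparison, the same mapping can probably be built without a Python-level loop by grouping on the key column. This is a sketch under assumptions (made-up data, column names taken from the answer above), not a measured replacement for the loop:

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({'some_column': ['a', 'b', 'a', 'c', 'b'],
                   'another_column': [1, 2, 3, 4, 5]})

# Collect the values of another_column for each distinct key
grouped = df.groupby('some_column')['another_column'].agg(list)

# Same shape as the dictionary built by the loop above
my_dict = {key: {'values': values} for key, values in grouped.items()}

print(my_dict)
```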

You can use loc (square brackets) with a function:

# Series
s = pd.Series([1, 2, 3, 4]) 
s.loc[lambda x: x > 1]
# s[lambda x: x > 1]

Output:

1    2
2    3
3    4
dtype: int64

or

# DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
df.loc[lambda x: x['A'] > 1]
# df[lambda x: x['A'] > 1]

Output:

   A   B
1  2  20
2  3  30

The advantage of this method is that you can chain selection with previous operations. For example:

df.mul(2).loc[lambda x: x['A'] > 3, 'B']
# (df * 2).loc[lambda x: x['A'] > 3, 'B']

vs

df_temp = df * 2
df_temp.loc[df_temp['A'] > 3, 'B']

Output:

1    40
2    60
Name: B, dtype: int64

1. Install numexpr to speed up query() calls

The pandas documentation recommends installing numexpr to speed up numeric calculation when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.

For larger dataframes (where performance actually matters), df.query() with the numexpr engine performs much faster than df[mask]. In particular, it performs better in the following cases.

Logical and/or comparison operators on columns of strings

If a column of strings is compared to some other string(s) and matching rows are to be selected, even for a single comparison operation, query() performs faster than df[mask]. For example, for a dataframe with 80k rows, it's 30% faster [1] and for a dataframe with 800k rows, it's 60% faster. [2]

df[df.A == 'foo']
df.query("A == 'foo'")  # <--- performs 30%-60% faster

This gap increases as the number of operations increases (if 4 comparisons are chained, df.query() is roughly 1.8-2.3 times faster than df[mask]) [1][2] and/or the dataframe length increases. [2]

Multiple operations on numeric columns

If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's 20% faster [1] and for a frame with 800k rows, it's 2 times faster. [2]

df[(df.B % 5) **2 < 0.1]
df.query("(B % 5) **2 < 0.1")  # <--- performs 20%-100% faster.

This gap in performance increases as the number of operations increases and/or the dataframe length increases. [2]

The following plot shows how the methods perform as the dataframe length increases. [3]

[Figure: performance plot]

2. Access .values to call pandas methods inside query()

Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).

For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) and accessing the values attribute on the resulting Series makes numexpr evaluate its underlying NumPy array, and query() works. Setting the engine parameter to 'python' also works.

df.query('B.floordiv(2).values <= 3')  # or 
df.query('B.floordiv(2).le(3).values') # or
df.query('B.floordiv(2).le(3)', engine='python')

The same applies to Erfan's suggested method calls as well. The code in their answer raises a TypeError as is (as of Pandas 1.3.4) with the numexpr engine, but accessing the .values attribute makes it work.

df.query('`Sender email`.str.endswith("@shop.com")')         # <--- TypeError
df.query('`Sender email`.str.endswith("@shop.com").values')  # OK


[1]: Benchmark code using a frame with 80k rows

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000, 
                   'B': np.random.rand(80000)})

%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5) **2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

[2]: Benchmark code using a frame with 800k rows

df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000, 
                   'B': np.random.rand(800000)})

%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5) **2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5) **2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

[3]: Code used to produce the performance graphs of the two methods for strings and numbers.

from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
    setup=constructor,
    kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
    labels= ['df[(df.B % 5) **2 < 0.1]', 'df.query("(B % 5) **2 < 0.1")'],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Multiple mathematical operations on numbers',
    equality_check=pd.DataFrame.equals);
plot(
    setup=constructor,
    kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
    labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Comparison operation on strings',
    equality_check=pd.DataFrame.equals);
