Modifying a subset of rows in a pandas dataframe

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following:

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Use .loc for label-based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that indexes the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to the documentation, this behavior depends on the underlying NumPy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
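To make the difference concrete, here is a minimal sketch (the three-row frame is made up for illustration, and B is stored as floats so that it can hold NaN without a dtype change). It shows why the expression in the question is not a row filter at all, and that a single .loc call writes to the original frame.

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the original question.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# 'A' == 0 is a plain Python comparison of a string with an integer, so it is
# simply False; df['A' == 0] therefore means df[False], not "rows where A is 0".
key = ("A" == 0)

# One indexing operation: the boolean mask picks the rows, 'B' picks the column.
df.loc[df["A"] == 0, "B"] = np.nan

nan_rows = df["B"].isna().tolist()
print(key, nan_rows)
```

Because the row selection and the column selection happen in the same call, there is no intermediate object that might be a copy.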

This is from the pandas docs on advanced indexing:

The section will explain exactly what you need! It turns out df.loc (as .ix has been deprecated, as many have pointed out below) can be used for cool slicing and dicing of a dataframe. And it can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'.

Starting from pandas 0.20, ix is deprecated. The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd 
>>> import numpy as np 
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
>>> 

Explanation:

As explained in the documentation, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

  • Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
  • Exploiting the fact that loc is also label based, to select the column using the label 'B' in column_index

We can use a logical expression, a condition, or any operation that returns a Series of booleans to construct the array of booleans. In the above example, we want the rows where A is 0, so we can use df.A == 0; as you can see in the example below, this returns a Series of booleans.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df 
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0 
0     True
1    False
2     True
Name: A, dtype: bool
>>> 

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A   B
0  0 NaN
1  1   0
2  0 NaN
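Since any boolean Series works as the row mask, conditions can also be combined. The following is a small sketch under assumed data (the extra row and the B > 3 condition are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data; B is float so NaN fits without changing the dtype.
df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [2.0, 0.0, 5.0, 7.0]})

# Combine conditions with & (and), | (or) and ~ (not). Each comparison needs
# its own parentheses because these operators bind tighter than == and >.
mask = (df["A"] == 0) & (df["B"] > 3)

# Only the rows where both conditions hold are modified.
df.loc[mask, "B"] = np.nan
print(df)
```

Only the third row satisfies both conditions here, so only that value of B is replaced.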

For more information, check the advanced indexing documentation.

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows with some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with numpy.where

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

NumPy's where is about 4.5x faster in this measurement (685 µs vs. 3.11 ms).
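Timings like these depend on the machine and the pandas version, so it is worth re-checking them locally. Below is a rough sketch that first confirms both approaches produce the same frame and then times them with the standard timeit module. The helper names with_where and with_loc are hypothetical, the data is converted to float up front, and each timed call includes a copy of the frame, so the numbers will not match the %timeit figures above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = pd.DataFrame(rng.integers(0, 3, (100_000, 2)), columns=list("ab")).astype(float)

def with_where(frame):
    # Hypothetical helper: rebuild the whole column with numpy.where.
    out = frame.copy()
    out["b"] = np.where(out["a"].to_numpy() == 0, np.nan, out["b"].to_numpy())
    return out

def with_loc(frame):
    # Hypothetical helper: overwrite only the masked rows with .loc.
    out = frame.copy()
    out.loc[out["a"] == 0, "b"] = np.nan
    return out

# DataFrame.equals treats NaNs in the same positions as equal.
same = with_where(base).equals(with_loc(base))

t_where = timeit.timeit(lambda: with_where(base), number=20)
t_loc = timeit.timeit(lambda: with_loc(base), number=20)
print(f"same result: {same}")
print(f"np.where: {t_where:.4f}s  .loc: {t_loc:.4f}s  (20 runs each, copy included)")
```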

To replace multiple columns, convert to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
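A runnable version of the multi-column case might look like this. The frame and its column C are assumptions made for illustration, and .to_numpy() is used as the newer spelling of .values:

```python
import pandas as pd

# Illustrative data with a second value column C.
df = pd.DataFrame({
    "A": [0, 1, 0],
    "B": [2.0, 4.0, 6.0],
    "C": [10.0, 20.0, 30.0],
})

mask = df["A"] == 0

# The right-hand side is a plain 2-D array, so it is assigned by position
# into the selected rows and columns rather than aligned by label.
df.loc[mask, ["B", "C"]] = df.loc[mask, ["B", "C"]].to_numpy() / 2
print(df)
```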

Alternatives:

  1. Filter (note: the filter comes after the column being written to, not before; this is chained assignment, which pandas warns about and which does not update df when Copy-on-Write is enabled)
df.B[df.A==0] = np.nan
  2. loc
df.loc[df.A == 0, 'B'] = np.nan
  3. numpy where
import numpy as np
df.B = np.where(df.A== 0, np.nan, df.B)

To modify a DataFrame in pandas you can use "syntactic sugar" operators like +=, *=, /=, etc. So instead of:

df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2

You can write:

df.loc[df.A == 0, 'B'] /= 2
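As a quick check on made-up data, the augmented form reads the selected values, applies the operation, and writes the result back through the same .loc selection:

```python
import pandas as pd

# Illustrative data; B is float so that division keeps the dtype unchanged.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 4.0, 6.0]})

# Halve B only in the rows where A is 0.
df.loc[df["A"] == 0, "B"] /= 2

# The same pattern works for the other operators, for example +=.
df.loc[df["A"] == 1, "B"] += 10
print(df)
```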

To replace values with NaN you can use the pandas method where. For example:

df  = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

   A  B
0  1  0
1  2  0
2  3  4

df['A'] = df['A'].where(df['B'] != 0) # other=np.nan by default

Result:

     A  B
0  NaN  0
1  NaN  0
2  3.0  4
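Applied to the original question (B should be NaN wherever A is 0), the same idea can be sketched with Series.mask, which is the inverse of where: it replaces values where the condition is True. The frame below is illustrative.

```python
import pandas as pd

# Illustrative data matching the shape of the question.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# mask() replaces values where the condition is True (NaN by default) ...
masked = df["B"].mask(df["A"] == 0)

# ... while where() keeps values where the condition is True.
kept = df["B"].where(df["A"] != 0)

# Both return a new Series, so assign the result back to the column.
df["B"] = masked
print(df)
```

Assigning the returned Series back to the column avoids relying on inplace=True on a selected column.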
