Modifying a subset of rows in a pandas DataFrame

Question

Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?

I tried the following:

df['A'==0]['B'] = np.nan

and

df['A'==0]['B'].values.fill(np.nan)

without success.

Answer 1

Use .loc for label-based indexing:

df.loc[df.A==0, 'B'] = np.nan

The df.A==0 expression creates a boolean Series that indexes the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:

df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2

I don't know enough about pandas internals to know exactly why that works, but the basic issue is that indexing into a DataFrame sometimes returns a copy of the result and sometimes returns a view on the original object. According to the pandas documentation, this behavior depends on the underlying NumPy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
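A minimal sketch of why the attempts in the question fail and why the single-step .loc assignment works. The sample data is made up for illustration, and column B is created as float so that it can hold NaN without a dtype change.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the question); B is float so it can hold NaN.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# 'A' == 0 compares a string with an integer, so it is simply False.
# df['A' == 0] is therefore df[False]: a lookup of a column named False,
# not a row filter.
key = ("A" == 0)

# Build the row mask from the column itself, then select rows and column
# in a single .loc call so the assignment writes into df directly.
mask = df["A"] == 0
df.loc[mask, "B"] = np.nan

print(df)
```

The mask must be a boolean Series (or array) with one entry per row; the comparison has to be applied to the column, not to the column's name.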

Answer 2

This is covered in the pandas docs on advanced indexing.

That section explains exactly what you need. It turns out that df.loc (.ix has been deprecated, as many have pointed out below) can be used for slicing and dicing a DataFrame, and it can also be used to set things.

df.loc[selection criteria, columns I want] = value

So Bren's answer is saying: "find all the places where df.A == 0, select column B, and set it to np.nan".
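The pattern above also extends to compound criteria and several columns at once. The following is a sketch under assumed data: the extra column C and the threshold 3 are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: C and the threshold below are illustrative only.
df = pd.DataFrame({
    "A": [0, 1, 0, 2],
    "B": [2.0, 0.0, 5.0, 7.0],
    "C": [1.0, 1.0, 1.0, 1.0],
})

# Selection criteria: any boolean Series aligned with df's index.
# Combine conditions with & or |, wrapping each one in parentheses.
criteria = (df["A"] == 0) & (df["B"] > 3)

# Columns: a single label or a list of labels.
df.loc[criteria, ["B", "C"]] = np.nan

print(df)
```

Only the row where both conditions hold is changed; every other row keeps its values.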

Answer 3

Starting from pandas 0.20, ix is deprecated. The right way is to use df.loc.

Here is a working example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

Explanation:

As explained in the docs, .loc is primarily label based, but may also be used with a boolean array.

So, what we are doing above is applying df.loc[row_index, column_index] by:

Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
Exploiting the fact that loc is also label based to select the column using the label 'B' in column_index

We can use a logical condition, or any operation that returns a Series of booleans, to construct the array of booleans. In the above example, we want any rows where A is 0; for that we can use df.A == 0, which returns a Series of booleans, as the example below shows.

>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0
0     True
1    False
2     True
Name: A, dtype: bool

Then, we use the above array of booleans to select and modify the necessary rows:

>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN

For more information, see the advanced indexing documentation.

Answer 4

For a massive speed increase, use NumPy's where function.

Setup

Create a two-column DataFrame with 100,000 rows and some zeros.

df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))

Fast solution with `numpy.where`

df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)

Timings

%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this measurement, NumPy's where is roughly 4.5x faster.
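Before swapping one approach for the other, it may be worth confirming that both produce the same frame. The sketch below does that on a smaller frame; the seed, the row count, and the cast to float are assumptions made here for reproducibility and to avoid a dtype change on assignment. It does not reproduce the timings, which depend on the pandas version and hardware.

```python
import numpy as np
import pandas as pd

# Assumptions for illustration: arbitrary seed, 1,000 rows, float columns.
rng = np.random.default_rng(0)
base = pd.DataFrame(
    rng.integers(0, 3, size=(1000, 2)), columns=list("ab")
).astype(float)

# Approach 1: boolean mask with .loc
via_loc = base.copy()
via_loc.loc[via_loc["a"] == 0, "b"] = np.nan

# Approach 2: numpy.where on the underlying arrays
via_where = base.copy()
via_where["b"] = np.where(
    via_where["a"].to_numpy() == 0, np.nan, via_where["b"].to_numpy()
)

# DataFrame.equals treats NaNs in the same positions as equal.
same = via_loc.equals(via_where)
print(same)
```

Note that np.where builds and assigns an entire new column, whereas .loc only writes the selected rows.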

Answer 5

To replace multiple columns, convert to a NumPy array using .values:

df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
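A small worked example of the multi-column form, using made-up data with a third column C. It uses .to_numpy(), which returns the same kind of plain array as .values.

```python
import pandas as pd

# Illustrative data; column C is assumed for the example.
df = pd.DataFrame({
    "A": [0, 1, 0],
    "B": [2.0, 4.0, 6.0],
    "C": [10.0, 20.0, 30.0],
})

mask = df["A"] == 0
cols = ["B", "C"]

# Passing a plain array means the values are assigned by position
# instead of being aligned on row and column labels.
df.loc[mask, cols] = df.loc[mask, cols].to_numpy() / 2

print(df)
```

The array on the right-hand side has to match the shape of the selection (here two rows by two columns).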

Answer 6

Alternatives:

Filter (note: the filter comes after the column being written to, not before). This is chained assignment, which does not modify df when pandas copy-on-write is enabled, so prefer the loc form:

df.B[df.A==0] = np.nan

loc

df.loc[df.A == 0, 'B'] = np.nan

numpy where

import numpy as np
df.B = np.where(df.A== 0, np.nan, df.B)

Answer 7

To modify a DataFrame in pandas you can use "syntactic sugar" operators such as +=, *=, /=, etc. So instead of:

df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2

You can write:

df.loc[df.A == 0, 'B'] /= 2
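As a quick check of the shorthand, here is a sketch on made-up data. Column B is float from the start, since halving integers would otherwise have to change the column's dtype.

```python
import pandas as pd

# Illustrative data; B is float so the division keeps the dtype.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 3.0, 5.0]})

# Reads the selected values, divides them, and writes them back
# through the same .loc selection.
df.loc[df["A"] == 0, "B"] /= 2

print(df)
```

Rows where A is not 0 are left untouched.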

To replace values with NaN you can use the pandas method where. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})

   A  B
0  1  0
1  2  0
2  3  4

df['A'] = df['A'].where(df['B'] != 0)  # other=np.nan by default

Result:

     A  B
0  NaN  0
1  NaN  0
2  3.0  4
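Applied to the original question (B should be NaN wherever A is 0), the same idea might look like the sketch below. The data is illustrative; Series.mask is the inverse of where, so the condition can be written either way.

```python
import pandas as pd

# Illustrative data matching the shape of the question.
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# where keeps values where the condition is True and puts NaN elsewhere.
kept = df["B"].where(df["A"] != 0)

# mask is the inverse: it puts NaN where the condition is True.
masked = df["B"].mask(df["A"] == 0)

# Both return a new Series; assign it back to update the DataFrame.
df["B"] = kept

print(df)
```

Assigning the result back to the column avoids relying on inplace=True on a column selection.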
