Modifying a subset of rows in a pandas dataframe
Assume I have a pandas DataFrame with two columns, A and B. I'd like to modify this DataFrame (or create a copy) so that B is always NaN whenever A is 0. How would I achieve that?
I tried the following:
df['A'==0]['B'] = np.nan
and
df['A'==0]['B'].values.fill(np.nan)
without success.
Use .loc for label-based indexing:
df.loc[df.A==0, 'B'] = np.nan
The df.A==0 expression creates a boolean series that indexes the rows, and 'B' selects the column. You can also use this to transform a subset of a column, e.g.:
df.loc[df.A==0, 'B'] = df.loc[df.A==0, 'B'] / 2
I don't know enough about pandas internals to know exactly why that works, but the basic issue is that sometimes indexing into a DataFrame returns a copy of the result, and sometimes it returns a view on the original object. According to the documentation, this behavior depends on the underlying numpy behavior. I've found that accessing everything in one operation (rather than [one][two]) is more likely to work for setting.
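A minimal sketch of the one-operation approach, assuming a small made-up frame (the data and the names plain_bool and mask are illustrative, not from the question). It also shows why the question's 'A'==0 cannot work: Python evaluates that comparison of a string with an integer before pandas ever sees it.

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the question
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})

# 'A' == 0 compares a str with an int, so it is just the bool False;
# df['A' == 0] therefore never builds a row mask from column A.
plain_bool = ("A" == 0)

# Comparing the column itself gives the boolean mask we actually want.
mask = df["A"] == 0

# One indexing operation: rows and column are chosen together,
# so the assignment is applied to df itself rather than to a temporary.
df.loc[mask, "B"] = np.nan
```

After this runs, B is NaN in the two rows where A is 0 and untouched elsewhere.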
This is from the pandas docs on advanced indexing:
The section will explain exactly what you need! It turns out df.loc (as .ix has been deprecated, as many have pointed out below) can be used for cool slicing/dicing of a dataframe. And it can also be used to set things.
df.loc[selection criteria, columns I want] = value
So Bren's answer is saying 'find me all the places where df.A == 0, select column B and set it to np.nan'.
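As a rough illustration of that pattern, the same [rows, columns] expression can be used both to read and to write. The frame and the name selected below are assumptions made up for the sketch.

```python
import numpy as np
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [0, 1, 0, 2], "B": [2.0, 0.0, 5.0, 7.0]})

# Reading: df.loc[selection criteria, column] returns the matching values of B
selected = df.loc[df["A"] == 0, "B"]

# Writing: put the same expression on the left-hand side of an assignment
df.loc[df["A"] == 0, "B"] = np.nan
```

Here selected holds the original values of B from rows 0 and 2, and df afterwards has NaN in exactly those rows.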
Starting from pandas 0.20, ix is deprecated. The right way is to use df.loc.
Here is a working example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
>>>
As explained in the docs, .loc is primarily label based, but may also be used with a boolean array.
So, what we are doing above is applying df.loc[row_index, column_index] by:
Exploiting the fact that loc can take a boolean array as a mask that tells pandas which subset of rows we want to change in row_index
Exploiting the fact that loc is also label based to select the column using the label 'B' in the column_index
We can use a logical expression, a condition, or any operation that returns a series of booleans to construct the array of booleans. In the above example, we want any rows where A is 0; for that we can use df.A == 0, which, as you can see in the example below, returns a series of booleans.
>>> df = pd.DataFrame({"A":[0,1,0], "B":[2,0,5]}, columns=list('AB'))
>>> df
   A  B
0  0  2
1  1  0
2  0  5
>>> df.A == 0
0     True
1    False
2     True
Name: A, dtype: bool
>>>
Then, we use the above array of booleans to select and modify the necessary rows:
>>> df.loc[df.A == 0, 'B'] = np.nan
>>> df
   A    B
0  0  NaN
1  1  0.0
2  0  NaN
For more information, see the pandas advanced indexing documentation.
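To illustrate the point about building the boolean array from any operation that returns booleans, here is a small sketch that combines two conditions and also uses isin(). The data and the names mask and mask_isin are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [0, 1, 0, 3], "B": [2.0, 0.0, 5.0, 9.0]})

# Combine conditions with & (and), | (or), ~ (not). Each comparison
# needs its own parentheses because & and | bind tighter than == and >.
mask = (df["A"] == 0) & (df["B"] > 3)

# isin() is another operation that returns a boolean Series
mask_isin = df["A"].isin([1, 3])

# Only the row where A is 0 and B is greater than 3 is changed
df.loc[mask, "B"] = np.nan
```

Any of these masks can go in the row position of df.loc, exactly like df.A == 0 above.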
For a massive speed increase, use NumPy's where function.
Create a two-column DataFrame with 100,000 rows with some zeros.
df = pd.DataFrame(np.random.randint(0,3, (100000,2)), columns=list('ab'))
Fast solution with numpy.where
df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
%timeit df['b'] = np.where(df.a.values == 0, np.nan, df.b.values)
685 µs ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.loc[df['a'] == 0, 'b'] = np.nan
3.11 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
NumPy's where is about 4x faster.
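For anyone without IPython's %timeit, the comparison can be sketched with the standard timeit module. The helper names with_where and with_loc, the fixed seed, and the repeat count are assumptions for this sketch. Each helper works on a copy, so the timings include the cost of copying and will not match the figures above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
base = pd.DataFrame(rng.integers(0, 3, (100_000, 2)), columns=list("ab"))


def with_where(df):
    """Hypothetical helper: NaN out b where a == 0 using np.where."""
    out = df.copy()
    out["b"] = np.where(out["a"].values == 0, np.nan, out["b"].values)
    return out


def with_loc(df):
    """Hypothetical helper: same result using a boolean mask with .loc."""
    out = df.copy()
    out["b"] = out["b"].astype(float)  # make room for NaN up front
    out.loc[out["a"] == 0, "b"] = np.nan
    return out


# Both approaches should produce the same frame
same = with_where(base).equals(with_loc(base))

# Rough timings; absolute numbers depend on machine and library versions
t_where = timeit.timeit(lambda: with_where(base), number=20)
t_loc = timeit.timeit(lambda: with_loc(base), number=20)
print(f"np.where: {t_where:.4f}s  .loc: {t_loc:.4f}s")
```

The useful part is the equality check: whichever is faster on a given setup, the two approaches are interchangeable for this task.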
To replace multiple columns, convert to a NumPy array using .values:
df.loc[df.A==0, ['B', 'C']] = df.loc[df.A==0, ['B', 'C']].values / 2
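A small worked version of the multi-column case, with a made-up three-column frame (the column C and the names mask and cols are assumptions for the sketch):

```python
import pandas as pd

# Illustrative data with a third column C
df = pd.DataFrame({
    "A": [0, 1, 0],
    "B": [2.0, 4.0, 6.0],
    "C": [10.0, 20.0, 30.0],
})

mask = df["A"] == 0
cols = ["B", "C"]

# .values hands pandas a plain NumPy array, so the halved numbers are
# written back by position rather than being re-aligned on labels.
df.loc[mask, cols] = df.loc[mask, cols].values / 2
```

Rows 0 and 2 have B and C halved; row 1 is left as it was.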
Alternatives:
df.B[df.A==0] = np.nan  # chained indexing: may raise SettingWithCopyWarning and may not modify df
df.loc[df.A == 0, 'B'] = np.nan
import numpy as np
df.B = np.where(df.A== 0, np.nan, df.B)
To modify a DataFrame in Pandas you can use "syntactic sugar" operators like +=, *=, /= etc. So instead of:
df.loc[df.A == 0, 'B'] = df.loc[df.A == 0, 'B'] / 2
You can write:
df.loc[df.A == 0, 'B'] /= 2
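A quick sketch of the augmented operators on a .loc selection, using made-up data:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 3.0, 8.0]})

# Halve B where A is 0
df.loc[df["A"] == 0, "B"] /= 2

# Add 10 to B where A is 1
df.loc[df["A"] == 1, "B"] += 10
```

Python expands each statement into a read through df.loc followed by a write through df.loc with the same key, so only the selected rows change.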
To replace values with NaN you can use the Pandas method where. For example:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 0, 4]})
   A  B
0  1  0
1  2  0
2  3  4
df['A'] = df['A'].where(df['B'] != 0) # other=np.nan by default
Result:
     A  B
0  NaN  0
1  NaN  0
2  3.0  4
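Applied to the original question (B should be NaN wherever A is 0), a sketch with where and its inverse, mask, might look like this. The frames df and df2 are made up, and the result is assigned back to the column rather than relying on inplace.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"A": [0, 1, 0], "B": [2.0, 0.0, 5.0]})
df2 = df.copy()

# where() keeps values where the condition is True and puts NaN elsewhere
df["B"] = df["B"].where(df["A"] != 0)

# mask() is the inverse: it puts NaN where the condition is True
df2["B"] = df2["B"].mask(df2["A"] == 0)
```

Both frames end up with NaN in B for the rows where A is 0, so the choice between where and mask is mostly about which condition reads more naturally.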