Change Cells in Pandas DataFrame Based on Conditional Slices

I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/Null values of the Age column with the median Age of the corresponding Pclass.

Here is some data:

train

PassengerId Pclass  Age
0   1   3   22
1   2   1   35
2   3   3   26
3   4   1   35
4   5   3   35
5   6   1   NaN
6   7   1   54
7   8   3   2
8   9   3   27
9   10  2   14
10  11  1   NaN

Here is what I would like to end up with:

PassengerId Pclass  Age
0   1   3   22
1   2   1   35
2   3   3   26
3   4   1   35
4   5   3   35
5   6   1   35
6   7   1   54
7   8   3   2
8   9   3   27
9   10  2   14
10  11  1   35

The first thing I came up with is this. In the interest of brevity, I have only included one slice, for Pclass equal to 1, rather than also including 2 and 3:

Pclass_1 = train['Pclass']==1

train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)

As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how this is different from a copy, or if they are analogous in terms of memory; that is an aside I would love to hear about if possible). I particularly like this Q/A on the topic, View vs Copy, How Do I Tell?, but it doesn't include the insight I'm looking for.
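As a rough illustration of the copy behaviour in question, here is a minimal sketch on a toy three-row frame (assumed data, not the real Titanic set). Selecting rows with a boolean mask builds a new object with its own data, so filling that selection never reaches train.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (assumed data, for illustration only).
train = pd.DataFrame({"Pclass": [1, 1, 3], "Age": [35.0, np.nan, 22.0]})

# Boolean-mask selection builds a new DataFrame with its own data.
subset = train[train["Pclass"] == 1]
shares = np.shares_memory(train["Age"].to_numpy(), subset["Age"].to_numpy())
print(shares)

# Filling the selection returns yet another object; train keeps its NaN.
filled = subset["Age"].fillna(subset["Age"].median())
missing_in_filled = int(filled.isna().sum())
missing_in_train = int(train["Age"].isna().sum())
print(missing_in_filled, missing_in_train)
```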

Looking through the Pandas docs, I learned why you want to use .loc to avoid this pitfall. However, I just can't seem to get the syntax right.

Pclass_1 = train.loc[:,['Pclass']==1]

Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)

I'm getting lost in indices. This one ends up looking for a column named False, which obviously doesn't exist. I don't know how to do this without chained indexing. train.loc[:,train['Pclass']==1] raises an exception: IndexingError: Unalignable boolean Series key provided.

In this part of the line,

train.loc[:,['Pclass']==1]

the part ['Pclass'] == 1 compares the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:,False], which causes the error.
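The difference is easy to check in isolation. This is a minimal sketch with a made-up three-row frame: comparing a plain list to 1 yields a single False, while comparing the column itself yields one boolean per row.

```python
import pandas as pd

# Hypothetical three-row frame, just to show the two comparisons.
train = pd.DataFrame({"Pclass": [3, 1, 3]})

# A plain Python list is never equal to an integer: a single False.
key = ['Pclass'] == 1
print(key)

# Comparing the column (a Series) gives an element-wise boolean mask.
mask = train['Pclass'] == 1
print(mask.tolist())
```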

I think you mean:

train.loc[train['Pclass']==1]

which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
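One way to avoid the warning for a single class is to pass the row mask and the column label to .loc in one indexing call and assign through it. The sketch below is only an illustration for Pclass 1; the DataFrame literal and the names is_first_class and first_class_median are assumptions introduced here, not part of the original code.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand.
train = pd.DataFrame({
    "PassengerId": range(1, 12),
    "Pclass": [3, 1, 3, 1, 3, 1, 1, 3, 3, 2, 1],
    "Age": [22, 35, 26, 35, 35, np.nan, 54, 2, 27, 14, np.nan],
})

is_first_class = train["Pclass"] == 1

# Median Age of first-class passengers (NaN values are skipped).
first_class_median = train.loc[is_first_class, "Age"].median()

# Row mask and column label in a single .loc call: no chained indexing.
train.loc[is_first_class & train["Age"].isna(), "Age"] = first_class_median
print(train["Age"].tolist())
```

The same two lines would have to be repeated for Pclass 2 and 3, which is why a grouped approach scales better.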

EDIT 1

(old code removed)

Here is an approach that uses groupby with transform to create a Series containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. Using this approach will correct all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe.

import pandas as pd
from io import StringIO

tbl = """PassengerId Pclass  Age
0   1   3   22
1   2   1   35
2   3   3   26
3   4   1   35
4   5   3   35
5   6   1
6   7   1   54
7   8   3   2
8   9   3   27
9   10  2   14
10  11  1
"""

# Raw string so that \s is not treated as a string escape sequence
train = pd.read_table(StringIO(tbl), sep=r'\s+')
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)

The code produces:

 Original:
     PassengerId  Pclass   Age
0             1       3  22.0
1             2       1  35.0
2             3       3  26.0
3             4       1  35.0
4             5       3  35.0
5             6       1   NaN
6             7       1  54.0
7             8       3   2.0
8             9       3  27.0
9            10       2  14.0
10           11       1   NaN

NaNs replaced with median:
     PassengerId  Pclass   Age
0             1       3  22.0
1             2       1  35.0
2             3       3  26.0
3             4       1  35.0
4             5       3  35.0
5             6       1  35.0
6             7       1  54.0
7             8       3   2.0
8             9       3  27.0
9            10       2  14.0
10           11       1  35.0

One thing to note is that this line, which uses inplace=True:

train['Age'].fillna(median_age, inplace=True)

can be replaced with an assignment using .loc:

train.loc[:,'Age'] = train['Age'].fillna(median_age)
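Plain column assignment should work equally well, and as far as I know recent pandas releases warn about calling an inplace method on a selected column such as train['Age'], so the assignment form is probably the safer habit. The sketch below rebuilds the sample frame by hand (the DataFrame literal is an assumption standing in for the parsed table) and also prints what transform returns: one median per row, aligned with the original index.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand.
train = pd.DataFrame({
    "PassengerId": range(1, 12),
    "Pclass": [3, 1, 3, 1, 3, 1, 1, 3, 3, 2, 1],
    "Age": [22, 35, 26, 35, 35, np.nan, 54, 2, 27, 14, np.nan],
})

# One value per row: the median Age of that row's Pclass.
median_age = train.groupby("Pclass")["Age"].transform("median")
print(median_age.tolist())

# Assign the filled column back instead of relying on inplace=True.
train["Age"] = train["Age"].fillna(median_age)
print(train["Age"].tolist())
```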
