How to modify column values in a pd.DataFrame

Question

Background: I want to modify the values in a DataFrame column so that only the top 20 sports are kept and all other sports are displayed as "Others". The new column starts as a copy of an existing column:

athlete_events['Sport_modified'] = athlete_events['Sport']

The filter containing the top 20 sport names is generated like this:

top20_sport = athlete_events['Sport'].value_counts().head(20).index

I tried to modify the values in two ways. Method 1:

def classify_sports(cols, filters):
    for i in cols:
        if i in filters:
            pass
        else:
            i = 'Others'

classify_sports(athlete_events.Sport_modified, top20_sport)

Method 2:

athlete_events.Sport_modified.apply(lambda x : x if x in top20_sport else 'Others')

However, neither of the two methods above works. The only approach I could get working is this code:

athlete_events.loc[
(athlete_events['Sport'] !='Athletics')&
(athlete_events['Sport'] !='Gymnastics')&
(athlete_events['Sport'] !='Swimming')&
(athlete_events['Sport'] !='Shooting')&
(athlete_events['Sport'] !='Cycling')&
(athlete_events['Sport'] !='Fencing')&
(athlete_events['Sport'] !='Rowing')&
(athlete_events['Sport'] !='Cross Country Skiing')&
(athlete_events['Sport'] !='Alpine Skiing')&
(athlete_events['Sport'] !='Wrestling')&
(athlete_events['Sport'] !='Football')&
(athlete_events['Sport'] !='Sailing')&
(athlete_events['Sport'] !='Equestrianism')&
(athlete_events['Sport'] !='Canoeing')&
(athlete_events['Sport'] !='Boxing')&
(athlete_events['Sport'] !='Speed Skating')&
(athlete_events['Sport'] !='Ice Hockey')&
(athlete_events['Sport'] !='Hockey')&
(athlete_events['Sport'] !='Biathlon')&
(athlete_events['Sport'] !='Basketball')
,'Sport_modified'] = 'Others'

What is wrong with the two methods above? Thanks for your help.

Answer 1

Your first method will never work: assigning 'Others' to the loop variable i only rebinds a local name and leaves the Series untouched, and the function neither returns a Series nor returns anything for a row-wise calculation.
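To make the point about Method 1 concrete, here is a minimal sketch on a tiny made-up Series. The names s and keep are hypothetical stand-ins for athlete_events.Sport_modified and top20_sport; the function body follows the one in the question.

```python
import pandas as pd

# Hypothetical stand-ins for the question's column and top-20 filter.
s = pd.Series(["Athletics", "Curling", "Swimming"])
keep = {"Athletics", "Swimming"}

def classify_sports(cols, filters):
    for i in cols:
        if i not in filters:
            # This rebinds the local name i only; the Series is not modified.
            i = "Others"

result = classify_sports(s, keep)  # the function has no return statement
unchanged = s.tolist()             # still contains "Curling"
```

Here result is None and s keeps its original values, which matches what the question observed.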

Your second method does not operate in place: apply returns a new Series, so you need to assign the result back to a column. For instance:

df['sport_modified'] = df['sport'].apply(lambda x : x if x in top20_sport else 'Others')
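As a small illustration of why the assignment matters, the sketch below runs apply on a made-up four-row frame. The data and the name top_sports are assumptions for the example; only the apply call mirrors the answer.

```python
import pandas as pd

# Made-up data; top_sports plays the role of top20_sport (here: top 1).
df = pd.DataFrame({"sport": ["Athletics", "Curling", "Swimming", "Athletics"]})
top_sports = df["sport"].value_counts().head(1).index

# apply builds and returns a new Series; df itself is not changed by this call.
returned = df["sport"].apply(lambda x: x if x in top_sports else "Others")
before_assignment = list(df.columns)

# Only the explicit assignment stores the result in the DataFrame.
df["sport_modified"] = returned
```

Until the last line runs, df has only the sport column, so calling apply without assigning the result simply discards it.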

Your final solution can be expressed more efficiently with pd.Series.isin, negated via ~:

L = ['Athletics', 'Gymnastics', ...]

df.loc[~df['sport'].isin(L), 'sport_modified'] = 'Others'
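The isin approach can also take the value_counts index directly, so the list does not have to be typed by hand. The following is a sketch on made-up data, with head(2) standing in for head(20); the Series.where line is an alternative I am adding for comparison, not something from the original answer.

```python
import pandas as pd

# Made-up data; top_n plays the role of top20_sport (here: top 2).
df = pd.DataFrame(
    {"sport": ["Athletics", "Swimming", "Curling", "Athletics", "Swimming", "Luge"]}
)
top_n = df["sport"].value_counts().head(2).index

# Copy the column, then overwrite every row whose sport is not in top_n.
df["sport_modified"] = df["sport"]
df.loc[~df["sport"].isin(top_n), "sport_modified"] = "Others"

# Alternative: where keeps values where the condition holds, else uses "Others".
alt = df["sport"].where(df["sport"].isin(top_n), "Others")
```

Both forms are vectorized, so they avoid calling a Python lambda once per row.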

Answer 1: 2 votes, accepted, 2018-08-09 07:03:05
