Change values in a pandas DataFrame based on value_counts()

Question

I have the following pandas DataFrame:

import pandas as pd 
from pandas import Series, DataFrame

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
              'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
              'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

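I want to change the values in columns Qu1, Qu2, and Qu3 based on value_counts(), keeping a value only when its count is greater than or equal to some number.

For example, for the Qu1 column: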

>>> data.Qu1.value_counts() >= 2
cheese     True
potato     True
banana     True
apple     False
egg       False

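I want to keep the values cheese, potato, and banana, because each of them occurs at least twice.

I want to replace the values apple and egg with the value other.

For column Qu2 nothing changes: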

>>> data.Qu2.value_counts() >= 2
banana     True
apple      True
sausage    True

The final result should match the attached test_data:

test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
                  'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                  'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

Thanks!

Answer 1

I would create a DataFrame of the same shape in which each entry is the count of the corresponding value within its column:

data.apply(lambda x: x.map(x.value_counts()))
Out[229]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    2    4    3
2    3    3    1
3    2    3    3
4    3    3    3
5    2    2    3
6    3    4    3
7    2    4    3
8    1    4    1
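
To see why x.map(x.value_counts()) yields per-row counts, it may help to look at a single column. The following is a minimal sketch, assuming a Series rebuilt from the question's Qu1 column; the names qu1, counts, and per_row_counts are illustrative and not part of the original answer.

```python
import pandas as pd

# The Qu1 column from the question, rebuilt here so the sketch is self-contained.
qu1 = pd.Series(['apple', 'potato', 'cheese', 'banana', 'cheese',
                 'banana', 'cheese', 'potato', 'egg'], name='Qu1')

# value_counts() returns a Series indexed by the distinct values.
counts = qu1.value_counts()

# map() looks each element up in that index, so every row receives
# the number of times its own value occurs in the column.
per_row_counts = qu1.map(counts)

print(per_row_counts.tolist())  # [1, 2, 3, 2, 3, 2, 3, 2, 1]
```

Applying the same lambda to every column with DataFrame.apply gives the table of counts shown above.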

Then use DataFrame.where to keep the entries whose count is at least 2 and return "other" for the rest:

data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other
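
The same idea can be wrapped in a small helper so the threshold and the replacement label become parameters. This is only a sketch under stated assumptions: the function name replace_rare and its parameters min_count and fill are hypothetical and do not appear in the original answer.

```python
import pandas as pd

def replace_rare(frame, min_count=2, fill='other'):
    """Replace values that occur fewer than min_count times in their column."""
    # Same-shape frame holding, for each cell, the count of its value in that column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Keep cells where the count reaches the threshold; use fill elsewhere.
    return frame.where(counts >= min_count, fill)

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

# The expected result given in the question.
test_data = pd.DataFrame({
    'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

result = replace_rare(data)
print(result.equals(test_data))
```

Calling replace_rare(data, min_count=3) would additionally replace potato and banana in Qu1 and sausage in Qu2, since those values occur only twice.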

Answer 2

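You could first compute the value counts for every column (here df is the DataFrame the question calls data):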

value_counts = df.apply(lambda x: x.value_counts())

         Qu1  Qu2  Qu3
apple    1.0  3.0  1.0
banana   2.0  4.0  NaN
cheese   3.0  NaN  3.0
egg      1.0  NaN  1.0
potato   2.0  NaN  3.0
sausage  NaN  2.0  1.0

Then build a dictionary containing the replacements for each column:

from itertools import cycle
replacements = {}
for col, s in value_counts.items():
    if s[s<2].any():
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))

replacements
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}

Use the dictionary to replace the values:

df.replace(replacements)

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other

Or fold the loop into a dictionary comprehension:

from itertools import cycle

df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
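
As a variation on the replacement dictionary, dict.fromkeys can stand in for the zip and cycle pairing. The sketch below is an alternative under that assumption, not the answer's own code; it counts each column separately rather than going through the combined value_counts frame, and the names rare and result are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})
df = data  # this answer refers to the frame as df

replacements = {}
for col in df.columns:
    counts = df[col].value_counts()
    # Values that occur fewer than twice in this column.
    rare = counts[counts < 2].index
    if len(rare) > 0:
        # Map every rare value to the same label, without zip/cycle.
        replacements[col] = dict.fromkeys(rare, 'other')

# A nested dict {column: {old: new}} restricts each mapping to its own column.
result = df.replace(replacements)
print(replacements)
```

Columns with no rare values, such as Qu2 here, get no entry in the dictionary and are left unchanged by replace.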

However, this is not only more cumbersome than using .where, it is also slower. Testing with 3,000 columns:

df = pd.concat([df for i in range(1000)], axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 3000 entries, Qu1 to Qu3
dtypes: object(3000)

Using .replace():

%%timeit
value_counts = df.apply(lambda x: x.value_counts())
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})

1 loop, best of 3: 4.97 s per loop

versus .where():

%%timeit
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")

1 loop, best of 3: 2.01 s per loop
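
The %%timeit cells above only run in IPython or Jupyter. A rough equivalent with the standard timeit module might look like the sketch below. It is an assumption-laden variant rather than a reproduction of the benchmark: the column names are suffixed to keep them unique, the frame is smaller (n_copies is a hypothetical parameter), the replace path uses dict.fromkeys instead of zip and cycle, and the helper names with_where and with_replace are illustrative. Timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

# Widen the frame; suffixing keeps column names unique (an assumption of this sketch).
n_copies = 50
wide = pd.concat([data.add_suffix(f'_{i}') for i in range(n_copies)], axis=1)

def with_where(frame):
    # Count-per-cell frame, then keep cells whose count is at least 2.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    return frame.where(counts >= 2, 'other')

def with_replace(frame):
    # Per-column replacement dictionaries for values occurring fewer than twice.
    replacements = {}
    for col in frame.columns:
        counts = frame[col].value_counts()
        rare = counts[counts < 2].index
        if len(rare) > 0:
            replacements[col] = dict.fromkeys(rare, 'other')
    return frame.replace(replacements)

# Total wall-clock time for three runs of each approach.
t_where = timeit.timeit(lambda: with_where(wide), number=3)
t_replace = timeit.timeit(lambda: with_replace(wide), number=3)
print(f'where: {t_where:.3f} s, replace: {t_replace:.3f} s')
```

Raising n_copies to 1000 gives 3,000 columns, the width used in the measurements above.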

Solution 1
10 accepted 2016-05-15 15:57:25

Solution 2
2 2016-05-15 15:01:18
