pandas: use value_counts in an apply function

Here is a toy example of my pandas dataframe:

    country_market           language_market
0   United States            English
1   United States            French
2   Not used                 Not used
3   Canada OR United States  English
4   Germany                  English
5   United Kingdom           French
6   United States            German
7   United Kingdom           English
8   United Kingdom           English
9   Not used                 Not used
10  United States            French
11  United States            English
12  United Kingdom           English
13  United States            French
14  Not used                 English
15  Not used                 English
16  United States            French
17  United States            Not used
18  Not used                 English
19  United States            German

I want to add a column top_country that shows whether the value in country_market is one of the two most common countries in the data. If it is, I want the new top_country column to show the value in country_market; if not, I want it to show "Other". I want to repeat this process for language_market (and a whole load of other market columns I don't show here).

This is how I'd like the data to look after processing:

    country_market           language_market  top_country    top_language
0   United States            English          United States  English
1   United States            French           United States  French
2   Not used                 Not used         Not used       Other
3   Canada OR United States  English          Other          English
4   Germany                  English          Other          English
5   United Kingdom           French           Other          French
6   United States            German           United States  Other
7   United Kingdom           English          Other          English
8   United Kingdom           English          Other          English
9   Not used                 Not used         Not used       Other
10  United States            French           United States  French
11  United States            English          United States  English
12  United Kingdom           English          Other          English
13  United States            French           United States  French
14  Not used                 English          Not used       English
15  Not used                 English          Not used       English
16  United States            French           United States  French
17  United States            Not used         United States  Other
18  Not used                 English          Not used       English
19  United States            German           United States  Other

I wrote a function, original_top_markets_function, to do this, but I couldn't figure out how to pass the value_counts part of my function to pandas apply. I kept getting AttributeError: 'str' object has no attribute 'value_counts'.

def original_top_markets_function(x):
    top2 = x.value_counts().nlargest(2).index
    for i in x:
        if i in top2:
            return i
        else:
            return 'Other'

I know this is because apply passes each element of my target column to the function one at a time, but the function also needs to see the whole column at once so that it can use value_counts. I don't know how to do that.
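The difference can be seen in a small sketch. The frame `demo` and the helper `describe` are made up for illustration; the point is only that Series.apply calls the function once per element, while DataFrame.apply (with its default axis) calls it once per column and passes the column as a Series.

```python
import pandas as pd

# Hypothetical three-row frame, not the data from the question
demo = pd.DataFrame({"country_market": ["United States", "Not used", "United States"]})

def describe(x):
    # Report the type of whatever apply hands to the function
    return type(x).__name__

# Series.apply: the function receives each element (a str here),
# so calling x.value_counts() inside it raises AttributeError
per_element = demo["country_market"].apply(describe)

# DataFrame.apply: the function receives each whole column as a Series,
# so x.value_counts() is available inside it
per_column = demo[["country_market"]].apply(describe)

print(per_element.tolist())  # ['str', 'str', 'str']
print(per_column.tolist())   # ['Series']
```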

So I came up with this top_markets function as a workaround. It builds a list, which does what I want but isn't very efficient. I'll need to apply this function to lots of different market columns, so I'd like something more Pythonic.

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2:
            results.append(i)
        else:
            results.append('Other')
    return results

Here's a reproducible example. Can someone please help me fix my top_markets function so I can use it with apply?

import pandas as pd

d = {0: {'country_market': 'United States', 'language_market': 'English'},
 1: {'country_market': 'United States', 'language_market': 'French'},
 2: {'country_market': 'Not used', 'language_market': 'Not used'},
 3: {'country_market': 'Canada OR United States',
  'language_market': 'English'},
 4: {'country_market': 'Germany', 'language_market': 'English'},
 5: {'country_market': 'United Kingdom', 'language_market': 'French'},
 6: {'country_market': 'United States', 'language_market': 'German'},
 7: {'country_market': 'United Kingdom', 'language_market': 'English'},
 8: {'country_market': 'United Kingdom', 'language_market': 'English'},
 9: {'country_market': 'Not used', 'language_market': 'Not used'},
 10: {'country_market': 'United States', 'language_market': 'French'},
 11: {'country_market': 'United States', 'language_market': 'English'},
 12: {'country_market': 'United Kingdom', 'language_market': 'English'},
 13: {'country_market': 'United States', 'language_market': 'French'},
 14: {'country_market': 'Not used', 'language_market': 'English'},
 15: {'country_market': 'Not used', 'language_market': 'English'},
 16: {'country_market': 'United States', 'language_market': 'French'},
 17: {'country_market': 'United States', 'language_market': 'Not used'},
 18: {'country_market': 'Not used', 'language_market': 'English'},
 19: {'country_market': 'United States', 'language_market': 'German'}}

df = pd.DataFrame.from_dict(d, orient='index')

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2: 
            results.append(i)
        else: 
            results.append('Other')         
    return results

df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])

df

I think you can just use np.where together with isin (this needs NumPy imported):

import numpy as np

df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')
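A pandas-only variant of the same idea is sketched below. It swaps np.where for Series.where, which keeps values where the condition is True and substitutes the second argument elsewhere. The Series `s` is a made-up sample rather than the question's data.

```python
import pandas as pd

# Hypothetical sample column
s = pd.Series(["United States", "Not used", "Germany",
               "United States", "Not used", "United Kingdom"])

# The two most frequent values in the column
top2 = s.value_counts().nlargest(2).index

# Keep values that are in top2, replace everything else with 'Other'
result = s.where(s.isin(top2), "Other")

print(result.tolist())
# ['United States', 'Not used', 'Other', 'United States', 'Not used', 'Other']
```

Because the result is a Series with the original index, it can be assigned straight to a new column.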

If you wish to use your own function, select the column with double brackets so that apply receives a DataFrame and passes the whole column to top_markets:

df['top_country'] = df[['country_market']].apply(top_markets)
df['top_language'] = df[['language_market']].apply(top_markets)

#OR
df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)

Edit, as per the discussion in the comments:

def top_markets(x, top):
    # x is a single value; top holds the most common values for its column
    if x in top:
        return x
    else:
        return 'Other'

top_country = df['country_market'].value_counts().nlargest(2).index
top_languages = df['language_market'].value_counts().nlargest(2).index

df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))
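The same element-wise approach can also be written without a lambda, since Series.apply forwards extra positional arguments through its `args` parameter. The sketch below is an illustration under assumptions: `keep_top` and the sample Series `s` are hypothetical names, and converting the top values to a set is a choice made here for simple membership tests, not something the answer above requires.

```python
import pandas as pd

# Hypothetical sample column
s = pd.Series(["English", "French", "German", "English", "French", "English"])

def keep_top(value, top):
    # Return the value unchanged if it is one of the top values, else 'Other'
    return value if value in top else "Other"

# Compute the top values once, outside the per-element function
top_set = set(s.value_counts().nlargest(2).index)

# args supplies the extra positional argument passed after each element
result = s.apply(keep_top, args=(top_set,))

print(result.tolist())
# ['English', 'French', 'Other', 'English', 'French', 'English']
```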

If you need to process multiple columns with DataFrame.apply, pass a function that works on a whole column, for example a lambda (np is NumPy, imported with import numpy as np):

cols = ['language_market', 'country_market']

f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(f).add_prefix('total_'))

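The same solution without a lambda function (use this instead of the lambda version, not after it, because joining the total_ columns a second time would raise an error about overlapping columns):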

def top_markets(x):
    return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')

df = df.join(df[cols].apply(top_markets).add_prefix('total_'))
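To get the top_country / top_language names asked for in the question across many market columns, the column-wise approach can be extended as in the sketch below. This is one possible arrangement, not the answerer's code: it assumes every relevant column name ends in `_market`, uses Series.where in place of np.where, and adds a hypothetical `n` parameter. The small frame is invented for the example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "country_market": ["United States", "United States", "Not used", "Germany", "Not used"],
    "language_market": ["English", "French", "English", "German", "French"],
})

def top_markets(x, n=2):
    # x is a whole column (Series); keep its n most common values, else 'Other'
    top = x.value_counts().nlargest(n).index
    return x.where(x.isin(top), "Other")

# Assumption: all columns to process share the '_market' suffix
suffix = "_market"
market_cols = [c for c in df.columns if c.endswith(suffix)]

tops = df[market_cols].apply(top_markets)
# Rename 'country_market' -> 'top_country', and so on
tops.columns = ["top_" + c[:-len(suffix)] for c in market_cols]

df = df.join(tops)
print(df)
```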
