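New column in pandas dataframe based on existing column values with conditions list
Following this link: New column in pandas dataframe based on existing column values
I have a dataframe with a column named 'Country' that lists several countries of the world. I need to create another column with a region specifier such as 'Europe'. I have lists of countries belonging to several regions, so if the country in df['Country'] matches a country in the 'Europe' list, the 'Europe' specifier should be inserted into the new column df['Region'].
My data is: https://sendeyo.com/up/d/2acd2eb849
The problem is that the solutions given at the previous link work on the example dataframe but not on my data. My dataframe looks like this: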
Year Country Population GDP
1870 Austria 4,520 8,419
1870 Belgium 5,096 13,716
1870 Denmark 1,888 3,782
1870 Finland 1,754 1,999
1870 France 38,440 72,100
My lists:
Europa = ["Austria", "Belgium", "Denmark"]
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
Expected result:
Year Country Population GDP Region
1870 Austria 4,520 8,419 Europa
1870 Belgium 5,096 13,716 Europa
1870 Denmark 1,888 3,782 Europa
1870 Finland 1,754 1,999 Europa
1870 France 38,440 72,100 Europa
This is the code I tried:
def Continent(country):
    return "Europa" if country in Europa else "Unknown"
df['Region'] = df['Country'].apply(Continent)
Thanks.
Use Series.map with a dictionary created from the lists:
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['Region'] = df['Country'].map(d1)
print (df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
print (d1)
{'Austria': 'Europa', 'Belgium': 'Europa', 'Denmark': 'Europa',
'France': 'Europa', 'Finland': 'Europa',
'Australia': 'RamasOccidentales',
'New Zealand': 'RamasOccidentales',
'Canada': 'RamasOccidentales',
'United States': 'RamasOccidentales',
'Brazil': 'Latinoamerica', 'Chile': 'Latinoamerica',
'Uruguay': 'Latinoamerica', 'Indonesia': 'Asia',
'Japan': 'Asia', 'Sri Lanka': 'Asia'}
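Note that Series.map leaves NaN for any country that has no entry in the dictionary. The following is a minimal, self-contained sketch of the same idea that also labels such rows; the names `regions`, `country_to_region`, and `sample`, and the shortened lists, are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample: Finland and France are deliberately left out of Europa
# to show what happens to countries missing from every list.
regions = {
    "Europa": ["Austria", "Belgium", "Denmark"],
    "Latinoamerica": ["Brazil", "Chile", "Uruguay"],
}

# Invert {region: [countries]} into {country: region}.
country_to_region = {
    country: region
    for region, countries in regions.items()
    for country in countries
}

sample = pd.DataFrame(
    {"Country": ["Austria", "Belgium", "Denmark", "Finland", "France", "Chile"]}
)

# map() leaves NaN where a country has no entry; fillna() labels those rows.
sample["Region"] = sample["Country"].map(country_to_region).fillna("Unknown")
print(sample)
```

If rows unexpectedly come out as "Unknown" on real data, comparing the unmatched country names against the lists (for example for stray whitespace or different spellings) may be a reasonable first check.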
Performance: about 2.58 times faster than np.select on 10k rows:
import numpy as np
import pandas as pd
np.random.seed(2019)
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df = pd.DataFrame({'Country': np.random.choice(list(d1.keys()), size=10000)})
In [280]: %%timeit
...: d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
...:
...: df['Region'] = df['Country'].map(d1)
...:
3.04 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [281]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
...: values=[ key for key in classification_countries]
...:
...: df['Region']=np.select(cond,values)
...:
7.86 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [282]: %%timeit
...: cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
...: values=['Europa','RamasOccidentales','Latinoamerica','Asia']
...: df['Region']=np.select(cond,values)
...:
7.96 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [293]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
...:
...:
...: df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
...:
8.54 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
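The timings above come from IPython's %%timeit magic. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard-library timeit module. The helper names `with_map` and `with_select`, the `bench` dataframe, and the repeat counts are assumptions for illustration; absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)

regions = {
    "Europa": ["Austria", "Belgium", "Denmark", "France", "Finland"],
    "RamasOccidentales": ["Australia", "New Zealand", "Canada", "United States"],
    "Latinoamerica": ["Brazil", "Chile", "Uruguay"],
    "Asia": ["Indonesia", "Japan", "Sri Lanka"],
}
lookup = {c: r for r, cs in regions.items() for c in cs}

# 10k random countries, mirroring the benchmark setup above.
bench = pd.DataFrame({"Country": np.random.choice(list(lookup), size=10_000)})


def with_map():
    # Dictionary lookup per value.
    return bench["Country"].map(lookup)


def with_select():
    # One isin() mask per region, combined by np.select.
    conds = [bench["Country"].isin(cs) for cs in regions.values()]
    return np.select(conds, list(regions), default="Unknown")


for fn in (with_map, with_select):
    # Best of 5 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(fn, number=20, repeat=5)) / 20
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```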
Use np.select + Series.isin:
Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]
#using np.select
cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
values=['Europa','RamasOccidentales','Latinoamerica','Asia']
df['Region']=np.select(cond,values)
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
You can also use a dictionary to build the lists of conditions and values. It's faster:
classification_countries={'Europa':Europa,
'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica ,
'Asia':Asia}
dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
Or:
classification_countries={'Europa':Europa,
'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica ,
'Asia':Asia}
cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]
df['Region']=np.select(cond,values)
print(df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa
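The np.select calls above rely on the function's default value (0) for rows that match no condition. Depending on the NumPy version, mixing string choices with that integer default may yield a '0' label or raise a dtype error, so passing an explicit string default is probably safer. A small sketch, where the `demo` dataframe and the unmatched "Peru" row are made up for illustration:

```python
import numpy as np
import pandas as pd

regions = {
    "Europa": ["Austria", "Belgium", "Denmark", "France", "Finland"],
    "Asia": ["Indonesia", "Japan", "Sri Lanka"],
}

# "Peru" is a hypothetical row that belongs to none of the lists.
demo = pd.DataFrame({"Country": ["Austria", "Japan", "Peru"]})

# One boolean mask per region, in the same order as the labels.
conditions = [demo["Country"].isin(countries) for countries in regions.values()]
labels = list(regions.keys())

# An explicit string default keeps every output value a string;
# rows matching no condition get "Unknown".
demo["Region"] = np.select(conditions, labels, default="Unknown")
print(demo)
```

Wrapping the dictionary keys in list() also avoids passing dict views to np.select, which expects list-like arguments.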
Timing comparison with jezrael's solution, measured from after the dictionaries are created through print(df):
%%timeit
dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
print(df)
#5.06 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
values=[ key for key in classification_countries]
df['Region']=np.select(cond,values)
print(df)
#5.18 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@jezrael
%%timeit
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df['Region'] = df['Country'].map(d1)
print (df)
#7.88 ms ± 824 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
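A very similar alternative is to use a dictionary-based lookup to determine the region. In this implementation you create a dictionary with the countries as keys and their corresponding regions as values.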
region_map = {
'Austria': 'Europa',
'Brazil': 'Latinoamerica',
'Japan': 'Asia' # so on and so forth
}
df['Region'] = df['Country'].apply(lambda c: region_map.get(c, 'Unknown'))
This yields the corresponding region from your dictionary, or the string 'Unknown' if no key-value pair exists.
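As a quick sanity check, the lookup with .get can be compared against Series.map followed by fillna, which should give the same labels. This is only a sketch: the `check` dataframe and the unmatched "Peru" value are invented for the example.

```python
import pandas as pd

region_map = {"Austria": "Europa", "Brazil": "Latinoamerica", "Japan": "Asia"}

# "Peru" is a hypothetical country with no entry in region_map.
check = pd.DataFrame({"Country": ["Austria", "Japan", "Peru"]})

# Per-row Python lookup with a fallback value.
via_apply = check["Country"].apply(lambda c: region_map.get(c, "Unknown"))

# Dictionary-based map; missing keys become NaN and are then filled.
via_map = check["Country"].map(region_map).fillna("Unknown")

# Compare the resulting labels element by element.
print(via_apply.tolist() == via_map.tolist())
```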