How to assign labels according to a list in python pandas.DataFrame?

I have two DataFrames: 'recipe', which holds combinations of ingredients, and 'like', which contains the popular combinations.

recipe = pd.DataFrame({'A': ['chicken','beef','pork','egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
recipe
         A      B
0  chicken  sweet
1     beef    hot
2     pork  salty
3      egg    hot
4  chicken  sweet
5      egg  salty
6     beef    hot 

like = pd.DataFrame({'A':['beef', 'egg'], 'B':['hot', 'salty']})
like
      A      B
0  beef    hot
1   egg  salty

How can I add a column 'C' to recipe so that a row gets the value 'yes' if its combination is listed in 'like', and 'no' otherwise?

The result I want is

recipe
         A      B    C
0  chicken  sweet   no
1     beef    hot  yes
2     pork  salty   no
3      egg    hot   no
4  chicken  sweet   no
5      egg  salty  yes
6     beef    hot  yes

The problem is that both of my DataFrames are large, so I cannot manually pick the items in 'like' and assign the 'yes' label in 'recipe'. Is there an easy way to do that?

You can use merge with indicator=True and numpy.where:

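import numpy as np
import pandas as pd

# indicator=True adds a _merge column: 'both' marks rows found in like
df = pd.merge(recipe, like, on=['A','B'], indicator=True, how='left')
print(df)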
         A      B     _merge
0  chicken  sweet  left_only
1     beef    hot       both
2     pork  salty  left_only
3      egg    hot  left_only
4  chicken  sweet  left_only
5      egg  salty       both
6     beef    hot       both

df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')

print(df[['A','B','C']])
         A      B    C
0  chicken  sweet   no
1     beef    hot  yes
2     pork  salty   no
3      egg    hot   no
4  chicken  sweet   no
5      egg  salty  yes
6     beef    hot  yes
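
The merge-and-indicator steps above can be wrapped into one reusable function. This is only a sketch: the name label_matches and its parameters are made up for illustration, and dropping duplicate rows from the lookup table is an added assumption so that the left merge cannot multiply rows.

```python
import numpy as np
import pandas as pd


def label_matches(left, right, on, label='C'):
    """Return a copy of `left` with `label` set to 'yes' where the `on`
    columns match some row of `right`, and 'no' otherwise."""
    # Assumption: only the key columns of `right` matter, and repeated
    # key rows are dropped so each row of `left` appears exactly once.
    lookup = right[on].drop_duplicates()
    merged = left.merge(lookup, on=on, how='left', indicator=True)
    merged[label] = np.where(merged['_merge'] == 'both', 'yes', 'no')
    # Remove the helper column added by indicator=True.
    return merged.drop(columns='_merge')


recipe = pd.DataFrame({'A': ['chicken', 'beef', 'pork', 'egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
like = pd.DataFrame({'A': ['beef', 'egg'], 'B': ['hot', 'salty']})

labelled = label_matches(recipe, like, on=['A', 'B'])
print(labelled)
```

The original recipe frame is left untouched; the function returns a new frame with columns A, B and C.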

Comparing with df['_merge'] == 'both' is faster than using np.in1d:

In [460]: %timeit np.where(np.in1d(df['_merge'],'both'), 'yes', 'no')
100 loops, best of 3: 2.22 ms per loop

In [461]: %timeit np.where(df['_merge'] == 'both', 'yes', 'no')
1000 loops, best of 3: 652 µs per loop
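
Outside IPython, a comparison like this can be approximated with the standard timeit module. The sketch below assumes the small example frames from the question and uses np.isin in place of np.in1d, which newer NumPy releases deprecate. The function names are made up for illustration, and absolute timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

recipe = pd.DataFrame({'A': ['chicken', 'beef', 'pork', 'egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
like = pd.DataFrame({'A': ['beef', 'egg'], 'B': ['hot', 'salty']})
df = pd.merge(recipe, like, on=['A', 'B'], indicator=True, how='left')


def with_isin():
    # Membership test against a single value.
    return np.where(np.isin(df['_merge'], 'both'), 'yes', 'no')


def with_eq():
    # Direct elementwise comparison.
    return np.where(df['_merge'] == 'both', 'yes', 'no')


# Run each variant a fixed number of times and report total seconds.
runs = 200
print('isin:', timeit.timeit(with_isin, number=runs))
print('==  :', timeit.timeit(with_eq, number=runs))
```

Both variants produce the same labels, so only the speed differs.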

You could add a C column of 'yes' values to like and then merge recipe with like. Rows that match will have 'yes' in the C column; rows without a match will have NaN. You can then use fillna to replace the NaNs with 'no':

import pandas as pd
recipe = pd.DataFrame({'A': ['chicken','beef','pork','egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})

like = pd.DataFrame({'A':['beef', 'egg'], 'B':['hot', 'salty']})
like['C'] = 'yes'
result = pd.merge(recipe, like, how='left').fillna('no')
print(result)

yields

         A      B    C
0  chicken  sweet   no
1     beef    hot  yes
2     pork  salty   no
3      egg    hot   no
4  chicken  sweet   no
5      egg  salty  yes
6     beef    hot  yes
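
One thing to keep in mind with this approach is that fillna('no') on the whole frame also overwrites missing values in any other column. Below is a minimal sketch of a variant that fills only the C column. The note column is a hypothetical extra column added here purely to show the difference, and like.assign is used so that like itself is not modified.

```python
import pandas as pd

# `note` is a made-up column containing a genuine missing value.
recipe = pd.DataFrame({'A': ['chicken', 'beef', 'pork'],
                       'B': ['sweet', 'hot', 'salty'],
                       'note': ['crispy', None, 'smoked']})
like = pd.DataFrame({'A': ['beef'], 'B': ['hot']})

# assign() returns a new frame, leaving `like` without a C column.
like_flagged = like.assign(C='yes')

result = pd.merge(recipe, like_flagged, on=['A', 'B'], how='left')
# Fill only the label column; missing values elsewhere stay missing.
result['C'] = result['C'].fillna('no')
print(result)
```

Here the missing note for the beef row survives, while unmatched rows still get 'no' in C.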

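You can use set_value by matching both A and B as shown below (note that DataFrame.set_value was deprecated in pandas 0.21 and removed in 1.0, so this only runs on older versions):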

recipe.set_value(recipe[recipe.A.isin(like.A) & recipe.B.isin(like.B)].index,'C','yes')
recipe.fillna('no')

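Which will give you the following. Note that row 3 (egg, hot) is labelled 'yes' even though that pair is not in like, because the two isin checks test A and B independently: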

         A      B    C
0  chicken  sweet   no
1     beef    hot  yes
2     pork  salty   no
3      egg    hot  yes
4  chicken  sweet   no
5      egg  salty  yes
6     beef    hot  yes
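
To match (A, B) as a pair rather than column by column, one option is to compare tuples. This is a sketch of an alternative, not the answerer's method: it builds a set of the liked pairs and tests each recipe row against it, without needing set_value.

```python
import numpy as np
import pandas as pd

recipe = pd.DataFrame({'A': ['chicken', 'beef', 'pork', 'egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
like = pd.DataFrame({'A': ['beef', 'egg'], 'B': ['hot', 'salty']})

# Set of liked (A, B) combinations for fast membership tests.
liked_pairs = set(zip(like['A'], like['B']))

# True only when the whole (A, B) pair of a row is in `like`.
mask = [pair in liked_pairs for pair in zip(recipe['A'], recipe['B'])]

recipe['C'] = np.where(mask, 'yes', 'no')
print(recipe)
```

With this mask, (egg, hot) is labelled 'no', which matches the result requested in the question.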

Note: These results do not mean my answer is better than the other ones, or vice versa.

Using set_value:

%timeit recipe.set_value(recipe[recipe.A.isin(like.A) & recipe.B.isin(like.B)].index,'C','yes'); recipe.fillna('no')
100 loops, best of 3: 2.69 ms per loop

Using merge and creating a new df:

%timeit df = pd.merge(recipe, like, on=['A','B'], indicator=True, how='left'); df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')
100 loops, best of 3: 8.42 ms per loop

Timing only the np.where step, on a df that has already been merged:

%timeit df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')
1000 loops, best of 3: 187 µs per loop

Again, it really depends on what you're timing. Just be cautious about duplicating your data.
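
One way to read that caution is that the merge creates a second full-size DataFrame. The sketch below writes the label back onto recipe and discards the merged frame. It assumes like has no repeated (A, B) pairs, so the left merge returns exactly one row per recipe row in the same order; validate='many_to_one' makes pandas raise an error if that assumption is violated.

```python
import numpy as np
import pandas as pd

recipe = pd.DataFrame({'A': ['chicken', 'beef', 'pork', 'egg', 'chicken', 'egg', 'beef'],
                       'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
like = pd.DataFrame({'A': ['beef', 'egg'], 'B': ['hot', 'salty']})

# validate='many_to_one' raises MergeError if `like` repeats a key pair,
# which would otherwise multiply rows of `recipe` in the result.
merged = pd.merge(recipe, like, on=['A', 'B'], how='left',
                  indicator=True, validate='many_to_one')

# Assign by position so the label lands on the original frame.
is_liked = (merged['_merge'] == 'both').to_numpy()
recipe['C'] = np.where(is_liked, 'yes', 'no')

# Drop the temporary merged copy once the label has been extracted.
del merged
print(recipe)
```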
