Pandas: compare all values of a column with a different DataFrame and return the column name (of the other DataFrame) where the value matches

I'm trying to assign a category to each row of a dataset based on matching keywords from another dataset.

  1. I compare df_ONE['columnname'] to every value of df_TWO,
  2. and if a matching value is found, I use the name of the df_TWO column that contains it as the cell value of a new column in df_ONE.

With the example below, every value of the new column would be non_sport (the name of the df_TWO column where the value is found).

df_ONE

Heroes      The Punisher        
Heroes      The Punisher        
Heroes      Human Torch - 1     
Heroes      Man Thing           
Heroes      Medusa              
Heroes      Mr. Fantastic       
Movies-TV   Star Wars           
Movies-TV   Star Wars

df_TWO

         sport  non_sport   gaming
0     baseball  movies-tv  pokemon
1   basketball      music   yugioh
2     football     people    magic
3       hockey    history   gaming
4       soccer     heroes      NaN
5       racing        NaN      NaN
6       boxing        NaN      NaN
7         golf        NaN      NaN
8          mma        NaN      NaN
9   multisport        NaN      NaN
10      tennis        NaN      NaN
11   wrestling        NaN      NaN
12       poker        NaN      NaN

It would be nice to get this result:

Heroes      The Punisher        non_sport
Heroes      The Punisher        non_sport
Heroes      Human Torch - 1     non_sport
Heroes      Man Thing           non_sport
Heroes      Medusa              non_sport
Heroes      Mr. Fantastic       non_sport
Movies-TV   Star Wars           non_sport
Movies-TV   Star Wars           non_sport

I've tried to adapt the following solutions but had no luck:

  • keywords.columns[keywords.eq('heroes').any()]
  • (keywords == 'pokemon').idxmax(axis=1)[0]

into something like

  • df[new_column] = df[category_column].isin(keywords).any()

You need to reshape your second DataFrame from wide to long format. You can do this with melt pretty easily.

Here is an example of what the melted df looks like:

  col_match       genre
0     sport    baseball
1     sport  basketball
2     sport    football
3     sport      hockey
4     sport      soccer
5     sport      racing
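
As a rough sketch of how such a long-format frame can be produced: the snippet below uses a shortened, hypothetical subset of the keyword table (three rows instead of thirteen), and the dropna step is an optional extra that just removes the NaN padding left by the shorter columns.

```python
import numpy as np
import pandas as pd

# Shortened stand-in for the keyword table (hypothetical subset of df_TWO)
df2 = pd.DataFrame({
    'sport': ['baseball', 'basketball', 'football'],
    'non_sport': ['movies-tv', 'music', np.nan],
    'gaming': ['pokemon', np.nan, np.nan]})

# Wide -> long: one row per (column name, keyword) pair
melted = df2.melt(var_name='col_match', value_name='genre')

# Drop the NaN padding (optional; assumed here only to keep the frame tidy)
melted = melted.dropna(subset=['genre']).reset_index(drop=True)
print(melted)
```

Each keyword now sits next to the name of the column it came from, which is what makes a join on genre possible.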

You can then merge the melted df with the original one on genre. Be sure to lowercase the genre column in the first df beforehand: the keywords in the second df are all lowercase and the merge is case-sensitive.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'genre': ['Heroes',  'Heroes',  'Heroes',  'Heroes',  'Heroes',  'Heroes',  'Movies-TV',  'Movies-TV'],
    'title': ['The Punisher',  'The Punisher',  'Human Torch - 1',  'Man Thing',  'Medusa',  'Mr. Fantastic',  'Star Wars',  'Star Wars']})

df2 = pd.DataFrame({
    'sport': ['baseball',  'basketball',  'football',  'hockey',  'soccer',  'racing',  'boxing',  'golf',  'mma',  'multisport',  'tennis',  'wrestling',  'poker'],
    'non_sport': ['movies-tv',  'music',  'people',  'history',  'heroes',  np.nan,  np.nan,  np.nan,  np.nan, np.nan,  np.nan,  np.nan,  np.nan],
    'gaming': ['pokemon',  'yugioh',  'magic',  'gaming',  np.nan,  np.nan,  np.nan,  np.nan,  np.nan,  np.nan,  np.nan,  np.nan,  np.nan]})

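# Lowercase the genre so it matches the lowercase keywords in df2
df['genre'] = df['genre'].str.lower()

# Melt df2 to long format (column name -> col_match, keyword -> genre), then inner-join on genre
df.merge(df2.melt(value_vars=df2.columns, var_name='col_match', value_name='genre'), on='genre')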

Output

       genre            title  col_match
0     heroes     The Punisher  non_sport
1     heroes     The Punisher  non_sport
2     heroes  Human Torch - 1  non_sport
3     heroes        Man Thing  non_sport
4     heroes           Medusa  non_sport
5     heroes    Mr. Fantastic  non_sport
6  movies-tv        Star Wars  non_sport
7  movies-tv        Star Wars  non_sport
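
Note that the merge above is an inner join, so any row whose genre has no keyword match would drop out of the result, and the genre column ends up lowercased. If you would rather keep every row and the original casing, one possible variation is to turn the melted frame into a keyword-to-column lookup and use Series.map. This is only a sketch: the 'Comics' row and the category column name are made up for illustration, and it assumes each keyword appears in only one column of the keyword table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'genre': ['Heroes', 'Movies-TV', 'Comics'],   # 'Comics' is a made-up unmatched row
    'title': ['The Punisher', 'Star Wars', 'Some Title']})

# Shortened, hypothetical subset of the keyword table
df2 = pd.DataFrame({
    'sport': ['baseball', 'basketball'],
    'non_sport': ['movies-tv', 'heroes'],
    'gaming': ['pokemon', np.nan]})

# Long format without NaN padding, then a Series indexed by keyword
melted = df2.melt(var_name='col_match', value_name='genre').dropna(subset=['genre'])
lookup = melted.set_index('genre')['col_match']

# Lowercase only for the lookup; the original 'genre' values stay untouched
df['category'] = df['genre'].str.lower().map(lookup)
print(df)
```

Rows without a match get NaN in the new column instead of disappearing, so you can fill them with a default label afterwards if needed.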
