I'm trying to assign a category to each row of one dataset by matching keywords from another dataset.
In the example below, every value in the new column would be non-sport (the name of the df_TWO column in which the value is found).
df_ONE
Heroes The Punisher
Heroes The Punisher
Heroes Human Torch - 1
Heroes Man Thing
Heroes Medusa
Heroes Mr. Fantastic
Movies-TV Star Wars
Movies-TV Star Wars
df_TWO
sport non_sport gaming
0 baseball movies-tv pokemon
1 basketball music yugioh
2 football people magic
3 hockey history gaming
4 soccer heroes NaN
5 racing NaN NaN
6 boxing NaN NaN
7 golf NaN NaN
8 mma NaN NaN
9 multisport NaN NaN
10 tennis NaN NaN
11 wrestling NaN NaN
12 poker NaN NaN
It would be nice to get this result:
Heroes The Punisher non-sport
Heroes The Punisher non-sport
Heroes Human Torch - 1 non-sport
Heroes Man Thing non-sport
Heroes Medusa non-sport
Heroes Mr. Fantastic non-sport
Movies-TV Star Wars non-sport
Movies-TV Star Wars non-sport
I've tried to adapt the following solutions but had no luck.
into something like
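You need to reshape your second dataframe from wide to long format, so that each keyword sits on its own row next to the name of the column it came from. You can do this with melt
pretty easily.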
Here is an example of what the melted df looks like:
col_match genre
0 sport baseball
1 sport basketball
2 sport football
3 sport hockey
4 sport soccer
5 sport racing
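As a rough sketch of how such a long table could be produced on its own: the small `lookup` frame below is a shortened, hypothetical stand-in for df_TWO, and the `dropna` step is an optional addition that removes the NaN padding left by columns of unequal length.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for df_TWO
lookup = pd.DataFrame({
    'sport': ['baseball', 'basketball', 'football'],
    'non_sport': ['movies-tv', 'heroes', np.nan],
})

# Wide -> long: one row per (column name, keyword) pair
melted = lookup.melt(var_name='col_match', value_name='genre')

# Optional: drop the NaN padding from the shorter column
melted = melted.dropna(subset=['genre']).reset_index(drop=True)
print(melted)
```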
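You can then merge the original dataframe with the melted one on the genre column. Be sure to lowercase the genre column in the first df beforehand, because the keywords in the second df are all lowercase and the merge matches exact strings.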
import pandas as pd
import numpy as np
df = pd.DataFrame({
'genre': ['Heroes', 'Heroes', 'Heroes', 'Heroes', 'Heroes', 'Heroes', 'Movies-TV', 'Movies-TV'],
'title': ['The Punisher', 'The Punisher', 'Human Torch - 1', 'Man Thing', 'Medusa', 'Mr. Fantastic', 'Star Wars', 'Star Wars']})
df2 = pd.DataFrame({
'sport': ['baseball', 'basketball', 'football', 'hockey', 'soccer', 'racing', 'boxing', 'golf', 'mma', 'multisport', 'tennis', 'wrestling', 'poker'],
'non_sport': ['movies-tv', 'music', 'people', 'history', 'heroes', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
'gaming': ['pokemon', 'yugioh', 'magic', 'gaming', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})
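# Lowercase the genre so it matches the lowercase keywords in df2
df['genre'] = df['genre'].str.lower()
# Melt df2 into (col_match, genre) pairs, then join on genre (inner join by default)
result = df.merge(df2.melt(value_vars=df2.columns, var_name='col_match', value_name='genre'), on='genre')
print(result)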
Output
genre title col_match
0 heroes The Punisher non_sport
1 heroes The Punisher non_sport
2 heroes Human Torch - 1 non_sport
3 heroes Man Thing non_sport
4 heroes Medusa non_sport
5 heroes Mr. Fantastic non_sport
6 movies-tv Star Wars non_sport
7 movies-tv Star Wars non_sport
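Note that the default merge is an inner join, so rows whose genre matches no keyword are dropped, and the genre column ends up lowercased. If that is not wanted, one possible variant (a sketch, not part of the original answer) is to turn the melted frame into a keyword-to-category lookup and apply it with `Series.map`. The `category` column name and the unmatched 'Comics' row are made up for illustration, and the approach assumes each keyword appears in only one column of the second dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'genre': ['Heroes', 'Movies-TV', 'Comics'],  # 'Comics' is a made-up unmatched genre
    'title': ['Medusa', 'Star Wars', 'Example'],
})
df2 = pd.DataFrame({
    'sport': ['baseball', 'basketball'],
    'non_sport': ['movies-tv', 'heroes'],
    'gaming': ['pokemon', np.nan],
})

# Long format without the NaN padding
melted = df2.melt(var_name='col_match', value_name='genre').dropna(subset=['genre'])

# Series indexed by keyword, holding the source column name
keyword_to_category = melted.set_index('genre')['col_match']

# Look up on a lowercased copy; the original 'genre' values stay untouched
df['category'] = df['genre'].str.lower().map(keyword_to_category)
print(df)
```

Unmatched rows are kept and get NaN in the new column, which makes them easy to spot afterwards.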