Pandas - matching values from a column in one dataframe to several columns in another dataframe and creating new columns from the original dataframe

I have two dataframes that can't be joined directly on any column, but one column of values in the first dataframe (dfA) may or may not match values in several columns of the second dataframe (dfB). The text_bod column in dfB holds especially long strings, averaging ~1500 characters.

The value1 and value2 columns in dfB are not always populated, even when a value applies to the row, but when a value does apply it can almost always be found somewhere in the text of the text_bod column. I'm trying to figure out whether the values in dfA exist in dfB.

If a value from dfA exists in dfB, I want to add the corresponding information from dfA as new columns on the dfB rows where the value is found. In the example below, I want to add 'name', 'color', and 'animal' columns to dfB and fill them with the respective names, colors, and animals for the values that are found.

This is what I've come up with so far:

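import re

from tqdm import tqdm

# Build and compile the pattern once instead of on every call
pattern = re.compile('|'.join(map(re.escape, dfA['value'])))

def extract(t):
    # Return the matched string (not the Match object); None for no match or null text
    m = pattern.search(t) if isinstance(t, str) else None
    return m.group(0) if m else None

tqdm.pandas()

dfB['value'] = dfB['text_bod'].progress_map(extract)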

I would love to hear any suggestions on how to 1) optimize this search and 2) add the info that corresponds to the matched values as new columns in dfB.

dfA (~200,000 rows)

    value   name     color         animal
0  es9bum  name1       red        dolphin
1    qgl8  name2  cerulean   mountaingoat
2   klkwv  name3  platinum   mantisshrimp
3   tokgs  name4   fuchsia      tarantula
4 cnwsaq5  name5     frost  gentoopenguin   

dfB (~1,500,000 rows)

   value1 value2              text_bod           
0    null  tokgs   here are some tokgs        
1    null   null      something es9bum 
2   klkwv   null     blahblahblahklkwv 
3    null   null    boop: qgl8. more&& 
4    null   null              hi it me
5    null   null   here are more words           
6   y2kbc   null       words and stuff
7    null   null          so much text
8    null   null   have a nice cnwsaq5 
9    null   null                  null

This is what I would like to output:

dfB (~1,500,000 rows)

   value1 value2              text_bod    name    color        animal         
0    null  tokgs   here are some tokgs   name4  fuchsia     tarantula
1    null   null      something es9bum   name1      red       dolphin
2   klkwv   null     blahblahblahklkwv   name3 platinum  mantisshrimp
3    null   null    boop: qgl8. more&&   name2 cerulean  mountaingoat
4    null   null              hi it me     NaN      NaN           NaN
5    null   null   here are more words     NaN      NaN           NaN 
6   y2kbc   null       words and stuff  name99     onyx      direwolf
7    null   null          so much text     NaN      NaN           NaN
8    null   null   have a nice cnwsaq5   name5    frost gentoopenguin
9    null   null                  null     NaN      NaN           NaN

We can use str.extract to pull the matching values out of your text_bod column, then use the extracted values as the key to merge df1 (your dfB) with dfA, which brings in the wanted columns.

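# Build one alternation pattern from all the lookup values
s = ('|').join(dfA['value'])

# Extract the first value found in each text into a temporary key column
df1['value'] = df1['text_bod'].str.extract('({})'.format(s))

# Left merge on the key, then drop the temporary key column
df1 = df1.merge(dfA, on='value', how='left').drop('value', axis=1)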

print(df1)
  value1 value2             text_bod   name     color        animal
0    NaN  tokgs  here are some tokgs  name4   fuchsia     tarantula
1    NaN    NaN     something es9bum  name1       red       dolphin
2  klkwv    NaN    blahblahblahklkwv  name3  platinum  mantisshrimp
3    NaN    NaN   boop: qgl8. more&&  name2  cerulean  mountaingoat
4    NaN    NaN             hi it me    NaN       NaN           NaN
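
For anyone who wants to run this end to end, here is a minimal self-contained sketch using the first five rows of the sample data from the question. Two things in it go beyond the snippet above and are assumptions on my part: it escapes the values with re.escape in case any of them contain regex metacharacters, and it lets an already-recorded value1 or value2 take priority over the value extracted from the text. The question does not say which should win, so adjust the order of the fillna calls if needed.

```python
import re

import pandas as pd

# Sample data copied from the question (first five rows of each frame).
dfA = pd.DataFrame({
    "value": ["es9bum", "qgl8", "klkwv", "tokgs", "cnwsaq5"],
    "name": ["name1", "name2", "name3", "name4", "name5"],
    "color": ["red", "cerulean", "platinum", "fuchsia", "frost"],
    "animal": ["dolphin", "mountaingoat", "mantisshrimp", "tarantula", "gentoopenguin"],
})
dfB = pd.DataFrame({
    "value1": [None, None, "klkwv", None, None],
    "value2": ["tokgs", None, None, None, None],
    "text_bod": ["here are some tokgs", "something es9bum", "blahblahblahklkwv",
                 "boop: qgl8. more&&", "hi it me"],
})

# Escape each value so regex metacharacters are matched literally.
pattern = "(" + "|".join(map(re.escape, dfA["value"])) + ")"

# expand=False returns a Series; rows without a match (or with null text) get NaN.
from_text = dfB["text_bod"].str.extract(pattern, expand=False)

# Assumption: a recorded value1/value2 takes priority over the text match.
dfB["value"] = dfB["value1"].fillna(dfB["value2"]).fillna(from_text)

# Left merge keeps every dfB row; unmatched rows get NaN in the new columns.
result = dfB.merge(dfA, on="value", how="left").drop(columns="value")
print(result)
```

Note that if dfA['value'] contains duplicates, the left merge will return one row per duplicate, so it may be worth deduplicating dfA first.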

If you have Python 3.6 or higher, the str.extract call can use an f-string instead of str.format, which makes the code a bit cleaner:

df1['value'] = df1['text_bod'].str.extract(f'({s})')
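
One caveat about the pattern itself: joining the raw values with | assumes that none of them contain regex metacharacters and that no value is a prefix of another. Regex alternation takes the first alternative that matches, not the longest, so a short value can shadow a longer one. A hedged sketch of a more defensive pattern builder follows; the values and texts in it are made up for illustration and are not from the question.

```python
import re

import pandas as pd

# Hypothetical lookup values: one is a prefix of another and one contains
# a regex metacharacter (the dot).
values = pd.Series(["qgl8", "qgl8x", "a.b"])

# Sort longest first so "qgl8x" is tried before its prefix "qgl8",
# and escape each value so it is matched literally.
ordered = sorted(values, key=len, reverse=True)
pattern = "(" + "|".join(re.escape(v) for v in ordered) + ")"

text = pd.Series(["boop: qgl8x. more&&", "see a.b here", "axb only"])
found = text.str.extract(pattern, expand=False)
print(found.tolist())
```

Without the escaping, the unescaped dot in "a.b" would also match "axb" in the third text; without the sorting, the first text would yield "qgl8" rather than "qgl8x".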

