I have two dataframes that can't be joined directly on any column, but one column of values in the first dataframe (dfA) might or might not match values in multiple columns of the second dataframe (dfB). The text_bod column in dfB holds especially large values, with an average string length of ~1500 characters.
The columns value1 and value2 in dfB do not always have a value recorded even when one exists, but if a value exists it will almost always be found somewhere in the text of the text_bod column. I'm trying to figure out whether the values in dfA exist in dfB.
If a value from dfA exists in dfB, I want to append some information from dfA to new columns in dfB, on the row where the value is found. In the example below, I want to add name, color, and animal columns to dfB and fill in the respective names, colors, and animals for the values that are found.
This is what I've come up with so far:
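import re
from tqdm import tqdm

def extract(t):
    # Joins every dfA value into one alternation (rebuilt on each call)
    s = ('|').join(dfA['value'])
    return re.search(s, t)

tqdm.pandas()
dfB['value'] = dfB['text_bod'].progress_map(extract)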
I would love to hear any suggestions on how to 1) optimize this search and 2) append the info that corresponds to the values to new columns in dfB.
dfA (~200,000 rows)
   value    name   color     animal
0  es9bum   name1  red       dolphin
1  qgl8     name2  cerulean  mountaingoat
2  klkwv    name3  platinum  mantisshrimp
3  tokgs    name4  fuchsia   tarantula
4  cnwsaq5  name5  frost     gentoopenguin
dfB (~1,500,000 rows)
   value1  value2  text_bod
0  null    tokgs   here are some tokgs
1  null    null    something es9bum
2  klkwv   null    blahblahblahklkwv
3  null    null    boop: qgl8. more&&
4  null    null    hi it me
5  null    null    here are more words
6  y2kbc   null    words and stuff
7  null    null    so much text
8  null    null    have a nice cnwsaq5
9  null    null    null
This is what I would like to output:
dfB (~1,500,000 rows)
   value1  value2  text_bod             name    color     animal
0  null    tokgs   here are some tokgs  name4   fuchsia   tarantula
1  null    null    something es9bum     name1   red       dolphin
2  klkwv   null    blahblahblahklkwv    name3   platinum  mantisshrimp
3  null    null    boop: qgl8. more&&   name2   cerulean  mountaingoat
4  null    null    hi it me             NaN     NaN       NaN
5  null    null    here are more words  NaN     NaN       NaN
6  y2kbc   null    words and stuff      name99  onyx      direwolf
7  null    null    so much text         NaN     NaN       NaN
8  null    null    have a nice cnwsaq5  name5   frost     gentoopenguin
9  null    null    null                 NaN     NaN       NaN
We can use str.extract to find the values in your text_bod column and extract them. After that, we use the extracted values as the key to merge df1 (your dfB) with dfA and bring the wanted columns together.
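# Build one alternation pattern from every value in dfA
s = ('|').join(dfA['value'])
# Extract the first matching value from each text_bod into a temporary key column
df1['value'] = df1['text_bod'].str.extract('({})'.format(s))
# Left-merge on the key to bring in name, color and animal, then drop the key
df1 = df1.merge(dfA, on='value', how='left').drop('value', axis=1)
print(df1)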
   value1  value2  text_bod             name   color     animal
0  NaN     tokgs   here are some tokgs  name4  fuchsia   tarantula
1  NaN     NaN     something es9bum     name1  red       dolphin
2  klkwv   NaN     blahblahblahklkwv    name3  platinum  mantisshrimp
3  NaN     NaN     boop: qgl8. more&&   name2  cerulean  mountaingoat
4  NaN     NaN     hi it me             NaN    NaN       NaN
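One caveat with joining the raw values into a pattern: a value containing a regex metacharacter (such as . or +) would be read as regex syntax rather than literal text. Below is a minimal sketch of a more defensive variant, assuming the values are meant to match literally. It escapes each value with re.escape and places longer values first, so that a longer value is preferred over a shorter one that is its prefix. The small frames are a sample of the data from the question, and the names values, pattern and result are only illustrative.

```python
import re
import pandas as pd

# Sample of the question's data (the real frames are far larger)
dfA = pd.DataFrame({
    "value": ["es9bum", "qgl8", "klkwv", "tokgs", "cnwsaq5"],
    "name": ["name1", "name2", "name3", "name4", "name5"],
    "color": ["red", "cerulean", "platinum", "fuchsia", "frost"],
    "animal": ["dolphin", "mountaingoat", "mantisshrimp",
               "tarantula", "gentoopenguin"],
})
df1 = pd.DataFrame({
    "text_bod": ["here are some tokgs", "something es9bum",
                 "blahblahblahklkwv", "boop: qgl8. more&&",
                 "hi it me", None],
})

# Escape every value so it matches literally; longest values go first
values = sorted(dfA["value"].dropna().unique(), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(v) for v in values) + ")"

# expand=False returns a Series, since the pattern has a single group
df1["value"] = df1["text_bod"].str.extract(pattern, expand=False)

# Left-merge keeps every row of df1; unmatched rows get NaN
result = df1.merge(dfA, on="value", how="left").drop(columns="value")
print(result)
```

Rows whose text_bod is missing or contains none of the values simply end up with NaN in the merged columns.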
If you have Python 3.6 or higher, we can use an f-string in the str.extract line, which makes the code a bit cleaner:
df1['value'] = df1['text_bod'].str.extract(f'({s})')
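Note that extracting from text_bod alone would miss a row like row 6 in the question, where the value (y2kbc) is recorded in value1 but does not appear in the text. The following is a possible sketch, assuming that a recorded value1 or value2 should take priority over the text whenever it is a value known to dfA. The y2kbc row of dfA is taken from the expected output in the question; the names known, from_v1, from_v2, from_text and out are hypothetical.

```python
import re
import pandas as pd

# Sample rows; the y2kbc entry mirrors the question's expected output
dfA = pd.DataFrame({
    "value": ["es9bum", "tokgs", "y2kbc"],
    "name": ["name1", "name4", "name99"],
    "color": ["red", "fuchsia", "onyx"],
    "animal": ["dolphin", "tarantula", "direwolf"],
})
dfB = pd.DataFrame({
    "value1": [None, None, "y2kbc", None],
    "value2": ["tokgs", None, None, None],
    "text_bod": ["here are some tokgs", "something es9bum",
                 "words and stuff", "hi it me"],
})

known = set(dfA["value"])
pattern = "(" + "|".join(
    re.escape(v) for v in sorted(known, key=len, reverse=True)
) + ")"

# Keep a recorded value only if dfA knows it; everything else becomes NaN
from_v1 = dfB["value1"].where(dfB["value1"].isin(known))
from_v2 = dfB["value2"].where(dfB["value2"].isin(known))
# Fall back to searching the free text
from_text = dfB["text_bod"].str.extract(pattern, expand=False)

# First non-missing candidate wins: value1, then value2, then the text
dfB["value"] = from_v1.combine_first(from_v2).combine_first(from_text)

out = dfB.merge(dfA, on="value", how="left").drop(columns="value")
print(out)
```

Checking value1 and value2 with isin is an exact-match lookup, so only rows where both are empty depend on the regex result.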