Search a dataframe using another dataframe or RDD

I have two DataFrames in Apache Spark.

The first DataFrame has the show number and description. The data looks like this:

show_no | descrip
a | this is mikey
b | here comes donald
c | mary and george go home
d | mary and george come to town

The second DataFrame has the characters:

characters
george
donald
mary
minnie

I need to search the show descriptions to find out which shows feature which characters.

The final output should look like this:

character | showscharacterisin
george | c,d
donald | b
mary | c,d
minnie | No show

These data sets are contrived and simple, but they express the search functionality I am trying to implement: I need to search the text of one DataFrame using the values from another DataFrame.

This would be easy to do with a UDF in SQL Server. I would loop through the show descriptions for each character and return the show number using a "contains" search on the description.

The problem is that I see no way to do this with a DataFrame.

1) I think you should break the first dataset down further so that show_no is mapped to each word in the description. For example, the first row could be broken down like this:

show_no | descrip
a | this
a | is
a | mikey

2) You can filter the stopwords out of this if needed.

3) After this you can join it with the characters DataFrame to get the final desired output.
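
The three steps above can be sketched in plain Python to show the data flow. This is only an illustration on the sample data from the question, not Spark code: the function names and the stopword list are made up for the example. In Spark the same shape would typically come from `split` plus `explode` for step 1, a filter for step 2, and a left join followed by a grouped aggregation for step 3.

```python
from collections import defaultdict

# Sample data copied from the question.
shows = [
    ("a", "this is mikey"),
    ("b", "here comes donald"),
    ("c", "mary and george go home"),
    ("d", "mary and george come to town"),
]
characters = ["george", "donald", "mary", "minnie"]

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"this", "is", "and", "to", "here"}


def explode_words(rows):
    """Step 1: map each show_no to every word of its description."""
    return [
        (show_no, word)
        for show_no, descrip in rows
        for word in descrip.lower().split()
    ]


def drop_stopwords(pairs, stopwords):
    """Step 2: optionally remove common words before the join."""
    return [(show_no, word) for show_no, word in pairs if word not in stopwords]


def shows_per_character(pairs, names):
    """Step 3: left-join characters to words and collect the show numbers."""
    by_word = defaultdict(list)
    for show_no, word in pairs:
        if show_no not in by_word[word]:
            by_word[word].append(show_no)
    # Characters with no matching word keep a row, as in a left join.
    return {
        name: ",".join(by_word[name]) if name in by_word else "No show"
        for name in names
    }


pairs = drop_stopwords(explode_words(shows), STOPWORDS)
result = shows_per_character(pairs, characters)

for name in characters:
    print(f"{name} | {result[name]}")
```

Note that this matches whole words only, so it behaves differently from a "contains" search when a character name appears inside a longer word.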

Hope this helps. Amit
