I have two DataFrames in Apache Spark.
The first DataFrame has the show number and description. The data looks like:
show_no | descrip
a | this is mikey
b | here comes donald
c | mary and george go home
d | mary and george come to town
The second DataFrame has the characters:
characters
george
donald
mary
minnie
I need to search the show descriptions to find out which shows feature which characters.
The final output should look like:
character | showscharacterisin
george | c,d
donald | b
mary | c,d
minnie | No show
These data sets are contrived and simple, but they express the search functionality I am trying to implement: I need to search the text in one DataFrame using the values from another DataFrame.
This would be easy to do with a UDF in SQL Server. I would loop through the show descriptions for each character and return the show_no wherever a "contains" search on the description matches.
The problem is that I see no way to do this using a DataFrame.
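For reference, the row-by-row "contains" search described above amounts to roughly the following. This is a plain-Python sketch of the logic only, not Spark code; the helper name `shows_featuring` is made up for illustration, and the data is the sample from this question.

```python
# Sample data copied from the question.
shows = [
    ("a", "this is mikey"),
    ("b", "here comes donald"),
    ("c", "mary and george go home"),
    ("d", "mary and george come to town"),
]
characters = ["george", "donald", "mary", "minnie"]


def shows_featuring(character, shows):
    """Return the show numbers whose description contains the character's name.

    Hypothetical helper: a plain substring ("contains") test per description.
    """
    return [show_no for show_no, descrip in shows if character in descrip]


# One pass over all descriptions for every character.
matches = {c: shows_featuring(c, shows) for c in characters}

for character, show_nos in matches.items():
    print(character, "|", ",".join(show_nos) if show_nos else "No show")
```

Because this is a substring test, a name such as "mary" would also match inside a longer word like "rosemary".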
1) I think you should further break down the first dataset so that show_no is mapped to each word in the description. For example, the first row could be broken down like this:
show_no | descrip
a | this
a | is
a | mikey
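2) You can filter the stopwords out of this if needed.
3) After this you can join it with the characters DataFrame to get the final desired output. To keep characters with no match (such as minnie), join from the characters side with a left join, then group the show numbers by character.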
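The three steps above can be sketched in plain Python to show the data flow. This is only an illustration of the logic, not Spark code: the stopword list is made up, and the data is the sample from the question. In Spark, the same steps would likely map to `split` plus `explode` for step 1, a filter for step 2, and a left join followed by a group-by with `collect_list` for step 3, but that mapping is an assumption rather than something tested here.

```python
from collections import defaultdict

# Sample data copied from the question.
shows = [
    ("a", "this is mikey"),
    ("b", "here comes donald"),
    ("c", "mary and george go home"),
    ("d", "mary and george come to town"),
]
characters = ["george", "donald", "mary", "minnie"]

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"this", "is", "and", "to", "here"}

# Step 1: one (show_no, word) row per word in each description.
word_rows = [
    (show_no, word)
    for show_no, descrip in shows
    for word in descrip.lower().split()
]

# Step 2: optionally drop stopwords.
word_rows = [(show_no, word) for show_no, word in word_rows if word not in STOPWORDS]

# Step 3: group show numbers by word, then "left join" from the characters
# side so that characters with no matching word are kept.
shows_by_word = defaultdict(list)
for show_no, word in word_rows:
    if show_no not in shows_by_word[word]:
        shows_by_word[word].append(show_no)

result = {
    c: ",".join(shows_by_word[c]) if c in shows_by_word else "No show"
    for c in characters
}

for character, show_list in result.items():
    print(character, "|", show_list)
```

Unlike a "contains" search, this matches whole words only, so a multi-word character name would need extra handling.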
Hope this helps. Amit