Search a dataframe using another dataframe or RDD
I have two dataframes in Apache Spark.
df 1 has the show number and description. The data looks like this:
show_no | descrip
a | this is mikey
b | here comes donald
c | mary and george go home
d | mary and george come to town
The second dataframe has the characters:
characters
george
donald
mary
minnie
I need to search the show descriptions in the first dataframe to find out which shows feature which characters.
The final output should look like this:
character | showscharacterisin
george | c,d
donald | b
mary | c,d
minnie | No show
These data sets are contrived and simple, but they express the search functionality I am trying to implement. I basically need to search the text of one dataframe using the values from another dataframe.
This would be easy to do in a UDF inside SQL Server: I would loop through the show descriptions for each character and return the show_no values using a "contains" search on the description.
The problem is that I see no way to do this using a dataframe.
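To make the intended behaviour concrete, the loop-and-contains search described above could be sketched in plain Python as follows. This is not Spark code, just a minimal illustration on the sample data; the helper name `shows_for` and the variable names are made up for the example.

```python
# Plain-Python sketch of the "loop and contains" search (not Spark code).
# The sample data mirrors the two dataframes shown in the question.
shows = [
    ("a", "this is mikey"),
    ("b", "here comes donald"),
    ("c", "mary and george go home"),
    ("d", "mary and george come to town"),
]
characters = ["george", "donald", "mary", "minnie"]


def shows_for(character, shows):
    """Return the show numbers whose description contains the character name."""
    # Hypothetical helper: a plain substring ("contains") test per description.
    return [show_no for show_no, descrip in shows if character in descrip]


result = {}
for character in characters:
    matches = shows_for(character, shows)
    # Fall back to "No show" when no description mentions the character.
    result[character] = ",".join(matches) if matches else "No show"

for character, show_list in result.items():
    print(f"{character} | {show_list}")
```

Note that a substring test also matches inside longer words (for example, "mary" would match a description containing "rosemary"), which may or may not be what is wanted.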
1) I think you should further break down the first dataset so that show_no is mapped to each word in the description. For example, the first row could be broken down like this:
show_no | descrip
a | this
a | is
a | mikey
2) You can filter out the stopwords from this if needed.
3) After this, you can join it with "characters" to get the final desired output.
Hope this helps.
Amit
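The three steps above can be sketched in plain Python to show the logic on the sample data. This is only an illustration under stated assumptions, not Spark code: the stopword set and all variable names are invented for the example. In PySpark, step 1 would roughly correspond to `split` and `explode` from `pyspark.sql.functions`, and step 3 to a left join from the characters dataframe followed by a `groupBy` with `collect_list`.

```python
from collections import defaultdict

# Sample data mirroring the two dataframes in the question.
shows = [
    ("a", "this is mikey"),
    ("b", "here comes donald"),
    ("c", "mary and george go home"),
    ("d", "mary and george come to town"),
]
characters = ["george", "donald", "mary", "minnie"]

# Assumption: a tiny, hand-picked stopword set, used only for illustration.
STOPWORDS = {"this", "is", "and", "here", "to"}

# Step 1: map show_no to each word in the description (one pair per word).
pairs = [(show_no, word) for show_no, descrip in shows for word in descrip.split()]

# Step 2: optionally filter out the stopwords.
pairs = [(show_no, word) for show_no, word in pairs if word not in STOPWORDS]

# Step 3: join with the characters on the word, keeping characters with no match.
shows_by_word = defaultdict(list)
for show_no, word in pairs:
    if show_no not in shows_by_word[word]:
        shows_by_word[word].append(show_no)

output = {}
for character in characters:
    matched = shows_by_word.get(character)
    # A character that never appears as a word gets the "No show" placeholder.
    output[character] = ",".join(matched) if matched else "No show"

for character, show_list in output.items():
    print(f"{character} | {show_list}")
```

Because the join is on whole words, "mary" would not match "rosemary", but a multi-word character name would not match either; a substring-based join condition would be needed for that case.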