PySpark: How to Split String Value in Paired RDD and Map with Key
Given
data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])
My desired output is

[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to map to the following intermediate output

[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'), ...etc.]
I tried using

data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this ended up pairing the key with each letter of the string instead of each word. Should flatMap, map, or split be used here?
After mapping, I plan to reduce the paired RDDs that share a key, then swap key and value, using
data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()
Is my thinking correct?
You can flatMap and create tuples where the key is reused and an entry is created for each word (obtained with split()):
data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()])
When collected, that outputs
[(1, 'winter'),
(1, 'is'),
(1, 'coming'),
(2, 'ours'),
(2, 'is'),
(2, 'the'),
(2, 'fury'),
(3, 'the'),
(3, 'old'),
(3, 'the'),
(3, 'true'),
(3, 'the'),
(3, 'brave')]
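To go from that intermediate output to the desired word → [ids] pairs, one point worth noting about the reduceByKey-then-invert plan in the question: the inversion has to happen *before* the reduce, so that words (not sentence ids) become the keys being reduced on. A minimal sketch of that logic in plain Python (no Spark session needed, so it is easy to verify; the deduplication of repeated ids, e.g. 'the' appearing three times in sentence 3, is an assumption based on the desired output shown above):

```python
from itertools import groupby

data = [(1, 'winter is coming'),
        (2, 'ours is the fury'),
        (3, 'the old the true the brave')]

# flatMap step: one (id, word) pair per word in each sentence
pairs = [(k, w) for k, sentence in data for w in sentence.split()]

# invert first, so words become the keys: (word, [id])
inverted = [(w, [k]) for k, w in pairs]

# reduceByKey equivalent: concatenate the id lists per word,
# deduplicating repeated ids within one sentence
inverted.sort(key=lambda p: p[0])
result = [(w, sorted({i for _, ids in grp for i in ids}))
          for w, grp in groupby(inverted, key=lambda p: p[0])]

# In PySpark the same pipeline would be roughly (distinct() handles the
# duplicate (3, 'the') pairs before the reduce):
# data.flatMap(lambda pair: [(pair[0], w) for w in pair[1].split()]) \
#     .distinct() \
#     .map(lambda x: (x[1], [x[0]])) \
#     .reduceByKey(lambda a, b: a + b)
```

The plain-Python version sorts and groups to mimic the shuffle; the result contains pairs such as ('is', [1, 2]) and ('the', [2, 3]), matching the desired output.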