
PySpark: How to Split String Value in Paired RDD and Map with Key

Given

data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])

My desired output is

[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to map to the following intermediate output:

[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'), ...etc.]

I tried using

data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this ended up mapping the key to each letter of the string instead of to each word. Should flatMap, map, or a split function be used here?
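(For reference, iterating over a Python string yields its individual characters, which is why the key ended up paired with each letter; splitting first yields words. A plain-Python sketch of the difference, no Spark required:)

```python
pair = (1, 'winter is coming')

# Iterating over the string directly yields characters:
chars = [(pair[0], v) for v in pair[1]]
# chars starts with (1, 'w'), (1, 'i'), (1, 'n'), ...

# Splitting on whitespace first yields words:
words = [(pair[0], w) for w in pair[1].split()]
# words == [(1, 'winter'), (1, 'is'), (1, 'coming')]
```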

After mapping, I plan to reduce the paired RDDs that share a key, and then swap key and value, using

data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()

Is my thinking correct?

You can flatMap and create tuples in which the keys are reused and an entry is created for each word (obtained using split()):

data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()])

When collected, that outputs

[(1, 'winter'),
 (1, 'is'),
 (1, 'coming'),
 (2, 'ours'),
 (2, 'is'),
 (2, 'the'),
 (2, 'fury'),
 (3, 'the'),
 (3, 'old'),
 (3, 'the'),
 (3, 'true'),
 (3, 'the'),
 (3, 'brave')]
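To go from these (key, word) pairs to the desired inverted index, you would invert each pair and accumulate the document ids per word; in PySpark that would be roughly `pairs.map(lambda kv: (kv[1], [kv[0]])).reduceByKey(lambda a, b: a + b)` (note that the desired output lists each id once per word, so repeated words like `'the'` in sentence 3 would need a `.distinct()` on the pairs first). A plain-Python sketch of that logic, using the collected output above:

```python
pairs = [(1, 'winter'), (1, 'is'), (1, 'coming'),
         (2, 'ours'), (2, 'is'), (2, 'the'), (2, 'fury'),
         (3, 'the'), (3, 'old'), (3, 'the'), (3, 'true'),
         (3, 'the'), (3, 'brave')]

index = {}
for doc_id, word in pairs:
    # invert to (word, doc_id) and accumulate unique ids per word
    ids = index.setdefault(word, [])
    if doc_id not in ids:   # plays the role of .distinct()
        ids.append(doc_id)

# index['is'] == [1, 2], index['the'] == [2, 3], index['winter'] == [1]
```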
