
PySpark: how to create an RDD of lists of unique strings from a given RDD of (id, [strings])

I am new to Spark. Suppose I have an RDD, RDD1, whose elements are tuples of the form (id, list[str]), such as:

(id1, ["okay","okay", "not Okay"])
(id2, ["okay","good","good","good1"])

Now I want to create another RDD, RDD2, from RDD1 that contains only the unique strings from each sublist, such as:

["okay", "not Okay"]
["okay", "good","good1"]

Could you please let me know how to perform this operation? I first flattened RDD1 and called the distinct() function, but that only gave me one flat list of unique strings across all records. What I really want is the unique strings within each list of the original RDD1.

Lastly, suppose I have a HashMap; can I turn it into an RDD? Thanks in advance.

You can simply use:

rdd1.map(lambda x: list(set(x[1])))
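A minimal sketch of what this map does, run on plain Python lists so it works without a Spark cluster (with Spark, the same lambda would be applied to each record of the RDD). Note that set() does not preserve the original order of the strings, so the elements of each result list may come back in a different order:

```python
# Sample data in the same (id, [strings]) shape as RDD1 in the question
rdd1_data = [
    ("id1", ["okay", "okay", "not Okay"]),
    ("id2", ["okay", "good", "good", "good1"]),
]

# The function passed to rdd1.map(): drop the id, deduplicate the strings
dedupe = lambda x: list(set(x[1]))

# Locally, map over the list; with Spark this would be rdd1.map(dedupe)
rdd2_data = [dedupe(pair) for pair in rdd1_data]
print(rdd2_data)
```

If the order of strings within each list matters, replace `list(set(x[1]))` with something order-preserving such as `list(dict.fromkeys(x[1]))`, which keeps the first occurrence of each string.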
