
PySpark: how to create an RDD of lists of unique strings from a given RDD of (id, [strings])

I am new to Spark. Suppose I have an RDD, RDD1, whose elements are tuples of the form (id, list[str]), such as:

(id1, ["okay","okay", "not Okay"])
(id2, ["okay","good","good","good1"])

Now I want to create another RDD, RDD2, from RDD1 that contains only the unique strings from each sublist, such as:

["okay", "not Okay"]
["okay", "good","good1"]

Could you please let me know how to perform this operation? I first flattened RDD1 and called the distinct() function, but that only gave me one flat list of unique strings across all records. What I really want is the unique strings within each list of the original RDD1.

Lastly, suppose I have a HashMap; can I turn it into an RDD? Thanks in advance.

You can simply use:

rdd1.map(lambda x: list(set(x[1])))
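A minimal sketch of what this map does, run on plain Python lists so it works without a Spark cluster (with Spark, the same lambda would be applied to each record of the RDD). Note that set() does not preserve the original order of the strings, so the elements of each result list may come back in a different order:

```python
# Sample data in the same (id, [strings]) shape as RDD1 in the question
rdd1_data = [
    ("id1", ["okay", "okay", "not Okay"]),
    ("id2", ["okay", "good", "good", "good1"]),
]

# The function passed to rdd1.map(): drop the id, deduplicate the strings
dedupe = lambda x: list(set(x[1]))

# Locally, map over the list; with Spark this would be rdd1.map(dedupe)
rdd2_data = [dedupe(pair) for pair in rdd1_data]
print(rdd2_data)
```

If the order of strings within each list matters, replace `list(set(x[1]))` with something order-preserving such as `list(dict.fromkeys(x[1]))`, which keeps the first occurrence of each string.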
