Combining two rows in a Python Spark RDD

Question

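I have a small problem while working with a Python Spark RDD. My RDD looks like

old_rdd = [(A1, Vector(V1)), (A2, Vector(V2)), (A3, Vector(V3)), ....].

I want to use flatMap to get a new RDD such as:

new_rdd = [((A1, A2), (V1, V2)), ((A1, A3), (V1, V3))] and so on.

The problem is that flatMap flattens the tuples, so I end up with something like [(A1, V1, A2, V2)...]. Do you have any other suggestion, with or without flatMap()? Thanks in advance.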

Answer 1

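This is related to explicit ordering in the Cartesian transformation in Scala Spark. I'm assuming that you have already removed duplicates from the RDD and that the ids follow some simple pattern that can be parsed and compared. For simplicity, I'll use lists instead of Vectors.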

old_rdd = sc.parallelize([(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])])

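# cartesian() yields every ordered pair of rows, including each row paired with itself.
# Keeping only the pairs whose first id is smaller than the second leaves each combination once.
combined_rdd = old_rdd.cartesian(old_rdd)
# Index into the pair instead of unpacking in the lambda signature (tuple parameters are Python 2 only).
combinations = combined_rdd.filter(lambda pair: pair[0][0] < pair[1][0])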

combinations.collect()

#  The output will be...
# -----------------------------
# [((1, [1, -2]), (2, [5, 7])),
#  ((1, [1, -2]), (3, [8, 23])),
#  ((1, [1, -2]), (4, [-1, 90])),
#  ((2, [5, 7]), (3, [8, 23])),
#  ((2, [5, 7]), (4, [-1, 90])),
#  ((3, [8, 23]), (4, [-1, 90]))]

# Reshape each pair into ((id1, id2), (values1, values2))
combinations = combinations.map(lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1], pair[1][1]))).collect()

# The output will be...
# ----------------------
# [((1, 2), ([1, -2], [5, 7])),
# ((1, 3), ([1, -2], [8, 23])),
# ((1, 4), ([1, -2], [-1, 90])),
# ((2, 3), ([5, 7], [8, 23])),
# ((2, 4), ([5, 7], [-1, 90])),
# ((3, 4), ([8, 23], [-1, 90]))]
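
As a cross-check that does not need a SparkContext, the same pairing can be sketched in plain Python with `itertools.combinations`. This is only a local reference for small samples, not a replacement for the RDD pipeline; the names `rows` and `pair_rows` are made up for this illustration, and it assumes the ids are unique and comparable, as in the answer above.

```python
from itertools import combinations

# Same sample data as the answer's old_rdd, held in a plain list (no Spark needed).
rows = [(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])]

def pair_rows(rows):
    """Return ((id1, id2), (values1, values2)) for every unordered pair of rows.

    Hypothetical helper: sorts by id first so that id1 < id2 in each pair,
    mirroring the filter on the cartesian product.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    return [((a[0], b[0]), (a[1], b[1])) for a, b in combinations(ordered, 2)]

expected = pair_rows(rows)
for item in expected:
    print(item)
```

Comparing this list with the sorted result collected from Spark is one way to check the pipeline on a small sample without depending on the order in which Spark returns the pairs.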

Solution 1
1, accepted, 2015-11-13 20:08:13
