
How can I have a nested transformation in PySpark?

Here is an example of my data:

data1 = [[ 'red blue hi you red' ],
         [ 'ball green ball go four ball' ],
         [ 'nice red start nice' ],
         [ 'ball no kill tree go go' ]]

From the previous data, I obtain the following:

data2 = 
[[[ 'red', 2 ], [ 'blue', 1 ], [ 'hi', 1 ], [ 'you', 1 ]],
[[ 'green', 1 ], [ 'go', 1 ], [ 'four', 1 ], [ 'ball', 3 ]],
[[ 'red', 1 ], [ 'start', 1 ], [ 'nice', 2 ]],
[[ 'ball', 1 ], [ 'no', 1 ], [ 'kill', 1 ], [ 'tree', 1 ], [ 'go', 2 ]]]
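
For reference, data2 is just a per-element word count over data1. A plain-Python sketch that would produce it (the pairs may come out in a different order) looks like this; the Counter import is my addition, not part of the original setup:

from collections import Counter

# Count how often each word occurs in every element of data1.
data2 = [[[word, n] for word, n in Counter(line[0].split()).items()]
         for line in data1]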

Note: the RDD data2 contains nested lists holding the number of times each word is mentioned in the corresponding element of the RDD data1. What I want is to apply the following code:

data3 = data2.map(lambda x: [data1.filter(lambda z: y[0] in z) for y in x])

The output should be the lists (the elements of data1) that contain the given word. For example: if the word 'red' is passed to the loop and then the filter, it should give me the 2 lists from data1 that contain it:

[ 'red blue hi you red' ]
[ 'nice red start nice' ]

But it keeps giving the following error:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I tried another way, which is defining a function and then passing it inside the map transformation, like:

def func(y):
    return data1.filter(lambda z: y[0] in z)

data3 = data2.map(lambda x: [func(y) for y in x])

But it's still the same error; apparently trying to be smart doesn't work :3 What can I do? Thanks in advance.

The answer is short and rather definitive: you cannot. Nested operations on distributed data structures aren't, and most likely won't be, supported in Spark. Depending on the context, you can replace them with a join, or with a map over a local (optionally broadcast) data structure.
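
For example, since your data1 is small, a broadcast-based version of your data3 could look like the sketch below. It assumes data1 and data2 have been parallelized into RDDs; the names sc, rdd1, rdd2 and lines_b are illustrative:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)

# Collect the (small) data1 to the driver and broadcast it, so each task
# scans a plain Python list instead of calling into a nested RDD.
lines_b = sc.broadcast(rdd1.collect())

# For every [word, count] pair, keep the data1 lines whose words include it.
data3 = rdd2.map(lambda counts: [
    [line for line in lines_b.value if word in line[0].split()]
    for word, _ in counts
])

For the first element, the entry for 'red' comes out as [['red blue hi you red'], ['nice red start nice']], matching the output you expected. If data1 were too large to collect, a join would be the way to go instead: flatMap data1 into (word, line) pairs, flatMap data2 into (word, ...) pairs, and join on the word key.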
