
How can I have a nested transformation in PySpark?

Here is an example of my data:

data1 = [[ 'red blue hi you red' ],
         [ 'ball green ball go four ball' ],
         [ 'nice red start nice' ],
         [ 'ball no kill tree go go' ]]

From the previous data, I obtain the following:

data2 = 
[[[ 'red', 2 ], [ 'blue', 1 ], [ 'hi', 1 ], [ 'you', 1 ]],
[[ 'green', 1 ], [ 'go', 1 ], [ 'four', 1 ], [ 'ball', 3 ]],
[[ 'red', 1 ], [ 'start', 1 ], [ 'nice', 2 ]],
[[ 'ball', 1 ], [ 'no', 1 ], [ 'kill', 1 ], [ 'tree', 1 ], [ 'go', 2 ]]]
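
For reference, data2 is just a per-element word count over data1. A plain-Python sketch that would produce it (the pairs may come out in a different order) looks like this; the Counter import is my addition, not part of the original setup:

from collections import Counter

# Count how often each word occurs in every element of data1.
data2 = [[[word, n] for word, n in Counter(line[0].split()).items()]
         for line in data1]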

Note: the RDD data2 contains nested lists holding the number of times each word is mentioned in the corresponding element of the RDD data1. What I want is to apply the following code:

data3 = data2.map(lambda x: [data1.filter(lambda z: y[0] in z) for y in x])

The output should be the lists (the elements of data1) that contain the given word. For example: if the word 'red' is passed to the loop and then the filter, it should give me the 2 lists from data1 that contain it:

[ 'red blue hi you red' ]
[ 'nice red start nice' ]

But it keeps giving the following error:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I tried another way, which is defining a function and then passing it inside the map transformation, like:

def func(y):
    return data1.filter(lambda z: y[0] in z)

data3 = data2.map(lambda x: [func(y) for y in x])

But it's still the same error; apparently trying to be smart doesn't work :3 What can I do? Thanks in advance.

The answer is short and rather definitive: you cannot. Nested operations on distributed data structures aren't, and most likely won't be, supported in Spark. Depending on the context, you can replace them with a join, or with a map over a local (optionally broadcast) data structure.
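
For example, since your data1 is small, a broadcast-based version of your data3 could look like the sketch below. It assumes data1 and data2 have been parallelized into RDDs; the names sc, rdd1, rdd2 and lines_b are illustrative:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)

# Collect the (small) data1 to the driver and broadcast it, so each task
# scans a plain Python list instead of calling into a nested RDD.
lines_b = sc.broadcast(rdd1.collect())

# For every [word, count] pair, keep the data1 lines whose words include it.
data3 = rdd2.map(lambda counts: [
    [line for line in lines_b.value if word in line[0].split()]
    for word, _ in counts
])

For the first element, the entry for 'red' comes out as [['red blue hi you red'], ['nice red start nice']], matching the output you expected. If data1 were too large to collect, a join would be the way to go instead: flatMap data1 into (word, line) pairs, flatMap data2 into (word, ...) pairs, and join on the word key.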
