How can I have a nested transformation in PySpark?
Here is an example of my data:
data1 = [['red blue hi you red'],
         ['ball green ball go four ball'],
         ['nice red start nice'],
         ['ball no kill tree go go']]
The following is obtained from the previous data:
data2 =
[[['red', 2], ['blue', 1], ['hi', 1], ['you', 1]],
 [['green', 1], ['go', 1], ['four', 1], ['ball', 3]],
 [['red', 1], ['start', 1], ['nice', 2]],
 [['ball', 1], ['no', 1], ['kill', 1], ['tree', 1], ['go', 2]]]
Note: the RDD data2 has nested lists containing the number of times each word appears in the corresponding element of the RDD data1. What I want is to apply the following code:
data3 = data2.map(lambda x: [data1.filter(lambda z: y[0] in z) for y in x])
The output should be the lists, or the elements, from data1 which contain the given word. For example: if the word 'red' is passed to the loop and then to the filter, it should give me the 2 lists from data1 which are:
[ 'red blue hi you red' ]
[ 'nice red start nice' ]
But it keeps giving the following error:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I tried another way, which is defining a function and then passing it inside the map transformation, like:
def func(y):
    return data1.filter(lambda z: y[0] in z)

data3 = data2.map(lambda x: [func(y) for y in x])
But it's still the same error; apparently trying to be smart doesn't work tho :3 What can I do? Thanks in advance.
The answer is short and rather definitive: you cannot. Nested operations on distributed data structures aren't, and most likely won't be, supported in Spark. Depending on the context, you can replace these with a join, or with a map over a local (optionally broadcast) data structure.
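A minimal sketch of the second option (a map over a local, optionally broadcast, data structure), shown here with plain Python lists standing in for the RDDs so it runs anywhere. In PySpark the equivalent would be collecting data1 once on the driver, e.g. local1 = sc.broadcast(data1.collect()), and then reading local1.value inside the map instead of filtering a second RDD; the name local1 is illustrative, not from the question.

```python
from collections import Counter

# data1 as flat strings; in Spark this would be an RDD of lines.
data1 = ['red blue hi you red',
         'ball green ball go four ball',
         'nice red start nice',
         'ball no kill tree go go']

# Per-line word counts: the shape of data2 in the question.
data2 = [list(Counter(line.split()).items()) for line in data1]

# Broadcast approach: use the local copy inside the map, so each word
# becomes a plain lookup instead of a nested RDD filter.
local1 = data1  # in PySpark: local1 = sc.broadcast(data1.collect()).value
data3 = [[[line for line in local1 if word in line.split()]
          for word, _count in counts]
         for counts in data2]

# Example: the elements of data1 that contain the word 'red'.
red_lists = [line for line in local1 if 'red' in line.split()]
print(red_lists)  # ['red blue hi you red', 'nice red start nice']
```

The same idea carries over to the join option: flatten data1 into (word, line) pairs and data2 into (word, count) pairs, then join on the word, which keeps everything as a single level of RDD operations.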