How can I have a nested transformation in PySpark?

Here is an example of my data:

data1 = [[ 'red blue hi you red' ],
     [ 'ball green ball go four ball'],
     [ 'nice red start nice' ],
     [ 'ball no kill tree go go' ]]

I want to obtain the following from the data above:

data2 = 
[[[ 'red', 2 ], [ 'blue', 1 ], [ 'hi', 1 ], [ 'you', 1 ]],
[[ 'green', 1 ], [ 'go', 1 ], [ 'four', 1 ], [ 'ball', 3 ]],
[[ 'red', 1 ], [ 'start', 1 ], [ 'nice', 2 ]],
[[ 'ball', 1 ], [ 'no', 1 ], [ 'kill', 1 ], [ 'tree', 1 ], [ 'go', 2 ]]]

Note: data2 is an RDD of nested lists holding the number of times each word appears in the corresponding element of the RDD data1. What I want is to apply the following code:

data3 = data2.map(lambda x: [data1.filter(lambda z: y[0] in z) for y in x])

The output should be the elements of data1 that contain the given word. For example, if the word 'red' is passed to the filter inside the loop, it should give me two lists from data1:

[ 'red blue hi you red' ]
[ 'nice red start nice' ]

But it keeps giving the following error:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I tried another way, which is to define a function and then pass it inside the map transformation, like:

def func(y):
    return data1.filter(lambda z: y[0] in z)
data3 = data2.map(lambda x: [ func(y) for y in x])

But I still get the same error; apparently trying to be smart doesn't work :3 What can I do? Thanks in advance.

The answer is short and rather definitive: you cannot. Nested operations on distributed data structures are not supported in Spark and most likely never will be. Depending on the context, you can replace them with a join, or with a map over a local (optionally broadcast) data structure.
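As a rough illustration of the "map with a local, broadcast data structure" option, here is a minimal sketch. It assumes data1 is small enough to collect on the driver, and that "contains the word" means a whole-word match after splitting on whitespace. The helper names build_index, lookup and lines_per_word are made up for this example and are not part of Spark.

```python
from collections import defaultdict


def build_index(lines):
    """Map each word to the lines that contain it, keeping line order."""
    index = defaultdict(list)
    for line in lines:
        # set() so that a line is listed once per word, even if the word repeats
        for word in set(line.split()):
            index[word].append(line)
    return dict(index)


def lookup(counts, index):
    """Turn a list of [word, count] pairs into one list of matching lines per word."""
    return [index.get(word, []) for word, _count in counts]


def lines_per_word(sc, data1, data2):
    """Hypothetical driver-side helper: sc is a SparkContext, data1/data2 are RDDs.

    Assumes every element of data1 is a one-element list such as ['red blue hi you red'].
    """
    # Bring data1 to the driver once; no RDD is referenced inside the map below.
    lines = [row[0] for row in data1.collect()]
    index_bc = sc.broadcast(build_index(lines))
    return data2.map(lambda counts: lookup(counts, index_bc.value))
```

With the data from the question, calling lines_per_word(sc, data1, data2).collect() should give, for each element of data2, one list of matching lines per word. If data1 is too large to collect, the join route is the alternative: flatMap both RDDs into (word, ...) pairs and join them on the word.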
