
Modify and filter a PySpark RDD at the same time

I have a PySpark RDD in which each element is of the form (key, val), and each element takes one of the following two forms:

elm1 = ((1, 2), ((3, 4), (5, 6)))  # key = (1,2), rest is val
elm2 = ((1, 2), ((3, 4), None))

Now, I need to do the following.

  1. Detect the elements where the second part of the val is None (as in elm2) and extract them.
  2. Flatten them as follows, replacing None with a tuple of empty strings:

     elm = (1, 2, 3, 4, ('', '')) 

To do the above two steps in PySpark, I do:

elm = elm.filter(detectNone)  # keep elements where x[-1][1] is None
elm = elm.map(formatElm)      # replace None with a tuple of empty strings and flatten the tuple

In reality, the test x[-1][1] == None is a little more complex, and a more complex data structure is used in place of the tuple of empty strings.
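For concreteness, a minimal sketch of what detectNone and formatElm might look like (hypothetical implementations; the question only describes their behavior):

def detectNone(x):
    # True when the second part of the val is None, as in elm2
    return x[-1][1] is None


def formatElm(x):
    # flatten ((k1, k2), ((v1, v2), None)) into (k1, k2, v1, v2, ('', ''))
    (k1, k2), ((v1, v2), _) = x
    return (k1, k2, v1, v2, ("", ""))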

Question: Is there any way to speed up these operations? I think combining the two operations into one may help, but I don't know how to do that.

"I think combining two operations into one may help"

It won't: Spark pipelines narrow transformations such as filter and map into a single stage, so the two calls already run in one pass over the data. But if you really insist on doing this, use flatMap:

# `sc` is the SparkContext, as provided by the pyspark shell
rdd = sc.parallelize([((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))])


def detect_and_format(row):
    # unpack ((k1, k2), ((v1, v2), z)); keep and flatten the row only when z is None
    x, (y, z) = row
    return [x + y + (("", ""),)] if z is None else []


rdd.flatMap(detect_and_format).collect()
# [(1, 2, 3, 4, ('', ''))]
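Since the real test and the real replacement value are more involved than z is None and ("", ""), the same pattern generalizes by factoring both out. A minimal sketch, with hypothetical is_missing and make_placeholder helpers standing in for the real logic:

def is_missing(z):
    # hypothetical stand-in for the real, more complex test
    return z is None


def make_placeholder():
    # hypothetical stand-in for the real, more complex replacement structure
    return ("", "")


def detect_and_format(row):
    x, (y, z) = row
    return [x + y + (make_placeholder(),)] if is_missing(z) else []

Returning a one-element list for a match and an empty list otherwise is what lets a single flatMap do the work of filter followed by map.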
