
Modify and filter a PySpark RDD at the same time

I have a PySpark RDD in which each element is of the form (key, val), and each element takes one of the following two forms:

elm1 = ((1, 2), ((3, 4), (5, 6)))  # key = (1,2), rest is val
elm2 = ((1, 2), ((3, 4), None))

Now, I need to do the following.

  1. Detect the elements where the second part of the val is None (as in elm2) and extract them.
  2. Flatten them as follows, replacing None with a tuple of empty strings:

     elm = (1, 2, 3, 4, ('', '')) 

To do the above two steps in PySpark, I do:

elm = elm.filter(detectNone)  # keep elements where x[-1][1] is None
elm = elm.map(formatElm)      # replace None with a tuple of empty strings and flatten the tuple

In reality, the test x[-1][1] == None is a little more complex, and a more complex data structure is used in place of the tuple of empty strings.
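For concreteness, a minimal sketch of what detectNone and formatElm might look like (hypothetical implementations; the question only describes their behavior):

def detectNone(x):
    # True when the second part of the val is None, as in elm2
    return x[-1][1] is None


def formatElm(x):
    # flatten ((k1, k2), ((v1, v2), None)) into (k1, k2, v1, v2, ('', ''))
    (k1, k2), ((v1, v2), _) = x
    return (k1, k2, v1, v2, ("", ""))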

Question: Is there any way to speed up these operations? I think combining the two operations into one may help, but I don't know how to do that.

"I think combining two operations into one may help"

It won't: Spark pipelines narrow transformations such as filter and map into a single stage, so the two calls already run in one pass over the data. But if you really insist on doing this, use flatMap:

# `sc` is the SparkContext, as provided by the pyspark shell
rdd = sc.parallelize([((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))])


def detect_and_format(row):
    # unpack ((k1, k2), ((v1, v2), z)); keep and flatten the row only when z is None
    x, (y, z) = row
    return [x + y + (("", ""),)] if z is None else []


rdd.flatMap(detect_and_format).collect()
# [(1, 2, 3, 4, ('', ''))]
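Since the real test and the real replacement value are more involved than z is None and ("", ""), the same pattern generalizes by factoring both out. A minimal sketch, with hypothetical is_missing and make_placeholder helpers standing in for the real logic:

def is_missing(z):
    # hypothetical stand-in for the real, more complex test
    return z is None


def make_placeholder():
    # hypothetical stand-in for the real, more complex replacement structure
    return ("", "")


def detect_and_format(row):
    x, (y, z) = row
    return [x + y + (make_placeholder(),)] if is_missing(z) else []

Returning a one-element list for a match and an empty list otherwise is what lets a single flatMap do the work of filter followed by map.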
