I have a PySpark RDD in which each element is of the form (key, val)
and takes one of the following two forms:
elm1 = ((1, 2), ((3, 4), (5, 6))) # key = (1,2), rest is val
elm2 = ((1, 2), ((3, 4), None))
Now, I need to do the following: find the elements in which the second part of val
is None (as in elm2
) and extract them, then flatten each of them and replace None
with a tuple of empty strings:
elm = (1, 2, 3, 4, ('', ''))
To do the above two steps in PySpark, I do:
elm = elm.filter(lambda x: detectNone(x)) # keeps only elements where x[-1][1] is None
elm = elm.map(formatElm) # formatElm replaces None with a tuple of empty strings and flattens the tuple
In reality, the test x[-1][1] is None
is a little more complex, and a more complex data structure is introduced in place of the tuple of empty strings.
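For illustration, a minimal sketch of the two helpers, assuming the simple None test described above (the names come from the question; the real test and replacement structure are more involved):

```python
def detectNone(x):
    # True when the second part of val is None, as in elm2
    return x[-1][1] is None

def formatElm(x):
    # flatten ((k1, k2), ((a, b), None)) into (k1, k2, a, b, ('', ''))
    (k1, k2), ((a, b), _) = x
    return (k1, k2, a, b, ("", ""))

elm2 = ((1, 2), ((3, 4), None))
detectNone(elm2)  # True
formatElm(elm2)   # (1, 2, 3, 4, ('', ''))
```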
Question: Is there any way to speed up these operations? I think combining the two operations into one may help, but I don't know how to do that.
I think combining two operations into one may help,
It won't. But if you really insist on doing this, you can use flatMap:
rdd = sc.parallelize([((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))])

def detect_and_format(row):
    x, (y, z) = row
    # emit a single flattened tuple when z is None, nothing otherwise
    return [x + y + (("", ""), )] if z is None else []

rdd.flatMap(detect_and_format).collect()
# [(1, 2, 3, 4, ('', ''))]
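Since flatMap simply concatenates the lists returned for each row, the combined function can be sanity-checked without a SparkContext, for example with itertools.chain:

```python
from itertools import chain

def detect_and_format(row):
    x, (y, z) = row
    # one-element list when z is None, empty list otherwise
    return [x + y + (("", ""), )] if z is None else []

data = [((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))]
result = list(chain.from_iterable(detect_and_format(r) for r in data))
# result == [(1, 2, 3, 4, ('', ''))]
```

This mirrors what Spark does per partition: rows that fail the test contribute nothing, so the filter and the map happen in one pass.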