
Spark in Python, working with tuples: how can I merge two tuples after joining two RDDs?

I am fairly new to Spark and to Spark development.

I have two RDDs that I combine with a join; the result of the join looks like this:

(u'10611', ((u'Laura', u'Mcgee'), (u'66821', u'COMPLETE')))
(u'4026', ((u'Mary', u'Smith'), (u'3237', u'COMPLETE')))
(u'4026', ((u'Mary', u'Smith'), (u'4847', u'CLOSED')))

As you can see, each record has a key and two tuples. I want to merge the two tuples so that each record is a key and a single tuple, like this:

(u'10611', (u'Laura', u'Mcgee', u'66821', u'COMPLETE'))
(u'4026', (u'Mary', u'Smith', u'3237', u'COMPLETE'))
(u'4026', (u'Mary', u'Smith', u'4847', u'CLOSED'))

Also, how can I format this as tab-delimited text before calling saveAsTextFile? For example:

10611   Laura   Mcgee   66821   COMPLETE
4026    Mary    Smith   3237    COMPLETE
4026    Mary    Smith   4847    CLOSED

I have something like this, but I'm not sure how to access the fields inside the nested tuple:

.map(lambda x: "%s\t%s\t%s\t%s" %(x[0], x[1], x[2], x[3]))

Assuming your data is consistently formatted, you can merge the tuples with the + operator, which concatenates tuples:

>>> weird = (u'10611', ((u'Laura', u'Mcgee'), (u'66821', u'COMPLETE')))
>>> weirdMerged = (weird[0], (weird[1][0]+weird[1][1]))
>>> weirdMerged
(u'10611', (u'Laura', u'Mcgee', u'66821', u'COMPLETE'))

Writing it out as text is straightforward, although the nested structure makes it slightly awkward. Your lambda is on the right track, but you could also prepend the key to the merged tuple and join everything with tabs:

>>> print('\t'.join((weirdMerged[0],)+weirdMerged[1]))
10611   Laura   Mcgee   66821   COMPLETE

I'm not sure that's much better, but it works.
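
To apply both steps to every record rather than to a single tuple, the same expressions can be wrapped in small functions. This is only a minimal sketch: merge_record and to_tsv are hypothetical names, a plain list stands in for the joined RDD, and all fields are assumed to be strings. With a real RDD (assumed here to be called joined) the functions would be passed to map, as the last comment suggests; "output_dir" is a placeholder path.

```python
def merge_record(record):
    """Turn (key, (left, right)) into (key, left + right)."""
    key, (left, right) = record
    return (key, left + right)


def to_tsv(record):
    """Turn (key, (a, b, c, d)) into one tab-delimited line."""
    key, values = record
    return "\t".join((key,) + values)


# Sample records copied from the question; a plain list stands in for the RDD.
sample = [
    (u'10611', ((u'Laura', u'Mcgee'), (u'66821', u'COMPLETE'))),
    (u'4026', ((u'Mary', u'Smith'), (u'3237', u'COMPLETE'))),
    (u'4026', ((u'Mary', u'Smith'), (u'4847', u'CLOSED'))),
]

lines = [to_tsv(merge_record(r)) for r in sample]
for line in lines:
    print(line)

# With a real RDD (assumed to be named `joined`), the equivalent would be:
# joined.map(merge_record).map(to_tsv).saveAsTextFile("output_dir")
```

Keeping the merge and the formatting as separate functions means the merged (key, tuple) records stay available for further processing before they are written out.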

You can also flatten the nested tuples with a generator expression passed to tuple(), as in this example:

my_tuple = (u'10611', ((u'Laura', u'Mcgee'), (u'66821', u'COMPLETE')))
new_tuple = (my_tuple[0], tuple(j for k in my_tuple[1] for j in k))

Output:

print(new_tuple)
>>> ('10611', ('Laura', 'Mcgee', '66821', 'COMPLETE'))

Then, to format the output, you can do something like this:

print("{0}\t{1}".format(new_tuple[0], "\t".join(new_tuple[1])))

Output:

>>> 10611   Laura    Mcgee   66821   COMPLETE
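
If a record might carry more than two inner tuples, or fields that are not strings, the flattening and formatting can be generalized. The following is a hedged sketch assuming Python 3, not part of the original answer: flatten_record and format_record are hypothetical names, itertools.chain.from_iterable replaces the nested generator expression, and str() is applied so that non-string values can be joined.

```python
from itertools import chain


def flatten_record(record):
    """Flatten (key, (t1, t2, ...)) into (key, t1 + t2 + ...)."""
    key, groups = record
    return (key, tuple(chain.from_iterable(groups)))


def format_record(record):
    """Join the key and all values of a flattened record with tabs."""
    key, values = record
    # str() lets non-string fields (for example ints) be joined as well.
    return "\t".join(str(v) for v in chain([key], values))


my_tuple = (u'10611', ((u'Laura', u'Mcgee'), (u'66821', u'COMPLETE')))
new_tuple = flatten_record(my_tuple)
line = format_record(new_tuple)
print(line)
```

Both functions take a single record, so they could be handed to an RDD's map in the same way as the lambda in the question.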

