
How to split an RDD into two RDDs and save the result as RDDs with PySpark?

I'm looking for a way to split an RDD into two or more RDDs, and to save each result as a separate RDD. For example, given:

rdd_test = sc.parallelize(range(50), 1)

My code:

def split_population_into_parts(rdd_test):

    N = 2
    repartionned_rdd = rdd_test.repartition(N).distinct()
    rdds_for_testab_populations = repartionned_rdd.glom()

    return rdds_for_testab_populations

rdds_for_testab_populations = split_population_into_parts(rdd_test)

Which gives:

[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48], [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49]]

Now I want to turn each of these lists into its own RDD, say RDD1 and RDD2. How can I do that?

Here is the solution I came up with:

def get_testab_populations_tables(rdds_for_testab_populations):
    # Iterate over the partition lists produced by glom() and turn
    # each one back into its own RDD, bound to a global name
    # tAB_0, tAB_1, ...
    namespace = globals()
    for i, testab_table in enumerate(rdds_for_testab_populations.toLocalIterator()):
        namespace['tAB_%d' % i] = sc.parallelize(testab_table)

Then you can do:

print(tAB_0.collect())
print(tAB_1.collect())
etc.
