
pyspark - spark - how to create a parallel multistage task using RDD

I am using Python and would like to create a job structured as follows:

1. The job has n parts that can run in parallel.
2. Each part has m sub-parts that need to run sequentially.

I would like Spark to manage the fault tolerance for me, so I tried to use RDDs. The issue is that I can't find a way to create that "two-dimensional" RDD, only flat ones.

Is there any way to do this with Spark and PySpark?

I need it to handle faults and to run in parallel.

Maybe there is some way of using a regular RDD and forcing some jobs to happen before others? Maybe something that works more like a wait?

I guess I can create n threads, each running an RDD of its own, but that seems a bit blunt...

Thanks

There are two ways I am familiar with to add multithreading to your job.

1. Let's say you have an RDD with X partitions and every partition has ~Y elements. Your RDD is RDD[A] and you want to transform it into RDD[B], but the A -> B conversion is a little heavy and takes time. Instead of using the regular RDD[A].map(A => transform(A)), which iterates over every row of a partition sequentially, you can use mapPartitions, which hands you all the elements of a partition at once, and run a multithreaded transform over them; that can save time (a sketch follows this point). *NOTE: mapPartitions gives you an iterator, so collecting it into a list brings all the elements into memory; be careful.
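
A minimal PySpark sketch of this idea, assuming a hypothetical heavy transform() function; the function body, worker count and data are illustrative only. Because the threads run inside a Python worker, this mostly pays off when the per-element work is I/O-bound (e.g. calling an external service) rather than CPU-bound:

from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkContext

def transform(a):
    # stand-in for the heavy A -> B conversion (illustrative only)
    return a * 2

def transform_partition(rows):
    # materializing the iterator keeps the whole partition in memory,
    # so this is only reasonable when partitions are not too large
    rows = list(rows)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # run the heavy transform concurrently within this partition
        return list(pool.map(transform, rows))

sc = SparkContext.getOrCreate()
rdd_a = sc.parallelize(range(100), 8)             # RDD[A] with 8 partitions
rdd_b = rdd_a.mapPartitions(transform_partition)  # RDD[B]
print(rdd_b.take(5))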

2. Let's say you are done with your ETL and have an RDD[A] that you cached, and now you want to write it to 3 different data sources (I hope you'll use Kafka instead, but let's say that is the scenario).

Instead of doing:

RDD[A].saveToDataSource1
RDD[A].saveToDataSource2
RDD[A].saveToDataSource3

and running them sequentially, you can use multithreading here and do the writes in parallel (see the sketch below). You can do the same if you read from 3 different data sources and then union them, for example.
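
A minimal sketch of this pattern, assuming a hypothetical save_to_source helper and illustrative /tmp output paths standing in for the three real data sources; driver-side threads let Spark schedule the three save jobs concurrently instead of one after the other:

from concurrent.futures import ThreadPoolExecutor

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd_a = sc.parallelize(range(1000)).cache()  # RDD[A], cached after the ETL
rdd_a.count()  # materialize the cache once so the writes do not recompute it

def save_to_source(rdd, path):
    # stand-in for a real datasource write (illustrative only)
    rdd.saveAsTextFile(path)

paths = ["/tmp/source1", "/tmp/source2", "/tmp/source3"]  # illustrative targets
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    futures = [pool.submit(save_to_source, rdd_a, p) for p in paths]
    for future in futures:
        future.result()  # re-raises any failure from that write job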

Those are the two cases where I see multithreading helping you in Spark; everything else Spark already takes care of, parallelizing as much as it can.
