
Uniformly partition an RDD in Spark

I have a text file in HDFS with about 10 million records. I am trying to read the file and do some transformations on that data, and I want to uniformly partition the data before processing it. Here is the sample code:

var myRDD = sc.textFile("input file location")

myRDD = myRDD.repartition(10000)

When I do my transformations on this re-partitioned data, I see that one partition has an abnormally large number of records and the others have very little data. (image of the distribution)

So the load is high on only one executor. I also tried the following and got the same result:

myRDD.coalesce(10000, shuffle = true)

Is there a way to uniformly distribute records among partitions?

Attached is the shuffle read size / number of records on that particular executor; the circled one has a lot more records to process than the others.

Any help is appreciated. Thank you.

To deal with the skew, you can repartition your data using distribute by (or using repartition, as you did). For the expression to partition by, choose something that you know will distribute the data evenly.

You can even use the primary key of the DataFrame (or RDD).

Even this approach will not guarantee that the data is distributed evenly between partitions; it all depends on the hash of the expression we distribute by. See also: Spark : how can evenly distribute my records in all partition
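
For illustration, here is a minimal PySpark sketch of that idea (the DataFrame, the surrogate key column "id", and the partition counts below are made up for this example, not taken from the question). Because Spark hash-partitions on the chosen expression, the result is only as even as the hash of that key:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder data standing in for the 10-million-record file.
df = spark.range(0, 100000).withColumn("value", F.col("id") * 2)

# Repartition by an expression -- here the surrogate key "id".
# Rows are assigned to partitions by the hash of that expression,
# so a high-cardinality key usually spreads the data well.
df_even = df.repartition(100, "id")

# Inspect how many rows landed in each partition.
sizes = df_even.rdd.glom().map(len).collect()
print(min(sizes), max(sizes))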

Salting can also be used, which involves adding a new "fake" key and using it alongside the current key for a better distribution of the data. (here is a link for salting)
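
As a rough sketch of salting (not from the original answer; the key column, salt range, and partition count below are arbitrary assumptions), you add a random salt column and repartition on the key plus the salt, so rows that share one hot key get spread over several partitions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder skewed data: most rows share the single hot key 0.
df = spark.range(0, 100000).withColumn(
    "key", F.when(F.col("id") % 100 == 0, F.col("id")).otherwise(F.lit(0))
)

num_salts = 16

# Add a random salt in [0, num_salts) and partition on (key, salt),
# so the hot key is split across roughly num_salts partitions.
salted = df.withColumn("salt", (F.rand() * num_salts).cast("int"))
salted = salted.repartition(64, "key", "salt")

print(salted.rdd.glom().map(len).collect())

Note that if you later aggregate by the original key, the per-salt partial results have to be combined in a second aggregation step.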

For small data I have found that I need to enforce uniform partitioning myself. In PySpark the difference is easily reproducible. In this simple example I'm just trying to parallelize a list of 100 elements into 10 even partitions. I would expect each partition to hold 10 elements. Instead, I get an uneven distribution with partition sizes anywhere from 4 to 22:

my_list = list(range(100))
rdd = spark.sparkContext.parallelize(my_list).repartition(10)
rdd.glom().map(len).collect()

# Outputs: [10, 4, 14, 6, 22, 6, 8, 10, 4, 16]

Here is the workaround I use, which is to index the data myself and then mod the index to find which partition to place the row in:

my_list = list(range(100))
number_of_partitions = 10
rdd = (
    spark.sparkContext
    .parallelize(zip(range(len(my_list)), my_list))
    .partitionBy(number_of_partitions, lambda idx: idx % number_of_partitions)
)
rdd.glom().map(len).collect()

# Outputs: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
