I have a PySpark DataFrame like the following in Databricks. The DataFrame has 4,844,472 rows. Just showing it takes 2.70 minutes:
mp.show()
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
| d1| d2| d3| d4| d5| idx1| idx2| idx3| idx4| idx5|stop_id|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|9.595641094599582E-4|0.001349351889471...|0.001349351889471...|0.001349351889471...|0.001349351889471...| 28230| 17538| 26928| 19679| 17939| 0|
|0.001073202843710...|0.001270625201076...|0.001270625201076...|0.001270625201076...|0.001270625201076...| 28230| 17939| 17538| 26928| 24350| 1|
| 0.5018332258683085| 0.6136104198426214| 0.7515940084598605| 0.7923086910541867| 0.8528614951791638| 36508| 352| 41406| 8666| 49244| 2|
| 0.5018463054690073| 0.6132230820328666| 0.7511594488585572| 0.7918622881865559| 0.8524241433703198| 36508| 352| 41406| 8666| 49244| 3|
| 0.03892296364448588| 0.10489822816393383| 0.11015065590036736| 0.11083574976820404| 0.11107823934046591| 8666| 41406| 15387| 48473| 67948| 4|
| 10.02685122773378| 10.026859886604985| 10.026931929963919| 10.027049899955523| 10.02708752857522| 96155| 99120| 93630| 95712| 95603| 5|
| 0.0949417179722534| 0.09624239157298783| 0.09663276949951659| 0.09666148620040976| 0.09668953319514831| 43297| 43729| 1552| 13413| 28338| 6|
| 1.58821803894894| 1.700924159639725| 1.7100413892619204| 1.7659644202932838| 1.7716894514740533| 36508| 31802| 32021| 352| 41742| 7|
| 0.14986457872379202| 2.792841786494224| 3.836931747376168| 3.843816724749531| 3.9381444585189453| 35388| 41824| 31802| 32021| 41742| 8|
| 0.07721536374839136| 0.08156724948742954| 0.08179178347923806| 0.08197182486131196| 0.08230211151587184| 28852| 5286| 15116| 43700| 43297| 9|
| 0.07729090186445249| 0.08164045431643911| 0.08186450776482652| 0.08204599950900325| 0.08237366675966874| 28852| 5286| 15116| 43700| 43297| 10|
| 0.0769126077608714| 0.08126623437928565| 0.0814915948802193| 0.08166946271648905| 0.08200422782781865| 28852| 5286| 15116| 43700| 43297| 11|
| 0.07726243730458815| 0.08161929282648625| 0.08184445756719544| 0.08202232556886682| 0.0823560729538226| 28852| 5286| 15116| 43700| 43297| 12|
|0.003059320786099506|0.006049295374860495|0.006068327803710736|0.006073689066371823|0.006076662805415367|116339|107049|115787|110162|115325| 13|
|0.008394860593130297| 0.01460154756618598| 0.01464517932249764|0.014657324902570745|0.014662473132286578|116339|107049|115787|110162|115325| 14|
|0.001033675981635...|0.002839691356009074|0.003808353392737469|0.003818776963070...| 0.00398314099011343|116760|114788|115385|111516|116688| 15|
| 1.3353905677767632| 2.3859918643288904| 2.5926306493938913| 2.6000405755949068| 2.6901787282764746| 35388| 41824| 31802| 32021| 41742| 16|
| 0.00476180910371182|0.005343904103854576|0.005609118384537962|0.005762718043973694|0.005970448424488381| 81157| 81355| 79754| 79586| 80617| 17|
|9.337105318309089E-5|4.923642967966935E-4|6.450655567561298E-4|7.293044985905078E-4|0.001032583874460...|100731| 92800|100571| 89266| 88715| 18|
|0.004311753043494...|0.005008322149796936|0.005161120819827323|0.005407692984541363|0.005592887249437105| 81157| 79754| 79586| 77492| 80617| 19|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
Command took 2.70 minutes
If I try to save it, the write takes so long that it never seems to finish:
mp.write.mode("append").format("orc").save("mnt/tmp/")
Try calling repartition before saving:
mp.repartition(200).write.mode("append").format("orc").save("mnt/tmp/")
Use an appropriate number of partitions for the size of your DataFrame. A good target partition size is between 500 MB and 1 GB.
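As a rough sketch of how to pick that number, you can divide an estimate of the data size by the target partition size. The helper name `partitions_for_size` and the 8-bytes-per-column estimate below are assumptions for illustration only; the real size depends on the column types and on ORC compression.

```python
import math

MB = 1024 * 1024

def partitions_for_size(total_bytes, target_partition_bytes=512 * MB):
    """Hypothetical helper: number of partitions that keeps each one near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))

# Assumption: 11 columns at roughly 8 bytes each, uncompressed, for 4,844,472 rows.
estimated_bytes = 4_844_472 * 11 * 8
num_partitions = partitions_for_size(estimated_bytes)
print(num_partitions)

# Then, in the notebook where mp is defined:
# mp.repartition(num_partitions).write.mode("append").format("orc").save("mnt/tmp/")
```

Under these assumptions the whole DataFrame is well under 500 MB, so the size rule alone would ask for very few partitions. In that case a higher count mainly buys write parallelism, so treat the result as a lower bound and not a fixed answer.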