
PySpark: saving a dataframe takes too long

I have a PySpark dataframe like the following in Databricks. The dataframe has 4,844,472 rows. Showing it takes 2.70 minutes:

mp.show()
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|                  d1|                  d2|                  d3|                  d4|                  d5|  idx1|  idx2|  idx3|  idx4|  idx5|stop_id|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|9.595641094599582E-4|0.001349351889471...|0.001349351889471...|0.001349351889471...|0.001349351889471...| 28230| 17538| 26928| 19679| 17939|      0|
|0.001073202843710...|0.001270625201076...|0.001270625201076...|0.001270625201076...|0.001270625201076...| 28230| 17939| 17538| 26928| 24350|      1|
|  0.5018332258683085|  0.6136104198426214|  0.7515940084598605|  0.7923086910541867|  0.8528614951791638| 36508|   352| 41406|  8666| 49244|      2|
|  0.5018463054690073|  0.6132230820328666|  0.7511594488585572|  0.7918622881865559|  0.8524241433703198| 36508|   352| 41406|  8666| 49244|      3|
| 0.03892296364448588| 0.10489822816393383| 0.11015065590036736| 0.11083574976820404| 0.11107823934046591|  8666| 41406| 15387| 48473| 67948|      4|
|   10.02685122773378|  10.026859886604985|  10.026931929963919|  10.027049899955523|   10.02708752857522| 96155| 99120| 93630| 95712| 95603|      5|
|  0.0949417179722534| 0.09624239157298783| 0.09663276949951659| 0.09666148620040976| 0.09668953319514831| 43297| 43729|  1552| 13413| 28338|      6|
|    1.58821803894894|   1.700924159639725|  1.7100413892619204|  1.7659644202932838|  1.7716894514740533| 36508| 31802| 32021|   352| 41742|      7|
| 0.14986457872379202|   2.792841786494224|   3.836931747376168|   3.843816724749531|  3.9381444585189453| 35388| 41824| 31802| 32021| 41742|      8|
| 0.07721536374839136| 0.08156724948742954| 0.08179178347923806| 0.08197182486131196| 0.08230211151587184| 28852|  5286| 15116| 43700| 43297|      9|
| 0.07729090186445249| 0.08164045431643911| 0.08186450776482652| 0.08204599950900325| 0.08237366675966874| 28852|  5286| 15116| 43700| 43297|     10|
|  0.0769126077608714| 0.08126623437928565|  0.0814915948802193| 0.08166946271648905| 0.08200422782781865| 28852|  5286| 15116| 43700| 43297|     11|
| 0.07726243730458815| 0.08161929282648625| 0.08184445756719544| 0.08202232556886682|  0.0823560729538226| 28852|  5286| 15116| 43700| 43297|     12|
|0.003059320786099506|0.006049295374860495|0.006068327803710736|0.006073689066371823|0.006076662805415367|116339|107049|115787|110162|115325|     13|
|0.008394860593130297| 0.01460154756618598| 0.01464517932249764|0.014657324902570745|0.014662473132286578|116339|107049|115787|110162|115325|     14|
|0.001033675981635...|0.002839691356009074|0.003808353392737469|0.003818776963070...| 0.00398314099011343|116760|114788|115385|111516|116688|     15|
|  1.3353905677767632|  2.3859918643288904|  2.5926306493938913|  2.6000405755949068|  2.6901787282764746| 35388| 41824| 31802| 32021| 41742|     16|
| 0.00476180910371182|0.005343904103854576|0.005609118384537962|0.005762718043973694|0.005970448424488381| 81157| 81355| 79754| 79586| 80617|     17|
|9.337105318309089E-5|4.923642967966935E-4|6.450655567561298E-4|7.293044985905078E-4|0.001032583874460...|100731| 92800|100571| 89266| 88715|     18|
|0.004311753043494...|0.005008322149796936|0.005161120819827323|0.005407692984541363|0.005592887249437105| 81157| 79754| 79586| 77492| 80617|     19|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
Command took 2.70 minutes

If I try to save it, the write runs far longer and never seems to finish:

mp.write.mode("append").format("orc").save("mnt/tmp/")

Try using repartition before saving:

mp.repartition(200).write.mode("append").format("orc").save("mnt/tmp/")

Use an appropriate number of partitions based on the size of your dataframe. An optimal partition size is between 500 MB and 1 GB.
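
As a rough sketch of how that partition count could be derived from the data itself (the 512 MB target, the 1% sample and the string-length size estimate below are illustrative assumptions, not values from the answer):

# Estimate the dataframe's size from a small sample and pick a partition
# count that targets roughly 500 MB per partition.
TARGET_PARTITION_BYTES = 512 * 1024 * 1024

row_count = mp.count()
sample = mp.sample(fraction=0.01, seed=42).collect()   # roughly 1% of the rows
avg_row_bytes = sum(len(str(r)) for r in sample) / max(len(sample), 1)

estimated_bytes = row_count * avg_row_bytes
num_partitions = max(1, int(estimated_bytes / TARGET_PARTITION_BYTES))

mp.repartition(num_partitions).write.mode("append").format("orc").save("mnt/tmp/")

With a few million rows of doubles and integers the total size may well be under 1 GB, in which case this estimate will suggest far fewer than 200 partitions while still staying within the 500 MB to 1 GB target.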
