
PySpark: saving a DataFrame takes too long

I have a PySpark DataFrame like the following in Databricks. The DataFrame consists of 4844472 rows. Showing it takes 2.70 minutes:

mp.show()
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|                  d1|                  d2|                  d3|                  d4|                  d5|  idx1|  idx2|  idx3|  idx4|  idx5|stop_id|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|9.595641094599582E-4|0.001349351889471...|0.001349351889471...|0.001349351889471...|0.001349351889471...| 28230| 17538| 26928| 19679| 17939|      0|
|0.001073202843710...|0.001270625201076...|0.001270625201076...|0.001270625201076...|0.001270625201076...| 28230| 17939| 17538| 26928| 24350|      1|
|  0.5018332258683085|  0.6136104198426214|  0.7515940084598605|  0.7923086910541867|  0.8528614951791638| 36508|   352| 41406|  8666| 49244|      2|
|  0.5018463054690073|  0.6132230820328666|  0.7511594488585572|  0.7918622881865559|  0.8524241433703198| 36508|   352| 41406|  8666| 49244|      3|
| 0.03892296364448588| 0.10489822816393383| 0.11015065590036736| 0.11083574976820404| 0.11107823934046591|  8666| 41406| 15387| 48473| 67948|      4|
|   10.02685122773378|  10.026859886604985|  10.026931929963919|  10.027049899955523|   10.02708752857522| 96155| 99120| 93630| 95712| 95603|      5|
|  0.0949417179722534| 0.09624239157298783| 0.09663276949951659| 0.09666148620040976| 0.09668953319514831| 43297| 43729|  1552| 13413| 28338|      6|
|    1.58821803894894|   1.700924159639725|  1.7100413892619204|  1.7659644202932838|  1.7716894514740533| 36508| 31802| 32021|   352| 41742|      7|
| 0.14986457872379202|   2.792841786494224|   3.836931747376168|   3.843816724749531|  3.9381444585189453| 35388| 41824| 31802| 32021| 41742|      8|
| 0.07721536374839136| 0.08156724948742954| 0.08179178347923806| 0.08197182486131196| 0.08230211151587184| 28852|  5286| 15116| 43700| 43297|      9|
| 0.07729090186445249| 0.08164045431643911| 0.08186450776482652| 0.08204599950900325| 0.08237366675966874| 28852|  5286| 15116| 43700| 43297|     10|
|  0.0769126077608714| 0.08126623437928565|  0.0814915948802193| 0.08166946271648905| 0.08200422782781865| 28852|  5286| 15116| 43700| 43297|     11|
| 0.07726243730458815| 0.08161929282648625| 0.08184445756719544| 0.08202232556886682|  0.0823560729538226| 28852|  5286| 15116| 43700| 43297|     12|
|0.003059320786099506|0.006049295374860495|0.006068327803710736|0.006073689066371823|0.006076662805415367|116339|107049|115787|110162|115325|     13|
|0.008394860593130297| 0.01460154756618598| 0.01464517932249764|0.014657324902570745|0.014662473132286578|116339|107049|115787|110162|115325|     14|
|0.001033675981635...|0.002839691356009074|0.003808353392737469|0.003818776963070...| 0.00398314099011343|116760|114788|115385|111516|116688|     15|
|  1.3353905677767632|  2.3859918643288904|  2.5926306493938913|  2.6000405755949068|  2.6901787282764746| 35388| 41824| 31802| 32021| 41742|     16|
| 0.00476180910371182|0.005343904103854576|0.005609118384537962|0.005762718043973694|0.005970448424488381| 81157| 81355| 79754| 79586| 80617|     17|
|9.337105318309089E-5|4.923642967966935E-4|6.450655567561298E-4|7.293044985905078E-4|0.001032583874460...|100731| 92800|100571| 89266| 88715|     18|
|0.004311753043494...|0.005008322149796936|0.005161120819827323|0.005407692984541363|0.005592887249437105| 81157| 79754| 79586| 77492| 80617|     19|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
Command took 2.70 minutes

If I try to save it, the write seems to take forever:

mp.write.mode("append").format("orc").save("mnt/tmp/")

Try using repartition before saving:

mp.repartition(200).write.mode("append").format("orc").save("mnt/tmp/")

Use an appropriate number of partitions based on the size of your DataFrame. An optimal partition size is between 500 MB and 1 GB.
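
The sizing rule above can be turned into a small helper. This is only a sketch: the names estimate_num_partitions and write_orc_repartitioned are made up for illustration, the 512 MiB default is one reading of the "500 MB" lower bound, and total_bytes is an estimate you have to supply yourself (for example from the size of the source files), since it is not something the snippet measures.

```python
import math

# Hypothetical helper: how many partitions keep each one near the target size.
def estimate_num_partitions(total_bytes, target_partition_bytes=512 * 1024**2):
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be non-negative, target must be positive")
    # Round up so no partition exceeds the target; never return fewer than 1.
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Hypothetical wrapper around the write call from the answer.
# `df` is expected to be a PySpark DataFrame; `total_bytes` is your own estimate.
def write_orc_repartitioned(df, path, total_bytes,
                            target_partition_bytes=512 * 1024**2):
    n = estimate_num_partitions(total_bytes, target_partition_bytes)
    df.repartition(n).write.mode("append").format("orc").save(path)
    return n
```

With the DataFrame from the question this would be called as write_orc_repartitioned(mp, "mnt/tmp/", total_bytes=...), with your own size estimate in place of the dots. If the estimate is far off, the resulting partition count will be too, so treat the number as a starting point to tune rather than a fixed answer.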
