
Using coalesce(1) is taking too much time for writing dataset to s3

I'm using coalesce(1) to write a set of records to an S3 bucket as CSV, and it is taking too much time for just 505 records.

dataset.coalesce(1).write().csv("s3a://bucketname/path");

I should mention that before this writing step there is an encryption step, which changes the values of some fields in each row of the dataset. There I'm using repartition(200), as:

dataset.javaRDD().repartition(200).map(r -> func(r));

If I skip the encryption step, the write doesn't even take a minute.
What is causing the process to slow down? How can I improve the performance?
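For context, a minimal sketch of how the two snippets above presumably chain together. The conversion back to a Dataset is an assumption (it is not shown in the question), spark is the SparkSession, and encrypt() is a hypothetical stand-in for func(). The comments point at one likely cause of the slowdown: coalesce(1) creates a narrow dependency, so Spark pipelines the encryption map into the single output task.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructType;

StructType schema = dataset.schema();

// Encryption step: repartition(200) shuffles the rows into 200 partitions.
JavaRDD<Row> encrypted = dataset.javaRDD()
        .repartition(200)
        .map(r -> encrypt(r));   // encrypt() stands in for the poster's func()

// coalesce(1) creates a narrow dependency, so Spark pipelines the whole
// map stage into one task: the encryption of all records runs on a single
// executor core. repartition(1) adds a shuffle boundary instead, which
// lets the encryption run in parallel across the 200 partitions and only
// merges the already-encrypted rows into one output file.
spark.createDataFrame(encrypted, schema)
        .repartition(1)
        .write()
        .csv("s3a://bucketname/path");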

Always avoid using coalesce(1); use partitionBy instead. I suppose the function you are using to encrypt the data is taking a lot of time, since it has to iterate through all the records. You could change it to a flatMap and check the performance, as sketched below.
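For instance, a minimal sketch of writing with partitionBy instead of coalesce(1). The "region" column is a hypothetical example; note that partitionBy writes files in parallel, producing one directory per distinct column value rather than a single CSV file.

// Write in parallel, one output directory per distinct value of the
// (hypothetical) "region" column, instead of funnelling everything
// through a single coalesce(1) task.
dataset.write()
        .partitionBy("region")
        .csv("s3a://bucketname/path");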

I'd request you to read up on map and flatMap.
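For reference, map emits exactly one output record per input record, while flatMap returns an iterator and can emit zero or more. A minimal sketch, with rdd and encrypt() as assumed names:

import java.util.Collections;

// map: exactly one output row per input row
JavaRDD<Row> mapped = rdd.map(r -> encrypt(r));

// flatMap: an iterator of zero or more output rows per input row
JavaRDD<Row> flatMapped = rdd.flatMap(r ->
        Collections.singletonList(encrypt(r)).iterator());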

Welcome to the community. Please do accept the answer if it is useful.
