
How to merge all part files in a folder created by a Spark data frame and rename as the folder name in Scala

Hi, the output of my Spark data frame creates a folder structure with many part files. Now I have to merge all the part files inside each folder and rename that single file to the folder path name.

This is how I do the partitioning:

df.write.partitionBy("DataPartition","PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")/
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")

It creates a folder structure like this:

hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00001-87a61115-92c9-4926-a803-b46315e55a08.c000.csv.gz
hdfs:///user/zeppelin/FinancialLineItem/output/DataPartition=Japan/PartitionYear=1971/part-00002-87a61115-92c9-4926-a803-b46315e55a08.c001.csv.gz

I have to create the final file like this:

hdfs:///user/zeppelin/FinancialLineItem/output/Japan.1971.currenttime.csv.gz

There should be no part files here; both 001 and 002 are merged into one file.

My data size is very big (300 GB gzip and 35 GB zipped), so coalesce(1) and repartition become very slow.

I have seen one solution here, Write single CSV file using spark-csv, but I am not able to implement it. Please help me with it.

repartition throws an error:

error: value repartition is not a member of org.apache.spark.sql.DataFrameWriter[org.apache.spark.sql.Row]
       dfMainOutputFinalWithoutNull.write.repartition("DataPartition","StatementTypeCode")
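
The error arises because repartition is defined on DataFrame/Dataset, not on DataFrameWriter, so it has to be called before .write. A minimal sketch, assuming the repartition columns match the partitionBy columns so that every output folder ends up with a single part file:

import org.apache.spark.sql.functions.col

dfMainOutputFinalWithoutNull
  .repartition(col("DataPartition"), col("PartitionYear"))  // DataFrame method, not a writer method
  .write
  .partitionBy("DataPartition", "PartitionYear")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("hdfs:///user/zeppelin/FinancialLineItem/output")

This avoids the single-task bottleneck of coalesce(1): each (DataPartition, PartitionYear) combination is shuffled to one task, which writes exactly one part file into its folder. The files still need a filesystem-level rename afterwards (see below).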

Try this from the head node outside of Spark...

hdfs dfs -getmerge <src> <localdst>

https://hadoop.apache.org/docs/r1.2.1/file_system_shell.html#getmerge

"Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file." “将源目录和目标文件作为输入,并将src中的文件连接到目标本地文件中。可以选择将addnl设置为启用,以在每个文件的末尾添加换行符。”

