
How to write a fixed width output file in Spark Scala

From a Spark Scala DataFrame I am currently getting text output with comma-separated values, and it is written to a folder as part files. I want it to be fixed width instead: the first column should be 10 bytes, the next 5 bytes, the third 8 bytes, and so on, and it should be a single output file (output.txt) rather than part files.

    myfile.rdd
      .map(r => { val x = r.toString; x.substring(1, x.length - 1) })  // strip the surrounding [ ] from Row.toString
      .saveAsTextFile("C:/Users/rukku/Desktop/op")

Example output (written as part0001, part0002 in a folder):

    aaaaa,bbbb,ccccc,dddddd,eee
    e,f,g,h,i
    jj,kk,ll,mm,nn

Needed output (output.txt, without a folder):

     aaaaabbbbcccccddddddeee
         e   f    g     h  i
        jj  kk   ll    mm nn

The following two-step approach could produce the expected outcome:

1- Define metadata that holds the length of each column in the target file. Each column then needs something like i) a string-returning function that adds leading/trailing spaces to reach the column's length, ii) leading zeros for a numeric column, and so on.

2- Once all the length fixing has been applied to the DataFrame, .coalesce(1) can reduce the data to a single partition so that a single output file is written. Make sure this does not hurt performance, since all the data then passes through one task.
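The two steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the column names `c1`, `c2`, `c3`, the `exampleLayout` metadata and the `FixedWidthSketch` object are assumptions made up for illustration, and the widths count characters rather than bytes (the two only coincide for single-byte data). Values are left-padded with spaces to match the right-aligned sample in the question; `rpad` would left-align instead, and passing "0" as the pad string would give leading zeros for numeric columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, concat, lit, lpad}

object FixedWidthSketch {
  // Hypothetical metadata: (column name, width in characters).
  val exampleLayout: Seq[(String, Int)] = Seq("c1" -> 10, "c2" -> 5, "c3" -> 8)

  // Returns a DataFrame with a single string column "value",
  // holding one fixed-width line per input row.
  def toFixedWidth(df: DataFrame, layout: Seq[(String, Int)]): DataFrame = {
    val padded = layout.map { case (name, width) =>
      // Nulls become empty strings; lpad pads on the left with spaces
      // and truncates values longer than `width`.
      lpad(coalesce(col(name).cast("string"), lit("")), width, " ")
    }
    df.select(concat(padded: _*).as("value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fixed-width-sketch")
      .master("local[*]") // assumption: local run for demonstration only
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real DataFrame.
    val df = Seq(("aaaaa", "bbbb", "ccccc"), ("e", "f", "g")).toDF("c1", "c2", "c3")

    toFixedWidth(df, exampleLayout)
      .coalesce(1)            // one partition, hence one part file
      .write
      .mode("overwrite")
      .text("C:/Users/rukku/Desktop/op")

    spark.stop()
  }
}
```

Note that even with a single partition, Spark still writes a directory containing one part-* file rather than a bare output.txt, so a rename or merge step is still needed afterwards.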

If the job runs very frequently and creates too many files, a separate roll out job can also be set up to read and merge the files. (This needs to be designed around the application's behaviour, SLA, etc.)
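As a rough illustration of such a merge step, the sketch below concatenates the part files of a Spark output directory on the local filesystem into one file. The `MergePartFiles` object and its `merge` method are hypothetical names, and the sketch assumes the part files start with `part-`, that sorting them by name gives the desired order, and that the target lies outside the part directory. On HDFS the same idea would go through Hadoop's FileSystem API instead of `java.nio`.

```scala
import java.nio.file.{Files, Path, Paths, StandardOpenOption}
import scala.collection.JavaConverters._

object MergePartFiles {
  // Concatenates all part-* files found directly in `partDir` into `target`,
  // in file-name order. Marker files such as _SUCCESS and hidden .crc files
  // are skipped because they do not start with "part-".
  def merge(partDir: Path, target: Path): Unit = {
    val listing = Files.list(partDir)
    val parts =
      try listing.iterator().asScala
        .filter(_.getFileName.toString.startsWith("part-"))
        .toList
        .sortBy(_.getFileName.toString)
      finally listing.close()

    val out = Files.newOutputStream(
      target,
      StandardOpenOption.CREATE,
      StandardOpenOption.TRUNCATE_EXISTING)
    try parts.foreach(p => Files.copy(p, out)) // append each part's bytes
    finally out.close()
  }

  def main(args: Array[String]): Unit =
    // Paths are placeholders based on the directory used in the question.
    merge(Paths.get("C:/Users/rukku/Desktop/op"),
          Paths.get("C:/Users/rukku/Desktop/output.txt"))
}
```

Because the files are copied byte for byte, the fixed-width lines are left exactly as Spark wrote them.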
