
Will spark wholeTextFiles pick partially created file?

I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.

Files arrive in the source folder from a remote server. The files are huge, about 1 GB-3 GB each, so the SCP transfer takes quite a while.

If I launch the Spark job while a file is still being SCP'd into the source folder and the transfer is only halfway done, will Spark pick up that file?

If Spark picks up a file halfway through the transfer, that would be a problem, since it would ignore the rest of the file's content.

If you are SCPing files into the source folder and Spark is reading from that same folder, half-written files may well be picked up by Spark, because the SCP copy can take some time.

That will happen for sure.

Your task is to avoid writing directly into that source folder, so that Spark doesn't pick up incomplete files.

A possible way to resolve this:

  1. At the end of each file copy, SCP a ZERO-kb marker file to indicate that the SCP is complete.
  2. In the Spark job, when you call sc.wholeTextFiles(...), pick only those file names that have a corresponding zero-kb file, using map.
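Step 1 above can be sketched locally as follows. This is a minimal sketch with hypothetical paths (`/tmp/landing`, `bigfile.txt`); in practice both the data file and the marker would arrive via SCP from the remote host, with the marker sent only after the data transfer succeeds:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical landing folder; in practice both files arrive via SCP.
val landing = Paths.get("/tmp/landing")
Files.createDirectories(landing)

// 1. The data file must be fully written (copied) first.
val dataFile = landing.resolve("bigfile.txt")
Files.write(dataFile, "payload".getBytes)

// 2. Only then is the zero-byte .ctl marker created; its presence
//    signals that the data file is complete and safe to read.
val marker = landing.resolve("bigfile.txt.ctl")
Files.deleteIfExists(marker)
Files.createFile(marker)
```

The ordering is the whole point: a reader that only trusts files with a matching `.ctl` marker can never see a partially transferred file.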


So, here's code to check whether corresponding .ctl files are present in the source folder.

val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")

// Get only the .ctl marker files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))

// Identify the corresponding REAL files - paths without the .ctl suffix
val temp2 = temp1.map(x => (x.replace(".ctl", ""), x.replace(".ctl", "")))

// Join on the file path, keeping only real files whose marker exists
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry)
  }

... Process the rdd result as required.

The rdd temp2 is converted from RDD[String] to RDD[(String, String)] so that it can be used in the join operation.
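The filtering logic above can be illustrated without a Spark cluster using plain Scala collections. This is a minimal sketch with hypothetical file names, using a set lookup in place of the RDD join:

```scala
// Simulated wholeTextFiles output: (path, content) pairs, where one
// file's .ctl marker has not arrived yet (its SCP is still running).
val fr = Seq(
  ("/landing/a.txt", "contents of a"),
  ("/landing/a.txt.ctl", ""),
  ("/landing/b.txt", "contents of b") // no b.txt.ctl yet
)

// Real file paths whose zero-kb marker is present.
val ready = fr.map(_._1)
  .filter(_.endsWith(".ctl"))
  .map(_.stripSuffix(".ctl"))
  .toSet

// Keep only completed files; the markers themselves are dropped too.
val result = fr.filter { case (path, _) => ready.contains(path) }

println(result) // List((/landing/a.txt,contents of a))
```

Note that b.txt is excluded even though its content is readable: without the marker, there is no way to know whether the transfer finished.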
