
Will spark wholeTextFiles pick partially created file?

I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.

Files arrive in the source folder from a remote server. The files are huge, about 1 GB-3 GB each, so the SCP transfer takes quite a while.

If I launch the Spark job while a file is still being SCP'd into the source folder and the transfer is only halfway done, will Spark pick up that file?

If Spark picks up a file halfway through the transfer, that would be a problem, since it would ignore the rest of the file's content.

If you are SCPing files into the source folder and Spark is reading from that same folder, half-written files may well be picked up by Spark, because the SCP copy can take some time.

That will happen for sure.

Your task is to avoid writing directly into that source folder, so that Spark doesn't pick up incomplete files.

A possible way to resolve this:

  1. At the end of each file copy, SCP a ZERO-kb marker file to indicate that the SCP is complete.
  2. In the Spark job, when you call sc.wholeTextFiles(...), pick only those file names that have a corresponding zero-kb file, using map.
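Step 1 above can be sketched locally as follows. This is a minimal sketch with hypothetical paths (`/tmp/landing`, `bigfile.txt`); in practice both the data file and the marker would arrive via SCP from the remote host, with the marker sent only after the data transfer succeeds:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical landing folder; in practice both files arrive via SCP.
val landing = Paths.get("/tmp/landing")
Files.createDirectories(landing)

// 1. The data file must be fully written (copied) first.
val dataFile = landing.resolve("bigfile.txt")
Files.write(dataFile, "payload".getBytes)

// 2. Only then is the zero-byte .ctl marker created; its presence
//    signals that the data file is complete and safe to read.
val marker = landing.resolve("bigfile.txt.ctl")
Files.deleteIfExists(marker)
Files.createFile(marker)
```

The ordering is the whole point: a reader that only trusts files with a matching `.ctl` marker can never see a partially transferred file.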


So, here's code to check whether corresponding .ctl files are present in the source folder.

val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")

// Get only the .ctl marker files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))

// Identify the corresponding REAL files - paths without the .ctl suffix
val temp2 = temp1.map(x => (x.replace(".ctl", ""), x.replace(".ctl", "")))

// Join on the file path, keeping only real files whose marker exists
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry)
  }

... Process the rdd result as required.

The rdd temp2 is converted from RDD[String] to RDD[(String, String)] so that it can be used in the join operation.
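The filtering logic above can be illustrated without a Spark cluster using plain Scala collections. This is a minimal sketch with hypothetical file names, using a set lookup in place of the RDD join:

```scala
// Simulated wholeTextFiles output: (path, content) pairs, where one
// file's .ctl marker has not arrived yet (its SCP is still running).
val fr = Seq(
  ("/landing/a.txt", "contents of a"),
  ("/landing/a.txt.ctl", ""),
  ("/landing/b.txt", "contents of b") // no b.txt.ctl yet
)

// Real file paths whose zero-kb marker is present.
val ready = fr.map(_._1)
  .filter(_.endsWith(".ctl"))
  .map(_.stripSuffix(".ctl"))
  .toSet

// Keep only completed files; the markers themselves are dropped too.
val result = fr.filter { case (path, _) => ready.contains(path) }

println(result) // List((/landing/a.txt,contents of a))
```

Note that b.txt is excluded even though its content is readable: without the marker, there is no way to know whether the transfer finished.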
