Will Spark wholeTextFiles pick up a partially created file?
I am using the Spark wholeTextFiles API to read files from a source folder and load them into a Hive table.
The files arrive in the source folder from a remote server. They are huge, around 1GB-3GB each, so the SCP of the files takes quite a while.
If I launch the Spark job while a file is being SCP'd to the source folder and the copy is only halfway done, will Spark pick up the file?
If Spark picks up the file when it is halfway copied, that would be a problem, since it would ignore the rest of the file's content.
If you are SCPing the files into the source folder and Spark is reading from that folder, half-written files may be picked up by Spark, since SCP can take some time to copy.
That will happen for sure.
Your task is to find a way not to write directly into that source folder, so that Spark doesn't pick up incomplete files.
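One common way to avoid writing directly into the source folder, which the answer itself does not spell out, is to SCP into a separate staging folder and move each file into the source folder only after its transfer has finished. The sketch below shows just the move step in plain Scala. The object name `StagedPublish`, the method `publish`, and the folder names are hypothetical, and it assumes both folders are on the same filesystem so that the move can be atomic.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

object StagedPublish {
  // Move a fully transferred file from the staging folder into the folder
  // that Spark reads. Assumption: both folders are on the same filesystem;
  // otherwise ATOMIC_MOVE throws AtomicMoveNotSupportedException.
  def publish(staged: Path, sourceDir: Path): Path = {
    val target = sourceDir.resolve(staged.getFileName)
    Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE)
  }

  def main(args: Array[String]): Unit = {
    // Temporary folders stand in for the real staging and source folders.
    val staging = Files.createTempDirectory("staging")
    val source = Files.createTempDirectory("source")
    val staged = Files.write(staging.resolve("data1.txt"), "payload".getBytes("UTF-8"))
    val published = publish(staged, source)
    println(s"published: $published")
  }
}
```

With this arrangement a file only ever appears in the source folder in its complete form, so the Spark job needs no extra filtering.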
Possible way to resolve:
When you call sc.wholeTextFiles(...), keep only those file names that have a corresponding zero-KB file, using map. This assumes the sender creates an empty marker file (here with a .ctl suffix) next to each data file once its transfer has finished.
So, here's code to check whether the corresponding .ctl files are present in the src folder.
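// sc is the SparkContext (predefined in spark-shell)
val fr = sc.wholeTextFiles("D:\\DATA\\TEST\\tempstatus")
// Keep only the paths of the .ctl marker files
val temp1 = fr.map(x => x._1).filter(x => x.endsWith(".ctl"))
// Derive the path of the corresponding real file (the marker path without its .ctl suffix),
// as a (key, value) pair so that it can be joined with fr
val temp2 = temp1.map(x => (x.stripSuffix(".ctl"), x.stripSuffix(".ctl")))
// Keep only the (path, content) entries whose path has a marker
val result = fr
  .join(temp2)
  .map {
    case (_, (entry, x)) => (x, entry)
  }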
... Process the RDD result as required.
The RDD temp2 is changed from RDD[String] to RDD[(String, String)] for the JOIN operation. That detail does not matter for the result.
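The snippet above points wholeTextFiles at the whole folder, so files that are still in flight may be read before the join discards them. A possible variant, sketched here in plain Scala without Spark, is to list the folder first and pass only the completed files on. The names `CompletedFiles`, `completed`, and `toSparkPaths` are hypothetical, and the sketch assumes the source folder is on a local filesystem and that each marker is named after its data file plus ".ctl".

```scala
import java.io.File

object CompletedFiles {
  val MarkerSuffix = ".ctl"

  // Return the data files in `dir` that have a companion marker file
  // (same name plus ".ctl"), sorted by path.
  def completed(dir: File): Seq[File] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File]).filter(_.isFile)
    val names = files.map(_.getName).toSet
    files
      .filterNot(_.getName.endsWith(MarkerSuffix))
      .filter(f => names.contains(f.getName + MarkerSuffix))
      .sortBy(_.getPath)
      .toSeq
  }

  // Build one comma-separated string of paths from the completed files.
  def toSparkPaths(files: Seq[File]): String =
    files.map(_.getPath).mkString(",")

  def main(args: Array[String]): Unit = {
    // A temporary folder stands in for the real source folder.
    val dir = java.nio.file.Files.createTempDirectory("src").toFile
    new File(dir, "a.txt").createNewFile()
    new File(dir, "a.txt.ctl").createNewFile() // transfer of a.txt is complete
    new File(dir, "b.txt").createNewFile()     // no marker yet: still in flight
    println(toSparkPaths(completed(dir)))
  }
}
```

Since wholeTextFiles accepts a comma-separated list of paths, the resulting string could be passed as sc.wholeTextFiles(CompletedFiles.toSparkPaths(...)). If no file is complete yet the string is empty, so the job should skip the read in that case.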