
Spark Streaming textFileStream COPYING


I'm trying to monitor a directory in HDFS and read and process the data in files copied into it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this raises: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_. I read about the problem in forums and in the question "Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_". From what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub (https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala) they say the problem was corrected, but as far as I can see only for FileInputDStream, and I'm using textFileStream. When I tried to use FileInputDStream directly, the IDE threw an error: the symbol is not accessible from this place. Does anyone know how to filter out the files that are still COPYING? I tried:

var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))

but that didn't work, which is expected, because the filter is applied to the contents of the files, not to the file names, which I guess I can't access. As you can see I did plenty of research before asking the question, but didn't get lucky. Any help please?
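One way around the content-vs-name problem is the lower-level StreamingContext.fileStream, which, unlike textFileStream, accepts a predicate on each file's Path. The following is only a sketch: it reuses ssc and arg(0) from the snippet above and assumes the Hadoop new-API TextInputFormat; it has not been tested against the asker's setup.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Predicate on the file *name*, not its contents: skip files that
// "hdfs dfs -put" is still writing under a ._COPYING_ suffix.
def notCopying(path: Path): Boolean =
  !path.getName.endsWith("._COPYING_")

val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat](
    arg(0),            // directory to monitor, as in the question
    notCopying _,      // only pick up fully copied files
    newFilesOnly = true)
  .map(_._2.toString)  // keep just the text, like textFileStream does
```

Note that this only hides the in-progress files; the rename approach in the answer below it is still the more robust fix, because a file can be picked up in the instant between the rename and the next batch scan.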

So I had a look: -put is the wrong method. Look at the final comment: you have to use -rename in your shell script to get an atomic transaction on HDFS.
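The pattern behind that advice: upload under a temporary name, then rename into place, since an HDFS rename within the same directory tree is atomic and Spark Streaming never sees a half-written file. The sketch below demonstrates the pattern on the local filesystem with cp/mv standing in for the HDFS commands (so it can run without a cluster); on a real cluster you would use hdfs dfs -put to a temporary path followed by hdfs dfs -mv to the monitored directory.

```shell
#!/bin/sh
set -e
# Local stand-ins for an HDFS upload. On a cluster this would be:
#   hdfs dfs -put data.txt /input/.data.txt._tmp
#   hdfs dfs -mv  /input/.data.txt._tmp /input/data.txt
WATCH_DIR=$(mktemp -d)                 # stands in for the monitored directory
SRC=$(mktemp)
echo "hello spark" > "$SRC"

# 1. Slow copy phase: the file exists only under a temporary name,
#    so a name filter (or the monitored path itself) never matches it.
cp "$SRC" "$WATCH_DIR/data.txt._COPYING_"

# 2. Atomic rename: the final name appears all at once, fully written.
mv "$WATCH_DIR/data.txt._COPYING_" "$WATCH_DIR/data.txt"

ls "$WATCH_DIR"
```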

