
Spark Streaming textFileStream COPYING


I'm trying to monitor a directory in HDFS and read and process the data in files copied into it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this raises: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_. I read about the problem in forums and in the question "Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_". From what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub (https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala) they say the problem was corrected, but as far as I can see only for FileInputDStream, and I'm using textFileStream. When I tried to use FileInputDStream directly, the IDE threw an error: the symbol is not accessible from this place. Does anyone know how to filter out the files that are still COPYING? I tried:

var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))

but that didn't work, which is expected, because the filter is applied to the contents of the files, not to the file names, which I guess I can't access. As you can see I did plenty of research before asking the question, but didn't get lucky. Any help please?
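One way around the content-vs-name problem is the lower-level StreamingContext.fileStream, which, unlike textFileStream, accepts a predicate on each file's Path. The following is only a sketch: it reuses ssc and arg(0) from the snippet above and assumes the Hadoop new-API TextInputFormat; it has not been tested against the asker's setup.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Predicate on the file *name*, not its contents: skip files that
// "hdfs dfs -put" is still writing under a ._COPYING_ suffix.
def notCopying(path: Path): Boolean =
  !path.getName.endsWith("._COPYING_")

val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat](
    arg(0),            // directory to monitor, as in the question
    notCopying _,      // only pick up fully copied files
    newFilesOnly = true)
  .map(_._2.toString)  // keep just the text, like textFileStream does
```

Note that this only hides the in-progress files; the rename approach in the answer below it is still the more robust fix, because a file can be picked up in the instant between the rename and the next batch scan.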

So I had a look: -put is the wrong method. Look at the final comment: you have to use -rename in your shell script to get an atomic transaction on HDFS.
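The pattern behind that advice: upload under a temporary name, then rename into place, since an HDFS rename within the same directory tree is atomic and Spark Streaming never sees a half-written file. The sketch below demonstrates the pattern on the local filesystem with cp/mv standing in for the HDFS commands (so it can run without a cluster); on a real cluster you would use hdfs dfs -put to a temporary path followed by hdfs dfs -mv to the monitored directory.

```shell
#!/bin/sh
set -e
# Local stand-ins for an HDFS upload. On a cluster this would be:
#   hdfs dfs -put data.txt /input/.data.txt._tmp
#   hdfs dfs -mv  /input/.data.txt._tmp /input/data.txt
WATCH_DIR=$(mktemp -d)                 # stands in for the monitored directory
SRC=$(mktemp)
echo "hello spark" > "$SRC"

# 1. Slow copy phase: the file exists only under a temporary name,
#    so a name filter (or the monitored path itself) never matches it.
cp "$SRC" "$WATCH_DIR/data.txt._COPYING_"

# 2. Atomic rename: the final name appears all at once, fully written.
mv "$WATCH_DIR/data.txt._COPYING_" "$WATCH_DIR/data.txt"

ls "$WATCH_DIR"
```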

