
Spark Streaming: How to get the filename of a processed file in Python

I'm sort of a noob to Spark (and also Python honestly) so please forgive me if I've missed something obvious.

I am doing file streaming with Spark and Python. In the first example I did, Spark correctly listens to the given directory and counts word occurrences in the file, so I know that everything works in terms of listening to the directory.

Now I am trying to get the name of the file that is processed for auditing purposes. I read here http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCANvfmP8OC9jrpVgWsRWfqjMxeYd6sE6EojfdyFy_GaJ3BO43_A@mail.gmail.com%3E that this is no trivial task. I got a possible solution here http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAEgyCiZbnrd6Y_aG0cBRCVC1u37X8FERSEcHB=tR3A2VGrGrPQ@mail.gmail.com%3E and I have tried implementing it as follows:

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def fileName(data):
    string = data.toDebugString

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingFileNamePrinter")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream("file:///test/input/")
    files = lines.foreachRDD(fileName)
    print(files)
    ssc.start()
    ssc.awaitTermination()

Unfortunately, now rather than listening to the folder every second, it listens once, outputs 'None' and then just waits, doing nothing. The only difference between this and the code that did work is the line

files = lines.foreachRDD(fileName)

Before I even worry about getting the filename (tomorrow's problem), can anybody see why this is only checking the directory once?

Thanks in advance, M

So it was a noob error. I'm posting my solution for reference for myself and others.

As pointed out by @user3689574, I was not returning the debug string in my function. This fully explains why I was getting the 'None'.

Next, I was printing the debug string outside of the function, meaning it was never part of the foreachRDD. Moving it into the function as follows:

def fileName(data):
    debug = data.toDebugString()
    print(debug)

This prints the debug information and continues to listen to the directory, as it should. That change fixed my initial problem. From there, getting the file name is pretty straightforward.

The debug string when there is no change in the directory is as follows:

(0) MapPartitionsRDD[1] at textFileStream at NativeMethodAccessorImpl.java:-2 []
 |  UnionRDD[0] at textFileStream at NativeMethodAccessorImpl.java:-2 []

Which neatly indicates that there is no file. When a file is copied into the directory, the debug output is as follows:

(1) MapPartitionsRDD[42] at textFileStream at NativeMethodAccessorImpl.java:-2 []
 |  UnionRDD[41] at textFileStream at NativeMethodAccessorImpl.java:-2 []
 |  file:/test/input/test.txt NewHadoopRDD[40] at textFileStream at NativeMethodAccessorImpl.java:-2 []

Which, with a quick regex, gives you the file name with little trouble. Hope this helps somebody else.
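For reference, a minimal sketch of that regex step; the file:/\S+ pattern and the bytes handling are my own assumptions based on the debug output shown above, not part of the original code:

import re

def fileName(data):
    # The debug string lists one "file:/..." entry per file in this batch
    debug = data.toDebugString()
    if isinstance(debug, bytes):  # newer PySpark versions return the debug string as bytes
        debug = debug.decode("utf-8")
    print(re.findall(r"file:/\S+", debug))  # e.g. ['file:/test/input/test.txt']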

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def get_file_info(rdd):
    file_content = rdd.collect()
    file_name = rdd.toDebugString()
    print(file_name, file_content)


def main():
    sc = SparkContext("local[2]", "deneme")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval, i.e. one DStream batch at a time

    lines = ssc.textFileStream('../urne')
    # here is the call
    lines.foreachRDD(lambda rdd: get_file_info(rdd))

    # Split each line into words
    words = lines.flatMap(lambda line: line.split("\n"))

    # Count each word in each batch
    pairs = words.map(lambda word: (word, 1))

    wordCounts = pairs.reduceByKey(lambda x, y: x + y)

    wordCounts.pprint()

    ssc.start()
   
    ssc.awaitTermination()
   

if __name__ == "__main__":
    main()

Then, when you get some result like this:

b'(3) MapPartitionsRDD[237] at textFileStream at NativeMethodAccessorImpl.java:0 []\n |  UnionRDD[236] at textFileStream at NativeMethodAccessorImpl.java:0 []\n |  file:/some/directory/file0.068513 NewHadoopRDD[231] at textFileStream at NativeMethodAccessorImpl.java:0 []\n |  file:/some/directory/file0.069317 NewHadoopRDD[233] at textFileStream at NativeMethodAccessorImpl.java:0 []\n |  file:/some/directory/file0.070036 NewHadoopRDD[235] at textFileStream at NativeMethodAccessorImpl.java:0 []' ['6', '3', '4', '3', '6', '0', '1', '7', '10', '2', '0', '0', '1', '1', '10', '8', '7', '7', '0', '8', '8', '9', '7', '2', '9', '1', '5', '8', '9', '9', '0', '6', '0', '4', '3', '4', '8', '5', '8', '10', '5', '2', '3', '6', '10', '2', '1', '0', '4', '3', '1', '8', '2', '10', '4', '0', '4', '4', '1', '4', '3', '1', '2', '5', '5', '3', ]

Write a regex to pull out the file names alongside the collected content; the debug string shows that Spark read 3 files in this one DStream batch, so you can work from there.
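As a hedged sketch of that last step (the file:/\S+ pattern and the decode call are illustrative assumptions based on the bytes output above, not from the original answer), the regex can be dropped straight into get_file_info:

import re

def get_file_info(rdd):
    file_content = rdd.collect()
    debug = rdd.toDebugString().decode("utf-8")  # the debug string comes back as bytes, as in the output above
    # One "file:/..." entry appears per file that made up this DStream batch
    file_names = re.findall(r"file:/\S+", debug)
    print(file_names, file_content)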
