
Using wholeTextFiles in pyspark but getting an out of memory error

I have some files (part-00000.gz, part-00001.gz, part-00002.gz, ...) and each part is rather large. I need the filename of each part because it contains timestamp information. As far as I know, it seems that in pyspark only wholeTextFiles can read the input as (filename, content) pairs. However, I get an out of memory error when using wholeTextFiles. So my guess is that wholeTextFiles reads a whole part file as the content in a mapper, without any partitioning. I also found this answer ( How does the number of partitions affect `wholeTextFiles` and `textFiles`? ). If so, how can I get the filename of a rather large part file? Thanks
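For reference, the pattern that runs out of memory looks roughly like this (a minimal sketch; the path is hypothetical and sc is an existing SparkContext):

# wholeTextFiles returns an RDD of (filename, content) pairs, so each
# file's entire decompressed content becomes a single record in memory.
rdd = sc.wholeTextFiles("/path/to/files/part-*.gz")
first_name, first_content = rdd.first()  # the filename carries the timestamp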

You get the error because wholeTextFiles reads each entire file as a single record, so a large file has to fit in memory all at once. You're better off reading the file line by line, which you can do simply by writing your own generator and using the flatMap function. Here's an example of doing that to read a gzip file:

import glob
import gzip

def read_fun_generator(filename):
    # Stream one line at a time instead of loading the whole file into memory
    with gzip.open(filename, 'rt') as f:
        for line in f:
            yield line.strip()

# Note: the files must be reachable at this path from every worker node
gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_from_gz = sc.parallelize(gz_filelist).flatMap(read_fun_generator)
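
Since the question needs the timestamp from the filename, a small variant (my own tweak, not part of the original answer; read_with_name is a hypothetical helper) is to have the generator yield (filename, line) pairs so the name travels with every record:

import glob
import gzip

def read_with_name(filename):
    # Pair each streamed line with the file it came from
    with gzip.open(filename, 'rt') as f:
        for line in f:
            yield (filename, line.strip())

gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_with_names = sc.parallelize(gz_filelist).flatMap(read_with_name)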
