
Using wholeTextFiles in pyspark but getting an out of memory error

I have some files (part-00000.gz, part-00001.gz, part-00002.gz, ...) and each part is rather large. I need the filename of each part because it contains timestamp information. As far as I know, it seems that in pyspark only wholeTextFiles can read the input as (filename, content) pairs. However, I get an out of memory error when using wholeTextFiles. So my guess is that wholeTextFiles reads a whole part file as the content in a mapper, without any partitioning. I also found this answer ( How does the number of partitions affect `wholeTextFiles` and `textFiles`? ). If so, how can I get the filename of a rather large part file? Thanks
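For reference, the pattern that runs out of memory looks roughly like this (a minimal sketch; the path is hypothetical and sc is an existing SparkContext):

# wholeTextFiles returns an RDD of (filename, content) pairs, so each
# file's entire decompressed content becomes a single record in memory.
rdd = sc.wholeTextFiles("/path/to/files/part-*.gz")
first_name, first_content = rdd.first()  # the filename carries the timestamp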

You get the error because wholeTextFiles reads each entire file as a single record, so a large file has to fit in memory all at once. You're better off reading the file line by line, which you can do simply by writing your own generator and using the flatMap function. Here's an example of doing that to read a gzip file:

import glob
import gzip

def read_fun_generator(filename):
    # Stream one line at a time instead of loading the whole file into memory
    with gzip.open(filename, 'rt') as f:
        for line in f:
            yield line.strip()

# Note: the files must be reachable at this path from every worker node
gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_from_gz = sc.parallelize(gz_filelist).flatMap(read_fun_generator)
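
Since the question needs the timestamp from the filename, a small variant (my own tweak, not part of the original answer; read_with_name is a hypothetical helper) is to have the generator yield (filename, line) pairs so the name travels with every record:

import glob
import gzip

def read_with_name(filename):
    # Pair each streamed line with the file it came from
    with gzip.open(filename, 'rt') as f:
        for line in f:
            yield (filename, line.strip())

gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_with_names = sc.parallelize(gz_filelist).flatMap(read_with_name)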
