
Read Files from multiple folders in Apache Beam and map outputs to filenames

I'm working on reading files from multiple folders and then outputting the file contents together with the file name, like (filecontents, filename), to BigQuery in Apache Beam using the Python SDK and a Dataflow runner.

I originally thought I could create a PCollection for each file and then map the file contents to the filename.

import apache_beam as beam
from apache_beam.io.textio import ReadFromText


def read_documents(pipeline):
    """Read the documents at the provided uris and return (line, uri) pairs."""
    pcolls = []
    # TESTIN is the path to a local file listing the input uris, one per line.
    with open(TESTIN) as uris:
        for uri in uris:
            uri = str(uri).strip("[]/'\n")
            pcolls.append(
                pipeline
                | 'Read: uri' + uri >> ReadFromText(uri, compression_type='gzip')
                | 'WithKey: uri' + uri >> beam.Map(lambda v, uri: (v, uri), uri)
            )
    return pcolls | 'FlattenReadPColls' >> beam.Flatten()

This worked fine but was slow and wouldn't work on Dataflow in the cloud beyond roughly 10,000 files; past that point it would fail with a broken pipe.

I'm currently trying to extend the ReadAllFromText class from textio. textio is designed to read tons of files quickly from a PCollection of filenames or patterns. There is a bug in this module when reading from Google Cloud Storage and the file has content encoding: Google Cloud Storage automatically gunzips files and transcodes them, but for some reason ReadAllFromText doesn't work with that. You have to change the metadata of the file to remove the content encoding and set the compression type on ReadAllFromText to gzip. I'm including the issue URL in case anyone else has problems with ReadAllFromText: https://issues.apache.org/jira/browse/BEAM-1874
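For reference, clearing the content encoding can be done ahead of the pipeline with the google-cloud-storage client, something like this sketch (bucket and object names are placeholders):

    from google.cloud import storage

    # Sketch: remove the Content-Encoding metadata so GCS stops transcoding on
    # download and Beam can handle the gzip itself.
    client = storage.Client()
    blob = client.bucket('my-bucket').get_blob('folder1/file1.gz')
    blob.content_encoding = None  # clear the header
    blob.patch()                  # push the metadata change back to GCS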

My current code looks like this:

import apache_beam as beam
from apache_beam.io.textio import ReadAllFromText


class ReadFromGs(ReadAllFromText):

    def __init__(self):
        super(ReadFromGs, self).__init__(compression_type='gzip')

    def expand(self, pvalue):
        files = self._read_all_files
        return (
            pvalue
            | 'ReadAllFiles' >> files
            # 'filename' is a placeholder for the input filename that I'm
            # trying to figure out how to include in the output.
            | 'Map values' >> beam.Map(lambda v: (v, filename))
        )

ReadAllFromText lives in textio.py, uses ReadAllFiles from filebasedsource.py, and inherits from PTransform.
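For context, plain ReadAllFromText is normally used like the sketch below (the bucket paths are placeholders), which is also why the filename gets lost: the transform only emits lines.

    import apache_beam as beam
    from apache_beam.io.textio import ReadAllFromText

    # Sketch of ReadAllFromText usage: patterns in, lines out, no filenames.
    with beam.Pipeline() as pipeline:
        lines = (
            pipeline
            | beam.Create(['gs://my-bucket/folder1/*.gz',   # placeholder patterns
                           'gs://my-bucket/folder2/*.gz'])
            | ReadAllFromText(compression_type='gzip')
        )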

I believe I'm just missing something simple.

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py

https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py

As you found, ReadFromText doesn't currently support dynamic filenames, and you definitely don't want to create individual steps for each URL. From your initial sentence I understand you want to get the filename and the file content out as one item. That means you won't need or benefit from any streaming of parts of the file; you can simply read the file contents. Something like:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


def read_all_from_url(url):
    # FileSystems.open infers the compression from the file extension, so
    # .gz files are decompressed automatically.
    with FileSystems.open(url) as f:
        return f.read()


def read_from_urls(pipeline, urls):
    return (
        pipeline
        | beam.Create(urls)
        | 'Read File' >> beam.Map(lambda url: (
            url,
            read_all_from_url(url)
        ))
    )
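If the end goal is still BigQuery, wiring it up could look something like the sketch below, built on the read_from_urls function above (the project, dataset, table and schema names are placeholders):

    import apache_beam as beam

    # Sketch: feed the (url, contents) tuples into BigQuery. Adjust the table
    # reference and schema to your own; contents may also need decoding.
    urls = ['gs://my-bucket/folder1/file1.gz',
            'gs://my-bucket/folder2/file2.gz']

    with beam.Pipeline() as pipeline:
        (
            read_from_urls(pipeline, urls)
            | 'To Row' >> beam.Map(lambda kv: {'filename': kv[0],
                                               'contents': kv[1]})
            | 'Write' >> beam.io.WriteToBigQuery(
                'my-project:my_dataset.file_contents',
                schema='filename:STRING,contents:STRING')
        )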

You can customise read_all_from_url if you think you're having issues with metadata. The output of read_from_urls will be a tuple (url, file contents). If your file contents are very large you might need a slightly different approach depending on your use case, for example reading line by line as sketched below.
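A rough sketch of that, emitting (url, line) pairs instead of whole files, assuming the handle returned by FileSystems.open supports readline (the function names here are my own, not part of your code):

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems


    def read_lines_from_url(url):
        """Yield (url, line) pairs without holding the whole file in memory."""
        with FileSystems.open(url) as f:
            while True:
                line = f.readline()
                if not line:
                    break
                yield url, line.rstrip(b'\n')


    def read_lines_from_urls(pipeline, urls):
        return (
            pipeline
            | beam.Create(urls)
            | 'Read Lines' >> beam.FlatMap(read_lines_from_url)
        )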
