
How to read multiple JSON files from GCS bucket in google dataflow apache beam python

I have a bucket in GCS that contains a list of JSON files. I extracted the list of file names using:

from google.cloud import storage

def list_blobs(bucket_name):
    """Lists the gs:// paths of all blobs in the bucket."""
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    json_paths = []
    for blob in blobs:
        json_paths.append(f"gs://{bucket_name}/{blob.name}")
    return json_paths

Now I want to pass this list of filenames to Apache Beam to read them. I wrote this code, but it doesn't seem like a good pattern:

final_result = []
for i, file in enumerate(list_files):
    print("parsing file:", file)
    # One ReadFromText step per file, collected into a list of PCollections.
    concat_data = p | 'Data {}'.format(i) >> ReadFromText(file)
    final_result.append(concat_data)

Have you faced the same issue before?

In the end I used the google-cloud-storage client library as the reading API for this.

Listing all elements of the bucket:

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    storage_client = storage.Client()
    blobs = storage_client.list_blobs(bucket_name)
    json_paths = []
    for blob in blobs:
        # json_paths.append(f"gs://{bucket_name}/{blob.name}")
        json_paths.append(f"{blob.name}")
    return json_paths

and I created this ParDo for reading the content:

class ReadFileContent(beam.DoFn):

    def setup(self):
        # Called whenever the DoFn instance is deserialized on the worker.
        # This means it can be called more than once per worker because multiple
        # instances of a given DoFn subclass may be created (e.g., due to
        # parallelization, or due to garbage collection after a period of disuse).
        # This is a good place to connect to database instances, open network
        # connections or other resources.
        self.storage_client = storage.Client()

    def process(self, file_name, bucket_name):
        bucket = self.storage_client.get_bucket(bucket_name)
        blob = bucket.get_blob(file_name)
        yield blob.download_as_string()

And my pipeline looked like this:

list_files = list_blobs(bucket_name)

with beam.Pipeline(options=pipeline_options) as p:

    results = (
        p | 'Create' >> beam.Create(list_files)
          | 'Read each file content' >> beam.ParDo(ReadFileContent(), bucket_name)
          | 'next transformation' >> ...
    )
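
Purely as an illustration (the original leaves the next transformation elided), the step after the ParDo could decode and parse each downloaded payload, since blob.download_as_string() returns the whole file content as bytes. The 'Parse JSON' step below is a hypothetical sketch, assuming import json plus the names defined above:

    results = (
        p | 'Create' >> beam.Create(list_files)
          | 'Read each file content' >> beam.ParDo(ReadFileContent(), bucket_name)
          # Hypothetical next step, not part of the original answer: each element
          # is the raw bytes of one whole file, so decode it and parse it as a
          # single JSON document.
          | 'Parse JSON' >> beam.Map(lambda payload: json.loads(payload.decode("utf-8")))
    )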

Have a look at this link. ReadFromText is a PTransform that reads a text file. ReadAllFromText, on the other hand, is a PTransform that reads a PCollection of text files or file patterns and produces a PCollection of strings.
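
A minimal sketch of that ReadAllFromText approach (not part of the original answers; it assumes list_files holds the gs:// paths built by the question's list_blobs helper and that pipeline_options is defined as above):

import json
import apache_beam as beam

with beam.Pipeline(options=pipeline_options) as p:
    results = (
        p | 'File paths' >> beam.Create(list_files)
          # ReadAllFromText reads every file/pattern in the input PCollection.
          # Like ReadFromText it splits files on newlines, so this assumes one
          # JSON object per line.
          | 'Read all files' >> beam.io.ReadAllFromText()
          | 'String to JSON' >> beam.Map(json.loads)
    )

Compared to creating one ReadFromText step per file in a loop, this keeps a single read step and lets the runner parallelise the reads over the file-path PCollection.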

This is another answer that is more optimised if you only have a few JSON files to process. You can use the ReadFromText transformation with a path pattern:

(p | 'Read content' >> ReadFromText('gs://bucket/folder/*')
   | 'String to JSON' >> beam.Map(lambda k: json.loads(k)))
