How to read multiple JSON files from GCS bucket in google dataflow apache beam python
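I have a bucket in GCS that contains a list of JSON files. I extract the list of file names with:

    from google.cloud import storage

    def list_blobs(bucket_name):
        storage_client = storage.Client()
        blobs = storage_client.list_blobs(bucket_name)
        json_paths = []
        for blob in blobs:
            # Build the full gs:// path of every object in the bucket.
            json_paths.append(f"gs://{bucket_name}/{blob.name}")
        return json_paths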
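Now I want to pass this list of file names to Apache Beam so that it reads them. I wrote the following code, but it does not look like a good pattern:

    # p is the beam.Pipeline; list_files is the result of list_blobs().
    final_result = []
    for i, file in enumerate(list_files):
        print("parsing file:", file)
        # One separate ReadFromText transform is added per file.
        concat_data = (p | 'Data {}'.format(i) >> ReadFromText(file))
        final_result.append(concat_data)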
Have you faced the same issue before?
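In the end I used the google-cloud-storage client library as the read API for this.
Listing all the elements of the bucket:

    from google.cloud import storage

    def list_blobs(bucket_name):
        """Lists all the blobs in the bucket."""
        storage_client = storage.Client()
        blobs = storage_client.list_blobs(bucket_name)
        json_paths = []
        for blob in blobs:
            # Keep only the object name; the bucket is passed separately.
            # json_paths.append(f"gs://{bucket_name}/{blob.name}")
            json_paths.append(f"{blob.name}")
        return json_paths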
I created this DoFn (applied with ParDo) to read the content:

    import apache_beam as beam
    from google.cloud import storage

    class ReadFileContent(beam.DoFn):
        def setup(self):
            # Called whenever the DoFn instance is deserialized on the worker.
            # This can happen more than once per worker, because multiple
            # instances of a given DoFn subclass may be created (e.g. due to
            # parallelization, or garbage collection after a period of disuse).
            # It is a good place to connect to databases, open network
            # connections or acquire other resources.
            self.storage_client = storage.Client()

        def process(self, file_name, bucket_name):
            bucket = self.storage_client.get_bucket(bucket_name)
            blob = bucket.get_blob(file_name)
            # Yields the raw file content as bytes.
            yield blob.download_as_string()
My pipeline looks like this:

    list_files = list_blobs(bucket_name)

    with beam.Pipeline(options=pipeline_options) as p:
        results = (
            p | 'Create' >> beam.Create(list_files)
              | 'Read each file content' >> beam.ParDo(ReadFileContent(), bucket_name)
              | 'next transformation' >> ...
        )
Take a look at ReadAllFromText. ReadFromText is a PTransform for reading text files. ReadAllFromText, on the other hand, is a PTransform that takes a PCollection of text files or file patterns as input and produces a PCollection of strings.
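A minimal sketch of how ReadAllFromText could replace the per-file loop from the question is shown here. It assumes the input is the list of gs:// paths returned by the question's first list_blobs(), and that every line of every file is a complete JSON document; the names parse_json_line and run are made up for this illustration.

```python
import json


def parse_json_line(line):
    """Parse one line of text as a JSON document."""
    return json.loads(line)


def run(file_patterns, pipeline_args=None):
    """Read every file in file_patterns and parse each line as JSON.

    file_patterns is a list of paths or glob patterns (assumed here to be
    the gs:// paths built by list_blobs in the question).
    """
    # Beam is imported lazily so parse_json_line can be tested without it.
    import apache_beam as beam
    from apache_beam.io import ReadAllFromText
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        _ = (
            p
            # The file names become elements of a PCollection ...
            | 'File names' >> beam.Create(file_patterns)
            # ... and a single transform reads all of them, line by line.
            | 'Read all files' >> ReadAllFromText()
            | 'String to JSON' >> beam.Map(parse_json_line)
            | 'Print' >> beam.Map(print)
        )
```

It would be called as run(list_blobs(bucket_name)). Compared with the loop, the pipeline graph contains one read step regardless of how many files there are. On a remote runner such as Dataflow, module-level imports used inside transforms may need the save_main_session pipeline option.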
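If you only have a few JSON files to process, here is another, more optimized answer. You can use the ReadFromText transform with a path pattern:

    # ReadFromText emits one element per line, so this assumes that every
    # line is a complete JSON document (newline-delimited JSON).
    (p | 'Read content' >> ReadFromText('gs://bucket/folder/*')
       | 'String to JSON' >> beam.Map(lambda k: json.loads(k))
    )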
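ReadFromText splits its input into lines, so the pattern-based read fits newline-delimited JSON. If each file instead holds one JSON document spread over several lines, a possible variant is to match the files and read each one whole with Beam's fileio module. This is only a sketch under that assumption; parse_json_document and run are hypothetical names, and the pattern is supplied by the caller.

```python
import json


def parse_json_document(text):
    """Parse the full text of one file as a single JSON document."""
    return json.loads(text)


def run(file_pattern, pipeline_args=None):
    """Read each file matching file_pattern whole and parse it as JSON."""
    # Beam is imported lazily so parse_json_document can be tested without it.
    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        _ = (
            p
            # Expand the glob pattern into file metadata records.
            | 'Match files' >> fileio.MatchFiles(file_pattern)
            # Turn each match into a readable file handle.
            | 'Open files' >> fileio.ReadMatches()
            # Read the entire file as UTF-8 text and parse it once.
            | 'File to JSON' >> beam.Map(
                lambda f: parse_json_document(f.read_utf8()))
            | 'Print' >> beam.Map(print)
        )
```

For example, run('gs://bucket/folder/*') would produce one parsed object per file instead of one per line. Each file is loaded into memory in full, so this suits many small files better than a few very large ones.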