Read a set of xml files using Google Cloud DataFlow python sdk
I am trying to read a collection of XML files from a GCS bucket and process them, where each element of the collection is a string representing an entire file. I can't find a proper example of how to achieve this, and I can't work it out from the Apache Beam documentation, which mostly covers the Java SDK.
My current pipeline looks like this:
```python
p = beam.Pipeline(options=PipelineOptions(pipeline_args))
(p
 | 'Read from a File' >> beam.io.Read(training_files_folder)
 | 'String To BigQuery Row' >> beam.Map(
     lambda s: data_ingestion.parse_method(s))
 | 'Write to BigQuery' >> beam.io.Write(
     beam.io.BigQuerySink(
         known_args.output,
         schema='title:STRING,text:STRING,id:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run().wait_until_finish()
```
The error message I get is:
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.2.1\helpers\pydev\pydevd.py", line 1664, in <module>
main()
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.2.1\helpers\pydev\pydevd.py", line 1658, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "C:\Program Files\JetBrains\PyCharm Community Edition 2018.2.1\helpers\pydev\pydevd.py", line 1068, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:/Users/Tomer/PycharmProjects/hyperpartisan/cloud-version/data_ingestion.py", line 135, in <module>
run()
File "C:/Users/Tomer/PycharmProjects/hyperpartisan/cloud-version/data_ingestion.py", line 130, in run
p.run().wait_until_finish()
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\direct_runner.py", line 421, in wait_until_finish
self._executor.await_completion()
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\executor.py", line 398, in await_completion
self._executor.await_completion()
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\executor.py", line 444, in await_completion
six.reraise(t, v, tb)
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\executor.py", line 341, in call
finish_state)
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\executor.py", line 366, in attempt_call
side_input_values)
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\transform_evaluator.py", line 109, in get_evaluator
input_committed_bundle, side_inputs)
File "C:\Users\Tomer\anaconda\envs\hyperpartisan\lib\site-packages\apache_beam\runners\direct\transform_evaluator.py", line 283, in __init__
self._source.pipeline_options = evaluation_context.pipeline_options
AttributeError: 'str' object has no attribute 'pipeline_options'
Any help is much appreciated. Thanks, Tomer
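For context: the bottom frame of the traceback shows the runner assigning `pipeline_options` to what it thinks is a source but is actually the string `training_files_folder`. `beam.io.Read` expects a `Source` object (an `iobase.BoundedSource` subclass), not a file path, which is likely the real cause of the error. A minimal sketch of the distinction, with an illustrative bucket path:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions())

# beam.io.Read wraps a Source object, so passing a plain string path
# triggers the AttributeError above when the runner touches the "source":
# lines = p | beam.io.Read('gs://my-bucket/files/*.xml')  # wrong

# ReadFromText is the convenience transform that accepts a path or a
# glob pattern directly (the bucket path here is illustrative):
lines = p | beam.io.ReadFromText('gs://my-bucket/files/*.xml')
```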
Solved the first issue: it turns out this doesn't work with the DirectRunner; changing the runner to DataflowRunner and replacing Read with ReadFromText made the exception go away:
```python
p = beam.Pipeline(options=PipelineOptions(pipeline_args))
(p
 | 'Read from a File' >> beam.io.ReadFromText(training_files_folder)
 | 'String To BigQuery Row' >> beam.Map(
     lambda s: data_ingestion.parse_method(s))
 | 'Write to BigQuery' >> beam.io.Write(
     beam.io.BigQuerySink(
         known_args.output,
         schema='title:STRING,text:STRING,id:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run().wait_until_finish()
```
But now I see that this approach gives me one line from each file as a pipeline element, whereas I want the whole file as a string for each element. I'm not sure how to do that. I found this post, but it is in Java, and I'm not sure how it translates to Python and to GCS specifically.

So it seems ReadFromText won't work for my use case, and I don't know how to create a pipeline of files instead.
Solution: thanks to Ankur's help, I modified the code to include the steps needed to convert the list of MatchResult objects that GCSFileSystem returns into a PCollection of strings, each representing one file.
```python
import apache_beam as beam
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions(pipeline_args))
gcs = GCSFileSystem(PipelineOptions(pipeline_args))
gcs_reader = GCSFileReader(gcs)
(p
 | 'Read Files' >> beam.Create(
     [m.metadata_list for m in gcs.match([training_files_folder])])
 | 'metadata_list to filepath' >> beam.FlatMap(
     lambda metadata_list: [metadata.path for metadata in metadata_list])
 | 'string To BigQuery Row' >> beam.Map(
     lambda filepath: data_ingestion.parse_method(
         gcs_reader.get_string_from_filepath(filepath)))
 | 'Write to BigQuery' >> beam.io.Write(
     beam.io.BigQuerySink(
         known_args.output,
         schema='title:STRING,text:STRING,id:STRING',
         # Creates the table in BigQuery if it does not yet exist.
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         # Appends data to the BigQuery table.
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)))
p.run().wait_until_finish()
```
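For reference, a rough sketch of the shape of what `gcs.match` returns (the attribute names come from `apache_beam.io.filesystem`; the bucket path is illustrative):

```python
# gcs.match takes a list of patterns and returns one MatchResult per
# pattern; each MatchResult carries a metadata_list of FileMetadata
# entries describing the matched files.
for result in gcs.match(['gs://my-bucket/xml/*.xml']):
    for metadata in result.metadata_list:
        print(metadata.path, metadata.size_in_bytes)
```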
The code uses this helper class to read GCS files:
```python
class GCSFileReader:
    """Helper class to read GCS files."""
    def __init__(self, gcs):
        self.gcs = gcs

    def get_string_from_filepath(self, filepath):
        with self.gcs.open(filepath) as reader:
            res = reader.read()
        return res
```
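A quick usage sketch (the object path is hypothetical):

```python
gcs = GCSFileSystem(PipelineOptions(pipeline_args))
reader = GCSFileReader(gcs)
# Returns the entire file contents as one string of bytes:
xml_doc = reader.get_string_from_filepath('gs://my-bucket/xml/article1.xml')
```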
ReadFromText reads the files at the given path line by line. What you want is a list of the files, and then to read one file at a time inside a ParDo using GcsFileSystem (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsfilesystem.py), and then write the contents to BigQuery.

You can also refer to a mail thread on a similar topic: https://lists.apache.org/thread.html/85da22a845cef8edd942fcc4906a7b47040a4ae8e10aef4ef00be233@%3Cuser.beam.apache.org%3E
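A minimal sketch of that ParDo-based approach, assuming an illustrative bucket pattern and a hypothetical `ReadWholeFile` DoFn name:

```python
import apache_beam as beam
from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem
from apache_beam.options.pipeline_options import PipelineOptions

class ReadWholeFile(beam.DoFn):
    """Emits the full contents of one GCS file per input path."""
    def __init__(self, options):
        self.options = options

    def process(self, path):
        # Open the filesystem on the worker, not at graph
        # construction time.
        gcs = GCSFileSystem(self.options)
        with gcs.open(path) as f:
            yield f.read()

options = PipelineOptions()
gcs = GCSFileSystem(options)
# Expand the glob up front, then hand one path per element to the DoFn.
paths = [metadata.path
         for result in gcs.match(['gs://my-bucket/xml/*.xml'])
         for metadata in result.metadata_list]

with beam.Pipeline(options=options) as p:
    files = (p
             | beam.Create(paths)
             | beam.ParDo(ReadWholeFile(options)))
```

In newer SDK releases, `apache_beam.io.fileio.MatchFiles` followed by `ReadMatches` gives the same file-per-element behaviour without a custom DoFn, if your Beam version includes those transforms.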