
How to convert csv into a dictionary in apache beam dataflow

I would like to read a CSV file and write it to BigQuery using Apache Beam Dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using Apache Beam in order to do this?

My input CSV file has two columns, and I want to create a corresponding two-column table in BigQuery. I know how to create data in BigQuery; that's straightforward. What I don't know is how to transform the CSV into a dictionary. The code below is not correct, but it should give an idea of what I'm trying to do.

# Standard imports
import apache_beam as beam
# Create a pipeline executing on a direct runner (local, non-cloud).
p = beam.Pipeline('DirectPipelineRunner')
# Create a PCollection with names and write it to a file.
(p
| 'read solar data' >> beam.Read(beam.io.TextFileSource('./sensor1_121116.csv'))
# How do you do this??
| 'convert to dictionary' >> beam.Map(lambda (k, v): {'luminosity': k, 'datetime': v})
| 'save' >> beam.Write(
    beam.io.BigQuerySink(
        output_table,
        schema='month:INTEGER, tornado_count:INTEGER',
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()

Edit: as of version 2.12.0, Beam comes with new fileio transforms that allow you to read from CSV without having to reimplement a source. You can do this like so:

import csv
import io
import sys

import apache_beam as beam
from apache_beam import Pipeline
from apache_beam.io import fileio  # makes beam.io.fileio available


def get_csv_reader(readable_file):
  # You can return whichever kind of reader you want here:
  # a DictReader, or a normal csv.reader.
  if sys.version_info >= (3, 0):
    return csv.reader(io.TextIOWrapper(readable_file.open()))
  else:
    return csv.reader(readable_file.open())

with Pipeline(...) as p:
  content_pc = (p
                | beam.io.fileio.MatchFiles("/my/file/name")
                | beam.io.fileio.ReadMatches()
                | beam.Reshuffle()  # Useful if you expect many matches
                | beam.FlatMap(get_csv_reader))
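
To get from the parsed rows to the dictionaries that BigQuery needs, one possible continuation is a Map followed by beam.io.WriteToBigQuery. This is only a sketch: the column names 'luminosity' and 'datetime' come from the question, the table name is a placeholder, and everything csv.reader yields is a string, hence the STRING schema.

  # Still inside the `with Pipeline(...) as p:` block above.
  _ = (content_pc
       | 'convert to dictionary' >> beam.Map(
           lambda row: {'luminosity': row[0], 'datetime': row[1]})
       | 'save' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.sensor_data',  # placeholder table name
           schema='luminosity:STRING,datetime:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))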

I recently wrote a test for this for Apache Beam. You can take a look at the Github repository.


The old answer relied on reimplementing a source. This is no longer the main recommended way of doing this : )

The idea is to have a source that returns parsed CSV rows. You can do this by subclassing the FileBasedSource class to include CSV parsing. In particular, the read_records function would look something like this:

import csv

import apache_beam


class MyCsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
  def read_records(self, file_name, range_tracker):
    self._file = self.open_file(file_name)

    reader = csv.reader(self._file)

    # Emit each parsed CSV row as an element of the PCollection.
    for rec in reader:
      yield rec
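
For completeness, a rough sketch of how such a source might be wired into a pipeline. The file name and column names are taken from the question; the table name is a placeholder, and since csv.reader yields strings the schema here uses STRING for both fields.

with apache_beam.Pipeline() as p:
  (p
   | 'read solar data' >> apache_beam.io.Read(
       MyCsvFileSource('./sensor1_121116.csv'))
   | 'convert to dictionary' >> apache_beam.Map(
       lambda row: {'luminosity': row[0], 'datetime': row[1]})
   | 'save' >> apache_beam.io.WriteToBigQuery(
       'my-project:my_dataset.sensor_data',  # placeholder table name
       schema='luminosity:STRING,datetime:STRING'))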

As a supplement to Pablo's post, I'd like to share a little change I made myself to his sample. (+1 for you!)

Changed: reader = csv.reader(self._file) to reader = csv.DictReader(self._file)

csv.DictReader uses the first row of the CSV file as the dict keys. The other rows are used to populate one dict per row with their values. It automatically maps the values to the correct keys based on column order.

One little detail is that every value in the dict is stored as a string. This may conflict with your BigQuery schema if you use, e.g., INTEGER for some fields, so you need to take care of proper casting afterwards.
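
A minimal sketch of that casting step, assuming a hypothetical INTEGER field called 'luminosity' in the row dicts produced by the DictReader:

def cast_types(row):
  # DictReader gives every value back as a string; cast the fields that
  # the BigQuery schema declares as INTEGER ('luminosity' is hypothetical).
  row['luminosity'] = int(row['luminosity'])
  return row

# ... | beam.Map(cast_types) | beam.io.WriteToBigQuery(...)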
