How to convert csv into a dictionary in apache beam dataflow
I would like to read a CSV file and write it to BigQuery using Apache Beam Dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using Apache Beam in order to do this?

My input CSV file has two columns, and I want to create a corresponding two-column table in BigQuery. I know how to create data in BigQuery; that's straightforward. What I don't know is how to transform the CSV into a dictionary. The code below is not correct, but should give an idea of what I'm trying to do.
# Standard imports
import apache_beam as beam

# Create a pipeline executing on a direct runner (local, non-cloud).
p = beam.Pipeline('DirectPipelineRunner')

# Create a PCollection with names and write it to a file.
(p
 | 'read solar data' >> beam.Read(beam.io.TextFileSource('./sensor1_121116.csv'))
 # How do you do this??
 | 'convert to dictionary' >> beam.Map(lambda (k, v): {'luminosity': k, 'datetime': v})
 | 'save' >> beam.Write(
     beam.io.BigQuerySink(
         output_table,
         schema='month:INTEGER, tornado_count:INTEGER',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()
Edit: as of version 2.12.0, Beam comes with new fileio transforms that allow you to read from CSV without having to reimplement a source. You can do this like so:
import csv
import io
import sys

def get_csv_reader(readable_file):
    # You can return whichever kind of reader you want here:
    # a DictReader, or a normal csv.reader.
    if sys.version_info >= (3, 0):
        return csv.reader(io.TextIOWrapper(readable_file.open()))
    else:
        return csv.reader(readable_file.open())
with Pipeline(...) as p:
    content_pc = (p
                  | beam.io.fileio.MatchFiles("/my/file/name")
                  | beam.io.fileio.ReadMatches()
                  | beam.Reshuffle()  # Useful if you expect many matches
                  | beam.FlatMap(get_csv_reader))
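If you want this pipeline to emit BigQuery-ready dictionaries directly, the reader can be a csv.DictReader instead of a plain csv.reader. A minimal sketch (the helper name is mine; it assumes the binary file handle that ReadMatches yields):

```python
import csv
import io

def dict_rows(readable_file):
    # Objects yielded by beam.io.fileio.ReadMatches() expose .open(),
    # which returns a binary file-like handle; wrap it for text-mode
    # CSV parsing. The first CSV row supplies the dict keys.
    return csv.DictReader(io.TextIOWrapper(readable_file.open()))
```

In the pipeline above, the last step would then become `| beam.FlatMap(dict_rows)`.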
I recently wrote a test for this for Apache Beam. You can take a look at the Github repository.

The old answer relied on reimplementing a source. That is no longer the main recommended way of doing this : )
The idea is to have a source that returns parsed CSV rows. You can do this by subclassing the FileBasedSource class to include CSV parsing. In particular, the read_records function would look something like this:
class MyCsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)
        reader = csv.reader(self._file)
        for rec in reader:
            yield rec
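Since read_records yields plain lists of fields, a follow-up Map can pair each row with a known header to produce the dictionaries that BigQuerySink expects. A sketch, using the (hypothetical) column names from the question:

```python
COLUMNS = ['luminosity', 'datetime']  # hypothetical column order

def row_to_dict(rec):
    # Pair each CSV field with its column name.
    return dict(zip(COLUMNS, rec))

# In the pipeline this would follow the source, e.g.:
# beam.io.Read(MyCsvFileSource(...)) | beam.Map(row_to_dict)
```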
As a supplement to Pablo's post, I'd like to share a little change I made myself to his sample. (+1 for you!)

Changed: reader = csv.reader(self._file)
to: reader = csv.DictReader(self._file)
The csv.DictReader uses the first row of the CSV file as the dict keys. The remaining rows each populate one dict with their values; it automatically maps each value to the correct key based on column order.
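For illustration, here is csv.DictReader on a small in-memory CSV (the column names are just examples):

```python
import csv
import io

# The header row supplies the keys; each data row becomes one dict.
data = io.StringIO("luminosity,datetime\n850,2016-11-12\n")
rows = list(csv.DictReader(data))
# rows[0] holds {'luminosity': '850', 'datetime': '2016-11-12'}
```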
One little detail is that every value in the dict is stored as a string. This may conflict with your BigQuery schema if you use e.g. INTEGER for some fields, so you need to take care of proper casting afterwards.
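One way to handle that casting is a small conversion step before the write (the field names and types here are only an example):

```python
def cast_row(row):
    # Convert string fields from DictReader to the types the
    # BigQuery schema expects; names/types here are hypothetical.
    return {
        'luminosity': int(row['luminosity']),
        'datetime': row['datetime'],
    }
```

In Beam this could run as `beam.Map(cast_row)` just before the BigQuery write step.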