How to convert csv into a dictionary in apache beam dataflow
I would like to read a CSV file and write it to BigQuery using Apache Beam Dataflow. In order to do this I need to present the data to BigQuery in the form of a dictionary. How can I transform the data using Apache Beam in order to do this?

My input CSV file has two columns, and I want to create a corresponding two-column table in BigQuery. I know how to create data in BigQuery; that's straightforward. What I don't know is how to transform the CSV into a dictionary. The code below is not correct, but should give an idea of what I'm trying to do.
# Standard imports
import apache_beam as beam

# Create a pipeline executing on a direct runner (local, non-cloud).
p = beam.Pipeline('DirectPipelineRunner')

# Create a PCollection with names and write it to a file.
(p
 | 'read solar data' >> beam.Read(beam.io.TextFileSource('./sensor1_121116.csv'))
 # How do you do this??
 | 'convert to dictionary' >> beam.Map(lambda (k, v): {'luminosity': k, 'datetime': v})
 | 'save' >> beam.Write(
     beam.io.BigQuerySink(
         output_table,
         schema='month:INTEGER, tornado_count:INTEGER',
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))
p.run()
Edit: as of version 2.12.0, Beam comes with new fileio transforms that allow you to read from CSV without having to reimplement a source. You can do this like so:
import csv
import io
import sys

def get_csv_reader(readable_file):
    # You can return whichever kind of reader you want here:
    # a DictReader, or a normal csv.reader.
    if sys.version_info >= (3, 0):
        return csv.reader(io.TextIOWrapper(readable_file.open()))
    else:
        return csv.reader(readable_file.open())
with Pipeline(...) as p:
    content_pc = (p
                  | beam.io.fileio.MatchFiles("/my/file/name")
                  | beam.io.fileio.ReadMatches()
                  | beam.Reshuffle()  # Useful if you expect many matches
                  | beam.FlatMap(get_csv_reader))
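If you want this pipeline to emit BigQuery-ready dictionaries directly, the reader can be a csv.DictReader instead of a plain csv.reader. A minimal sketch (the helper name is mine; it assumes the binary file handle that ReadMatches yields):

```python
import csv
import io

def dict_rows(readable_file):
    # Objects yielded by beam.io.fileio.ReadMatches() expose .open(),
    # which returns a binary file-like handle; wrap it for text-mode
    # CSV parsing. The first CSV row supplies the dict keys.
    return csv.DictReader(io.TextIOWrapper(readable_file.open()))
```

In the pipeline above, the last step would then become `| beam.FlatMap(dict_rows)`.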
I recently wrote a test for this for Apache Beam. You can take a look at the Github repository.

The old answer relied on reimplementing a source. That is no longer the main recommended way of doing this : )
The idea is to have a source that returns parsed CSV rows. You can do this by subclassing the FileBasedSource class to include CSV parsing. In particular, the read_records function would look something like this:
class MyCsvFileSource(apache_beam.io.filebasedsource.FileBasedSource):
    def read_records(self, file_name, range_tracker):
        self._file = self.open_file(file_name)
        reader = csv.reader(self._file)
        for rec in reader:
            yield rec
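Since read_records yields plain lists of fields, a follow-up Map can pair each row with a known header to produce the dictionaries that BigQuerySink expects. A sketch, using the (hypothetical) column names from the question:

```python
COLUMNS = ['luminosity', 'datetime']  # hypothetical column order

def row_to_dict(rec):
    # Pair each CSV field with its column name.
    return dict(zip(COLUMNS, rec))

# In the pipeline this would follow the source, e.g.:
# beam.io.Read(MyCsvFileSource(...)) | beam.Map(row_to_dict)
```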
As a supplement to Pablo's post, I'd like to share a little change I made myself to his sample. (+1 for you!)

Changed: reader = csv.reader(self._file)
to: reader = csv.DictReader(self._file)
The csv.DictReader uses the first row of the CSV file as the dict keys. The remaining rows each populate one dict with their values; it automatically maps each value to the correct key based on column order.
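For illustration, here is csv.DictReader on a small in-memory CSV (the column names are just examples):

```python
import csv
import io

# The header row supplies the keys; each data row becomes one dict.
data = io.StringIO("luminosity,datetime\n850,2016-11-12\n")
rows = list(csv.DictReader(data))
# rows[0] holds {'luminosity': '850', 'datetime': '2016-11-12'}
```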
One little detail is that every value in the dict is stored as a string. This may conflict with your BigQuery schema if you use e.g. INTEGER for some fields, so you need to take care of proper casting afterwards.
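One way to handle that casting is a small conversion step before the write (the field names and types here are only an example):

```python
def cast_row(row):
    # Convert string fields from DictReader to the types the
    # BigQuery schema expects; names/types here are hypothetical.
    return {
        'luminosity': int(row['luminosity']),
        'datetime': row['datetime'],
    }
```

In Beam this could run as `beam.Map(cast_row)` just before the BigQuery write step.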