Google Cloud Dataflow with Python
Trying to implement a simpler form of this example, I get an error while inserting data into BigQuery.
This is the code:
from __future__ import absolute_import
import argparse
import logging
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class DataIngestion:
    def parse_method(self, string_input):
        values = re.split(",", re.sub('\r\n', '', re.sub(u'"', '', string_input)))
        row = dict(zip('Mensaje', values))
        return row


def run(argv=None):
    """The main function which creates the pipeline and runs it."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input', dest='input', required=False,
        help='Input file to read. This can be a local file or '
             'a file in a Google Storage Bucket.',
        default='C:\XXXX\prueba.csv')
    parser.add_argument('--output', dest='output', required=False,
                        help='Output BQ table to write results to.',
                        default='PruebasIoT.TablaIoT')
    known_args, pipeline_args = parser.parse_known_args(argv)

    data_ingestion = DataIngestion()
    p = beam.Pipeline(options=PipelineOptions(pipeline_args))

    (p
     | 'Read from a File' >> beam.io.ReadFromText(known_args.input,
                                                  skip_header_lines=1)
     | 'String To BigQuery Row' >> beam.Map(
           lambda s: data_ingestion.parse_method(s))
     | 'Write to BigQuery' >> beam.io.Write(
           beam.io.BigQuerySink(
               known_args.output,
               schema='Mensaje:STRING')))

    p.run().wait_until_finish()


if __name__ == '__main__':
    # logging.getLogger().setLevel(logging.INFO)
    run()
And this is the error:
RuntimeError: Could not successfully insert rows to BigQuery table [XXX]. Errors: [<InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u'm'
message: u'no such field.'
reason: u'invalid'>]
index: 0>, <InsertErrorsValueListEntry
errors: [<ErrorProto
debugInfo: u''
location: u'm'
message: u'no such field.'
reason: u'invalid'>]
index: 1>]
I'm new to Python, and maybe the solution is quite simple, but how could I do it?
Would it be possible to pass a single string in 'String To BigQuery Row' instead of
'String To BigQuery Row' >> beam.Map(lambda s:
    data_ingestion.parse_method(s))
That would be an easier way to start, better than using CSV files and having to translate the file.
I understand you have an input CSV file with a single column, of the form:
Message
This is a message
This is another message
I am writing to BQ
If my understanding is correct, you do not need the parse_method() method because, as explained in the sample you shared, it is just a helper method that maps the CSV values to dictionaries (which are accepted by beam.io.BigQuerySink).
Then, you can simply do something like:
p = beam.Pipeline(options=PipelineOptions(pipeline_args))

(p
 | 'Read from a File' >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
 | 'String To BigQuery Row' >> beam.Map(lambda s: dict(Message=s))
 | 'Write to BigQuery' >> beam.io.Write(
       beam.io.BigQuerySink(known_args.output, schema='Message:STRING')))

p.run().wait_until_finish()
Note that the only relevant difference is that the 'String To BigQuery Row' mapping no longer needs a complex method; all it does is create a Python dictionary like {'Message': 'This is a message'}, where Message is the name of the column in your BQ table. In this mapping, s is each of the string elements read by the beam.io.ReadFromText transform, and we apply a lambda function to it.
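As a quick illustration of what that lambda does to each line (plain Python, outside of Beam, with made-up sample lines), each input string becomes a row dictionary keyed by the BQ column name:

```python
# Sample lines, as beam.io.ReadFromText would emit them (header already skipped).
lines = ['This is a message', 'This is another message', 'I am writing to BQ']

# The same mapping used in the 'String To BigQuery Row' step.
to_row = lambda s: dict(Message=s)

rows = [to_row(s) for s in lines]
print(rows[0])  # {'Message': 'This is a message'}
```

Each resulting dictionary matches the schema 'Message:STRING', so BigQuerySink can insert it directly.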
To solve this using a CSV file with only one value per row, I had to use:

values = re.split(",", re.sub('\r\n', '', re.sub(u'"', '', string_input)))
row = dict(zip(('Name',), values))

I don't know why I have to put the "," after 'Name', but if I don't, the dict(zip(...)) doesn't work properly.
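The comma matters because zip() iterates its arguments element by element, and a plain string iterates character by character; ('Name',) is a one-element tuple, which keeps the column name whole. This is also why the original dict(zip('Mensaje', values)) produced per-letter field names like 'm' in the BigQuery error. A minimal sketch (using a hypothetical input line):

```python
import re

# A sample raw CSV line with quotes and a CRLF line ending.
string_input = '"hello world"\r\n'
values = re.split(",", re.sub('\r\n', '', re.sub(u'"', '', string_input)))

# Without the comma, 'Name' is iterated as characters, so the key is just 'N'.
broken = dict(zip('Name', values))
print(broken)  # {'N': 'hello world'}

# ('Name',) is a one-element tuple, so zip pairs the whole name with the value.
fixed = dict(zip(('Name',), values))
print(fixed)   # {'Name': 'hello world'}
```

The same fix (dict(zip(('Mensaje',), values))) would make the original parse_method() produce rows matching the 'Mensaje:STRING' schema.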