Write to a date-partitioned BigQuery table using the beam.io.gcp.bigquery.WriteToBigQuery module in Apache Beam
Dataflow (Apache Beam) can't write on BigQuery
I have a pipeline that, in its last step, must write two records to BigQuery, and I really can't figure out why it doesn't seem to insert anything. There are no errors, the table exists, and it already contains records; in fact, I have to use the TRUNCATE/INSERT mode.
Can someone help me understand why it isn't working as I expect?
Here is my pipeline:
p = beam.Pipeline(options=pipeline_options)
(p
 | 'Read Configuration Table ' >> beam.io.Read(beam.io.BigQuerySource(config['ENVIRONMENT']['configuration_table']))
 | 'Get Files from Server' >> beam.Map(import_file)
 | 'Upload files on Bucket' >> beam.Map(upload_file_on_bucket)
 | 'Set record update' >> beam.Map(set_last_step)
 | 'Update table' >> beam.io.gcp.bigquery.WriteToBigQuery(
       table=config['ENVIRONMENT']['configuration_table'],
       write_disposition='WRITE_TRUNCATE',
       schema='folder:STRING, last_file:STRING')
)
and
def set_last_step(file_list):
    logging.info(msg='UPDATE CONFIGURATION TABLE - working on: ' + str(file_list))
    folder = ''
    if 'original' in file_list:
        if '1951' in file_list:
            folder = '1951'
        else:
            folder = '1952'
        dic = {'folder': folder, 'last_file': file_list['original']}
        logging.info(msg='UPDATE CONFIGURATION TABLE - no work done, reporting original record: ' + str(dic))
    else:
        folder = list(file_list.keys())[0]
        path = list(file_list.values())[0]
        dic = {'folder': folder, 'last_file': path}
        logging.info(msg='UPDATE CONFIGURATION TABLE - work done, reporting new record: ' + str(dic))
        purge(dir=os.path.join(HOME_PATH, 'download'), pattern=folder + "_")
    logging.info(msg='UPDATE CONFIGURATION TABLE - record to be updated: ' + str(dic))
    return dic
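Stripped of the logging and purge side effects, the branching in set_last_step can be exercised on its own. A minimal sketch (route_record is a hypothetical helper; the sample dict mirrors the records shown below):

```python
def route_record(file_list):
    # Mirrors the branching in set_last_step, without logging/purge.
    if 'original' in file_list:
        # No new file was processed: re-emit the original record.
        folder = '1951' if '1951' in file_list else '1952'
        return {'folder': folder, 'last_file': file_list['original']}
    # A new file was processed: the single key is the folder name.
    folder, path = next(iter(file_list.items()))
    return {'folder': folder, 'last_file': path}

print(route_record({'1952': '1952_2019120617.log.gz'}))
# → {'folder': '1952', 'last_file': '1952_2019120617.log.gz'}
```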
The input records to the WriteToBigQuery stage (i.e. the output feeding the 'Update table' stage) are:
{'folder': '1952', 'last_file': '1952_2019120617.log.gz'}
{'folder': '1951', 'last_file': '1951_2019120617.log.gz'}
The debug output from Dataflow is:
2019-12-06 18:09:36 DEBUG Creating or getting table <TableReference
datasetId: 'MYDATASET'
projectId: 'MYPROJECT'
tableId: 'MYTABLE'> with schema {'fields': [{'name': 'folder', 'type': 'STRING', 'mode': 'NULLABLE'}, {'name': 'last_file', 'type': 'STRING', 'mode': 'NULLABLE'}]}.
2019-12-06 18:09:36 DEBUG Created the table with id MYTABLE
2019-12-06 18:09:36 INFO Created table MYPROJECT.MYDATASET.MYTABLE with schema <TableSchema
fields: [<TableFieldSchema
fields: []
mode: 'NULLABLE'
name: 'folder'
type: 'STRING'>, <TableFieldSchema
fields: []
mode: 'NULLABLE'
name: 'last_file'
type: 'STRING'>]>. Result: <Table
creationTime: 1575652176727
etag: '0/GXOOeXPCmYsMfgGNxl2Q=='
id: 'MYPROJECT:MYDATASET.MYTABLE'
kind: 'bigquery#table'
lastModifiedTime: 1575652176766
location: 'EU'
numBytes: 0
numLongTermBytes: 0
numRows: 0
schema: <TableSchema
fields: [<TableFieldSchema
fields: []
mode: 'NULLABLE'
name: 'folder'
type: 'STRING'>, <TableFieldSchema
fields: []
mode: 'NULLABLE'
name: 'last_file'
type: 'STRING'>]>
selfLink: 'https://www.googleapis.com/bigquery/v2/projects/MYPROJECT/datasets/MYDATASET/tables/MYTABLE'
tableReference: <TableReference
datasetId: 'MYDATASET'
projectId: 'MYPROJECT'
tableId: 'MYTABLE'>
type: 'TABLE'>.
2019-12-06 18:09:36 WARNING Sleeping for 150 seconds before the write as BigQuery inserts can be routed to deleted table for 2 mins after the delete and create.
2019-12-06 18:12:06 DEBUG Attempting to flush to all destinations. Total buffered: 2
2019-12-06 18:12:06 DEBUG Flushing data to MYPROJECT:MYDATASET.MYTABLE. Total 2 rows.
2019-12-06 18:12:07 DEBUG Passed: True. Errors are []
In this example I parse XML elements into a DataFrame and push it to Google BigQuery. Hopefully you will find something useful here.
import pandas as pd
import xml.etree.ElementTree as ET
import datetime
import json
import requests
import pandas_gbq
from lxml import etree
# authentication: working now....
login = 'FN.LN@your_email.com'
password = 'your_AS_pswd'
AsOfDate = datetime.datetime.today().strftime('%m-%d-%Y')
#1) SLA=471162: Execute Query
REQUEST_URL = 'https://www.some_data.com'
response = requests.get(REQUEST_URL, auth=(login, password))
xml_data = response.text.encode('utf-8', 'ignore')
#print(response.text)
#tree = etree.parse(xml_data)
root = ET.fromstring(xml_data)
# start collecting root elements and headers for data frame 1
desc = root.get("SLA_Description")
frm = root.get("start_date")
thru = root.get("end_date")
dev = root.get("obj_device")
loc = root.get("locations")
loc = loc[:-1]
df1 = pd.DataFrame([['From:',frm],['Through:',thru],['Object:',dev],['Location:',loc]])
df1.columns = ['SLAs','Analytics']
#print(df1)
# start getting the analytics for data frame 2
data=[['Goal:',root[0][0].text],['Actual:',root[0][1].text],['Compliant:',root[0][2].text],['Errors:',root[0][3].text],['Checks:',root[0][4].text]]
df2 = pd.DataFrame(data)
df2.columns = ['SLAs','Analytics']
#print(df2)
# merge data frame 1 with data frame 2
df3 = df1.append(df2, ignore_index=True)
#print(df3)
# append description and today's date onto data frame
df3['Description'] = desc
df3['AsOfDate'] = AsOfDate
#df3.dtypes
# push from data frame, where data has been transformed, into Google BQ
pandas_gbq.to_gbq(df3, 'website.Metrics', 'your-firm', chunksize=None, reauth=False, if_exists='append', private_key=None, auth_local_webserver=False, table_schema=None, location=None, progress_bar=True, verbose=None)
print('Execute Query, Done!!')
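One caveat about the merge step above: DataFrame.append was removed in pandas 2.0, so on current pandas the df1/df2 merge has to go through pd.concat instead. A minimal sketch (the sample values are made up for illustration):

```python
import pandas as pd

# pd.concat is the drop-in replacement for the removed DataFrame.append.
df1 = pd.DataFrame([['From:', '01-01-2019'], ['Through:', '12-06-2019']],
                   columns=['SLAs', 'Analytics'])
df2 = pd.DataFrame([['Goal:', '99.9'], ['Actual:', '99.7']],
                   columns=['SLAs', 'Analytics'])

# ignore_index=True renumbers the rows 0..3, as append(ignore_index=True) did.
df3 = pd.concat([df1, df2], ignore_index=True)
print(len(df3))  # → 4
```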
The problem is related to the fact that Dataflow uses the streaming insert method on BigQuery, which means the data is not persisted in the table immediately but only after a window... so you just have to wait a few minutes.