Pub/Sub to BigQuery (Batch) using Dataflow (Python)
I have created a streaming Dataflow pipeline in Python and just want to confirm whether the code below does what I intend: stream from Pub/Sub and, every 60 seconds, batch-load the messages into BigQuery.
Here is the code snippet in Python:
import json
from datetime import datetime

import apache_beam as beam
from apache_beam import Map
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners import DataflowRunner

options = PipelineOptions(
    subnetwork=SUBNETWORK,
    service_account_email=SERVICE_ACCOUNT_EMAIL,
    use_public_ips=False,
    streaming=True,
    project=project,
    region=REGION,
    staging_location=STAGING_LOCATION,
    temp_location=TEMP_LOCATION,
    job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
)

p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
    p
    | "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
    | "To Dict" >> Map(json.loads)
    | "Write To BigQuery" >> WriteToBigQuery(
        table=TABLE,
        schema=schema,
        method='FILE_LOADS',
        triggering_frequency=60,
        max_files_per_bundle=1,
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=BigQueryDisposition.WRITE_APPEND,
    )
)
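For completeness, submitting the job is just a run() call on the pipeline built above (a minimal sketch; with DataflowRunner and a streaming job this returns immediately rather than blocking):

# Submit the pipeline to Dataflow; returns a DataflowPipelineResult
# without waiting, since this is a long-running streaming job.
result = p.run()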
May I know whether the code above does what I intend? It streams from Pub/Sub and, every 60 seconds, batch-inserts into BigQuery. I deliberately set max_files_per_bundle to 1 to prevent more than one shard from being created, so that only one file is loaded per minute, but I am not sure I am doing this right. The Java SDK has a withNumFileShards option, but I could not find an equivalent in Python. I referred to the documentation here: https://beam.apache.org/releases/pydoc/2.31.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
Just curious: should I be using windowing to achieve this instead?
# Same imports as above, plus the windowing and trigger classes:
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime

options = PipelineOptions(
    subnetwork=SUBNETWORK,
    service_account_email=SERVICE_ACCOUNT_EMAIL,
    use_public_ips=False,
    streaming=True,
    project=project,
    region=REGION,
    staging_location=STAGING_LOCATION,
    temp_location=TEMP_LOCATION,
    job_name=f"pub-sub-to-big-query-xxx-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
)

p = beam.Pipeline(DataflowRunner(), options=options)
pubsub = (
    p
    | "Read Topic" >> ReadFromPubSub(topic=INPUT_TOPIC)
    | "To Dict" >> Map(json.loads)
    | "Window" >> beam.WindowInto(
        window.FixedWindows(60),
        trigger=AfterProcessingTime(60),
        accumulation_mode=AccumulationMode.DISCARDING,
    )
    | "Write To BigQuery" >> WriteToBigQuery(
        table=TABLE,
        schema=schema,
        method='FILE_LOADS',
        triggering_frequency=60,
        max_files_per_bundle=1,
        create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=BigQueryDisposition.WRITE_APPEND,
    )
)
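As a side note on what FixedWindows(60) does to the element stream, here is a small local sketch (DirectRunner, default event-time trigger rather than the AfterProcessingTime trigger above; the values and timestamps are made up) showing elements being grouped into 60-second windows before any write:

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | "Create" >> beam.Create([0, 30, 60, 90])  # fake event times, in seconds
     | "Stamp" >> beam.Map(lambda t: window.TimestampedValue(t, t))  # attach event-time timestamps
     | "Window" >> beam.WindowInto(window.FixedWindows(60))
     | "Key" >> beam.Map(lambda t: (None, t))
     | "Group" >> beam.GroupByKey()  # one group per 60-second window
     | "Print" >> beam.Map(print))   # two groups: one with 0 and 30, one with 60 and 90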
Is the first approach good enough without the windowing used in the second approach? I am using the first approach right now, but I am not sure whether I am getting multiple loads per minute from multiple files, or whether all the Pub/Sub messages are actually combined into one file and loaded in a single batch.
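One way to check empirically would be to list the recent jobs in the project and count the load jobs per minute (a sketch, assuming the google-cloud-bigquery client library and the same `project` as above; this is run separately, not part of the pipeline):

from google.cloud import bigquery

client = bigquery.Client(project=project)
# Jobs are returned newest first; count how many load jobs land per minute.
for job in client.list_jobs(max_results=50):
    if job.job_type == "load":
        print(job.job_id, job.created)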
Thanks!
Not a Python solution, but I ended up resorting to the Java version:
public static PTransform<PCollection<String>, PCollection<TableRow>> jsonToTableRow() {
    return new JsonToTableRow();
}

private static class JsonToTableRow
        extends PTransform<PCollection<String>, PCollection<TableRow>> {
    @Override
    public PCollection<TableRow> expand(PCollection<String> stringPCollection) {
        return stringPCollection.apply("JsonToTableRow", MapElements.via(
            new SimpleFunction<String, TableRow>() {
                @Override
                public TableRow apply(String json) {
                    try {
                        InputStream inputStream =
                            new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8));
                        return TableRowJsonCoder.of().decode(inputStream, Context.OUTER);
                    } catch (IOException e) {
                        throw new RuntimeException("Unable to parse input", e);
                    }
                }
            }));
    }
}
public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    options.setStreaming(true);
    options.setDiskSizeGb(10);

    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("Read from PubSub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
        .apply(jsonToTableRow())
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows().to(options.getOutputTable())
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withTriggeringFrequency(Duration.standardMinutes(1))
            .withNumFileShards(1)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
}
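One thing worth noting about this variant: as far as I can tell from the BigQueryIO docs, with FILE_LOADS on an unbounded input withTriggeringFrequency is required, and withNumFileShards must be set alongside it, which is what pins this job to one file and one load per minute. Also, unlike the Python snippets above, this uses CREATE_NEVER, so the destination table has to exist already.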