
Streaming data to Google Cloud Storage from PubSub using Cloud Dataflow

I am listening for data from Pub/Sub in a streaming Dataflow pipeline. I then need to upload the data to storage, process it, and upload it to BigQuery.

Here is my code:

public class BotPipline {

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(MY_PROJECT);
        options.setStagingLocation(MY_STAGING_LOCATION);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        // Read from the Pub/Sub subscription as an unbounded PCollection.
        PCollection<String> input = pipeline
            .apply(PubsubIO.Read.maxNumRecords(1).subscription(MY_SUBSCRIPTION));

        // Write the raw messages to GCS.
        input.apply(TextIO.Write.to(MY_STORAGE_LOCATION));

        // Process the messages and write the result to BigQuery.
        input
            .apply(someDataProcessing(...).named("update json"))
            .apply(convertToTableRow(...).named("convert json to table row"))
            .apply(BigQueryIO.Write.to(MY_BQ_TABLE).withSchema(tableSchema));

        pipeline.run();
    }
}

When I run the code with the write to storage commented out, the code works well. But when I try uploading to BigQuery I get this error (which is expected):

Write can only be applied to a Bounded PCollection

I am not using a bounded collection, since I need this to run all the time and I need the data to be uploaded immediately. Any solution?

EDIT: this is my desired behavior:

I am receiving messages via Pub/Sub. Each message should be stored as raw data in its own file in GCS; I then execute some processing on the data and save it to BigQuery, with the file name included in the data.

Data should be visible in BQ immediately after it is received. Example:

data published to pubsub : {a:1, b:2} 
data saved to GCS file UUID: A1F432 
data processing :  {a:1, b:2} -> 
                   {a:11, b: 22} -> 
                   {fileName: A1F432, data: {a:11, b: 22}} 
data in BQ : {fileName: A1F432, data: {a:11, b: 22}} 

The idea is that the processed data is stored in BQ with a link to the raw data stored in GCS.
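
For illustration, here is a minimal sketch of what the conversion step could look like, assuming the Dataflow SDK 1.x used above and assuming the processing step emits KV pairs of file name and processed JSON string (the class name and the BQ field names here are hypothetical):

import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.values.KV;

class ConvertToTableRowFn extends DoFn<KV<String, String>, TableRow> {

    // Schema matching the rows produced below; pass it to BigQueryIO.Write.withSchema(...).
    static TableSchema tableSchema() {
        return new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("fileName").setType("STRING"),
            new TableFieldSchema().setName("data").setType("STRING")));
    }

    @Override
    public void processElement(ProcessContext c) {
        // Link the processed payload back to the GCS file that holds the raw message.
        c.output(new TableRow()
            .set("fileName", c.element().getKey())   // e.g. "A1F432"
            .set("data", c.element().getValue()));   // e.g. "{a:11, b: 22}"
    }
}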

Currently we don't support writing unbounded collections in TextIO.Write. See the related question.

Could you clarify what you would like the behavior of unbounded TextIO.Write to be? E.g. would you like to have one constantly growing file, or one file per window that is closed when the window closes, or something else? Or does it only matter to you that the total contents of the files written will eventually contain all the Pub/Sub messages, and it doesn't matter how the files are structured?

As a workaround, you can implement writing to GCS as your own DoFn, using IOChannelFactory to interact with GCS (in fact, TextIO.Write is, under the hood, just a composite transform that a user could have written themselves from scratch).
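
A minimal sketch of such a DoFn, assuming Dataflow SDK 1.x (where IOChannelUtils and IOChannelFactory live in com.google.cloud.dataflow.sdk.util); the output prefix and the idea of emitting the generated file name for downstream steps are assumptions made to match the desired behavior above:

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;

class WriteToGcsFn extends DoFn<String, KV<String, String>> {

    private final String outputPrefix;  // e.g. "gs://my-bucket/raw" (hypothetical)

    WriteToGcsFn(String outputPrefix) {
        this.outputPrefix = outputPrefix;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        // Write each message to its own GCS object, named by a fresh UUID.
        String fileName = UUID.randomUUID().toString();
        String spec = outputPrefix + "/" + fileName;
        try (WritableByteChannel channel =
                IOChannelUtils.getFactory(spec).create(spec, "text/plain")) {
            channel.write(ByteBuffer.wrap(c.element().getBytes(StandardCharsets.UTF_8)));
        }
        // Emit the file name together with the raw message so the BQ row can reference it.
        c.output(KV.of(fileName, c.element()));
    }
}

The pipeline could then apply ParDo.of(new WriteToGcsFn(...)) in place of TextIO.Write and feed the resulting KVs into the processing step.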

You can access the window of the data using the optional BoundedWindow parameter on @ProcessElement. I'd be able to provide more advice if you explain the desired behavior.
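
For instance, a sketch of the new-style DoFn taking the window as an extra parameter; the imports below are the Apache Beam package names, since the @ProcessElement-style DoFn and its exact packages depend on the SDK version in use, and the naming scheme is only an illustration:

import java.util.UUID;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

class WindowedNameFn extends DoFn<String, String> {

    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        // Fold the element's window into the object path, so that each window
        // gets its own set of files, e.g. "raw/<window>/<uuid>".
        c.output("raw/" + window.toString() + "/" + UUID.randomUUID());
    }
}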
