
Using WriteToBigQuery FILE_LOADS in a streaming pipeline just creates a LOT of temporary tables (python SDK)

I have a streaming pipeline that takes messages from pub/sub, parses them, and writes to BigQuery. The challenge is that each message goes to a different event table based on the event property in the message, and they are not ordered.

This means (I believe) that WriteToBigQuery cannot batch the writes efficiently; I am seeing it write messages essentially one at a time, and hence it is running too slowly. I have also tried adding a 60-second window and a GroupByKey / FlatMap to try to group them by destination, with only minimal success at speeding it up.
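For illustration, here is a simplified sketch of that window-plus-regroup idea (the names parsed_records and route_to_table are placeholders, not from my actual pipeline):

    import apache_beam as beam
    from apache_beam.transforms import window

    # Simplified sketch: fixed 60-second windows, then regroup records by their
    # destination table so rows headed for the same table arrive together.
    # `parsed_records` and `route_to_table` are placeholder names.
    regrouped = (
        parsed_records
        | 'Window60s' >> beam.WindowInto(window.FixedWindows(60))
        | 'KeyByTable' >> beam.Map(lambda rec: (route_to_table(rec), rec))
        | 'GroupByTable' >> beam.GroupByKey()
        | 'Ungroup' >> beam.FlatMap(lambda kv: kv[1])
    )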

Using the FILE_LOADS method in WriteToBigQuery with a 60+ second triggering frequency, it APPEARS to work: load jobs are sent, they (at least sometimes) succeed, and I see the data go into the correct tables. BUT the temporary tables that were created never get deleted, so I have hundreds of tables piling up (with names like beam_bq_job_LOAD_AUTOMATIC_JOB_NAME_LOAD_NAME_STEP_756_37417blahblahblah)... which is obviously not sustainable.

Writing via STREAMING_INSERTS works fine, just slowly; this is an attempt to make it more efficient.

If anybody could help me figure out why the tables aren't getting deleted, that would, I think, give me a working, efficient pipeline. I've tried longer triggering frequencies (up to 1 hour), but the same behavior happens.

Here is my main pipeline - again, I don't have any issues with the rest of it; I'm just providing it for context.


    events, non_events = (p 
        | 'ReadData' >> beam.io.ReadFromPubSub(subscription = known_args.input_subscription).with_output_types(bytes)
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'Parse JSON to Dict' >> beam.Map(lambda line: json.loads(line))
        | 'FilterOutNonEvents' >> beam.ParDo(FilterOutNonEvents()).with_outputs('MAIN_OUT', 'non_events')
    )
    
    parsed, missing_tables, _ = (events
        | 'ParseDict' >> beam.ParDo(ParseDict()).with_outputs('MAIN_OUT', 'missing_tables', 'ignore')
    )
    
    results, conversion_errors = (parsed
        | 'ConvertDataTypes' >> beam.ParDo(ConvertDataTypes()).with_outputs('MAIN_OUT', 'error_data')
    )
    
    final = (results
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                table = lambda record: '{project}:{dataset}.{table}'.format(
                    project = known_args.project,
                    dataset = known_args.dataset,
                    table = parse_event_to_dataset_name(patterns, record["event"])),
                schema = lambda tbl: {'fields': [
                    {'name': c.split(':')[0], 'type': c.split(':')[1]}
                    for c in schema_json[tbl.split('.')[-1]].split(',')]},
                create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND,
                method = 'FILE_LOADS',
                triggering_frequency = 60
        )
    )

The table arg is determined from the event property of the message, and the schema arg is simply a reformatted slice of a global variable (initially read from GCS; again, no problems with this using STREAMING_INSERTS).
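To make the schema lambda concrete, here is a small made-up example of what a slice of schema_json looks like and what the lambda expands it into (the table name and columns are invented):

    # Made-up example of the global schema_json variable:
    # table name -> comma-separated 'column:type' pairs.
    schema_json = {'purchase_events': 'event:STRING,user_id:STRING,amount:FLOAT,created_at:TIMESTAMP'}

    def schema_for(tbl):
        # Same transformation as the schema lambda in the pipeline above.
        return {'fields': [{'name': c.split(':')[0], 'type': c.split(':')[1]}
                           for c in schema_json[tbl.split('.')[-1]].split(',')]}

    schema_for('my-project:my_dataset.purchase_events')
    # -> {'fields': [{'name': 'event', 'type': 'STRING'},
    #                {'name': 'user_id', 'type': 'STRING'}, ...]}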

Thank you to anybody who can help. I've been banging my head against this a lot (I'm pretty new to Beam/Dataflow).

When using FILE_LOADS with multiple partitions and/or dynamic destinations, the behavior should be as follows:

'''
2. Multiple partitions and/or Dynamic Destinations:

    When there are multiple partitions of files destined for a single
    destination or when Dynamic Destinations are used, multiple load jobs
    need to be triggered for each partition/destination. Load Jobs are
    triggered to temporary tables, and those are later copied to the actual
    appropriate destination table. This ensures atomicity when only some
    of the load jobs would fail but not other. If any of them fails, then
    copy jobs are not triggered.
'''

From the code it also appears that, after the load jobs, Beam should wait for them to finish, then copy the data from the temp tables and delete them; however, when used with a streaming pipeline, it doesn't seem to complete these steps. In my reproduction using the DirectRunner it didn't even get to the CopyJob. I suggest reporting it to the Apache Beam team here.
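In the meantime, a manual cleanup along these lines should remove the leftover tables. This is only a rough sketch and assumes the temp tables were created in the destination dataset, as the names in your question suggest (the project and dataset IDs are placeholders):

    from google.cloud import bigquery

    # Rough cleanup sketch: delete leftover Beam load-job temp tables.
    # Assumes they live in the destination dataset; adjust project/dataset IDs.
    client = bigquery.Client(project='my-project')
    for table in client.list_tables('my-project.my_dataset'):
        if table.table_id.startswith('beam_bq_job_LOAD'):
            client.delete_table(table.reference)
            print('deleted', table.table_id)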

Nonetheless, for your use case, I would reconsider the load job approach, because you might hit the quotas for load and copy jobs pretty quickly; streaming inserts might be better suited to this scenario and probably provide better insertion performance than load jobs every 60+ seconds.
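For reference, a minimal sketch of that write step switched back to streaming inserts could look like this; table_for and schema_for stand in for the dynamic table/schema lambdas already shown in your question, and batch_size (500 is Beam's default) is spelled out only to show the tuning knob exists:

    import apache_beam as beam

    # Minimal sketch of the same write step using streaming inserts.
    # `results`, `table_for` and `schema_for` are placeholders for the
    # PCollection and lambdas from the question above.
    final = (results
        | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                table = lambda record: table_for(record),
                schema = lambda tbl: schema_for(tbl),
                create_disposition = beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition = beam.io.BigQueryDisposition.WRITE_APPEND,
                method = 'STREAMING_INSERTS',
                batch_size = 500)
    )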
