
Beam: CombinePerKey(max) hangs in Dataflow job

I am trying to load data from GCS via Pub/Sub and fetch each user's max level by user_id. The following code runs well with the DirectRunner, but the job hangs at CombinePerKey(max) on Dataflow.

Here is the code:

import json
import time
from datetime import datetime

import apache_beam as beam

class ParseAndFilterFn(beam.DoFn):
    def process(self, element):
        text_line = element.strip()
        try:
            data = json.loads(text_line)
            # Keep only records that have both a user_id and a level.
            if data.get('user_id') and data.get('level'):
                yield {
                    'user': data['user_id'],
                    'level': data['level'],
                    'ts': data['ts']
                }
        except ValueError:
            # Skip lines that are not valid JSON.
            pass

def str2timestamp(t, fmt="%Y-%m-%dT%H:%M:%S.%fZ"):
    # Parse the ts string into a Unix timestamp (seconds).
    return time.mktime(datetime.strptime(t, fmt).timetuple())

class FormatFieldValueFn(beam.DoFn):
    def process(self, element):
        # element is a (key, value) pair from CombinePerKey.
        yield {
            "field": element[0],
            "value": element[1]
        }

...

        raw_event = (
            p
            | "Read Sub Message" >> beam.io.ReadFromPubSub(topic=args.topic)
            | "Convert Message to JSON" >> beam.Map(lambda message: json.loads(message))
            | "Extract File Name" >> beam.ParDo(ExtractFileNameFn())
            | "Read File from GCS" >> beam.io.ReadAllFromText()
        )

        filtered_events = (
            raw_event
            | "ParseAndFilterFn" >> beam.ParDo(ParseAndFilterFn())
        )

        raw_events = (
            filtered_events
            | "AddEventTimestamps" >> beam.Map(lambda elem: beam.window.TimestampedValue(elem, str2timestamp(elem['ts'])))
        )

        window_events = (
            raw_events
            | "UseFixedWindow" >> beam.WindowInto(beam.window.FixedWindows(5 * 60))
        )

        user_max_level = (
            window_events
            | 'Group By User ID' >> beam.Map(lambda elem: (elem['user'], elem['level']))
            | 'Compute Max Level Per User' >> beam.CombinePerKey(max)
        )

        (user_max_level
         | "FormatFieldValueFn" >> beam.ParDo(FormatFieldValueFn())
        )

        p.run().wait_until_finish()        

Then I put a new zip file into GCS; the Dataflow pipeline starts running, but it hangs at Compute Max Level Per User:

[screenshot: Dataflow job graph metrics, stuck at Compute Max Level Per User]

Is there anything I am missing?

The root of the problem likely has to do with watermarks and lateness on the Combine transform (you can read a summary of the concepts here). Watermarks are possibly an issue because you set the timestamps for your elements manually, using beam.Map, even though the watermark has already been set when reading from the PubSub source, since it is an unbounded source that sets its own timestamps. If the manually assigned timestamps disagree with that watermark, windows can wait indefinitely to be considered complete, or their elements can be treated as late, either of which looks like the Combine hanging.
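
One way to see what the pipeline is actually doing is to log the event-time timestamp Beam has assigned to each element and compare it with the ts field. A minimal diagnostic sketch (the DoFn and step name are made up for illustration; beam.DoFn.TimestampParam injects the element's assigned timestamp):

import logging

class LogTimestampFn(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        # Compare Beam's assigned event-time timestamp with the raw ts field.
        logging.info("beam timestamp=%s, ts field=%s", timestamp, element.get('ts'))
        yield element

Inserted right after the AddEventTimestamps step (for example, | "LogTimestamps" >> beam.ParDo(LogTimestampFn())), its output shows up in the Dataflow worker logs.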

The ReadFromPubSub transform has a parameter named timestamp_attribute, which is the intended way to use an attribute timestamp with PubSub. If you set this parameter to ts, then ReadFromPubSub should emit elements with timestamps already set from ts, and the watermark should be set appropriately as well.
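
A minimal sketch of that change, assuming the publisher attaches a ts attribute to each PubSub message (per the Beam documentation, the attribute value must be milliseconds since the Unix epoch or an RFC 3339 string):

raw_event = (
    p
    | "Read Sub Message" >> beam.io.ReadFromPubSub(
        topic=args.topic,
        # Take each element's event timestamp from the message's
        # 'ts' attribute instead of assigning it manually later.
        timestamp_attribute='ts')
    | "Convert Message to JSON" >> beam.Map(lambda message: json.loads(message))
    | "Extract File Name" >> beam.ParDo(ExtractFileNameFn())
    | "Read File from GCS" >> beam.io.ReadAllFromText()
)

With timestamps coming from the source, the manual AddEventTimestamps step should no longer be needed.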

If that doesn't work, there are other things you can look at. Double-checking that the timestamps are being set correctly is a good first step (compare the timestamps of the elements produced by ReadFromPubSub to the value of ts). Another possibility is that setting a trigger on the windows might help. A processing-time trigger, for example, might prevent windows from waiting forever for the watermark to catch up, although it might not be appropriate for the needs of your pipeline. As an extra note, the metrics in your screenshot above can sometimes be unreliable for Python streaming, so if you need to do fine-grained debugging you may have better luck making your transforms output logs you can read through.
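
As a sketch of the trigger idea, the windowing step could be given early, processing-time-based firings so each window emits speculative results even while the watermark is stalled. The 60-second delay and the accumulation mode below are illustrative choices, not requirements:

from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

window_events = (
    raw_events
    | "UseFixedWindow" >> beam.WindowInto(
        beam.window.FixedWindows(5 * 60),
        # Fire early panes on processing time while waiting for the
        # watermark, then fire a final pane when the watermark passes
        # the end of the window.
        trigger=AfterWatermark(early=AfterProcessingTime(60)),
        # A non-default trigger requires an explicit accumulation mode.
        accumulation_mode=AccumulationMode.ACCUMULATING)
)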
