
Process a file from a Pub/Sub message in Dataflow streaming

I want to deploy a streaming Dataflow job that listens to a Pub/Sub topic.

The Pub/Sub message content looks like this:

{
   "file_path": "gs://my_bucket_name/my_file.csv",
   "transformations": [
      {
         "column_name": "NAME",
         "transformation": "to_upper"
      },
      {
         "column_name": "SURNAME",
         "transformation": "to_lower"
      }
   ]
}
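For reference, this payload can be parsed outside Beam as well; a minimal sketch (Pub/Sub delivers the payload as bytes, which json.loads accepts directly in Python 3.6+):

```python
import json

# Example Pub/Sub payload, as raw bytes
payload = b'''{
   "file_path": "gs://my_bucket_name/my_file.csv",
   "transformations": [
      {"column_name": "NAME", "transformation": "to_upper"},
      {"column_name": "SURNAME", "transformation": "to_lower"}
   ]
}'''

message = json.loads(payload)
print(message["file_path"])             # gs://my_bucket_name/my_file.csv
print(len(message["transformations"]))  # 2
```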

My problem is that I want to process the file specified in the message (file_path) and apply the given transformation to each column of the CSV file.

I have tried several approaches to achieve this, but none of them worked, and I wonder whether this is simply not possible or whether I am missing something.

  1. First attempt:
class ProcessMessage(beam.DoFn):

    def process(self, message):
        from apache_beam.pvalue import TaggedOutput
        try:
            file_path = message.get('file_path')
            yield TaggedOutput('file_path', file_path)
        except Exception as e:
            raise Exception(e)

with beam.Pipeline(options=pipeline_options) as p:
    file_path = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=input_topic,timestamp_attribute='ts')
          | "Parse JSON" >> beam.Map(json.loads)
          | "Process Message" >> beam.ParDo(ProcessMessage).with_outputs('file_path')
    )
    file_content = (
        p
        | "Read file" >> beam.io.ReadFromText(file_path)
    )

This fails with: file_pattern must be of type string or ValueProvider; got <DoOutputsTuple main_tag=None tags=('file_path',) transform=<ParDo(PTransform) label=[ParDo(ProcessMessage)]> at 0x1441f9550> instead

  2. Second attempt -> read the file with a custom CSV reader and return its contents:
class ReadFile(beam.DoFn):

    def process(self, element):
        import csv
        import io as io_file

        from apache_beam import io

        file_path = element.get('file_path')

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(file_path),
            encoding='utf-8'),
            delimiter=';')

        for row in reader:
            yield row

with beam.Pipeline(options=pipeline_options) as p:

    message = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
            topic=pipeline_config.get('input_topic'),
            timestamp_attribute='ts')
        | "Parse JSON" >> beam.Map(json.loads)
        | "Process message" >> beam.ParDo(ProcessMessage())
    )

    file_content = (
        message
        | beam.ParDo(ReadFile())
        | beam.Map(print)
    )

This produces no error, but it also does not print any of the file's rows.

I know this post is a bit long, but I hope someone can help me.

Thanks!

The first solution does not work because ReadFromText takes a string as its argument, e.g. a bucket path such as "gs://bucket/file". In your example, you pass it a PCollection (the result of the previous PTransform), so it fails. Instead, you should use ReadAllFromText, which takes a PCollection as input, i.e. the result of the previous PTransform.

In addition, your code needs a few small modifications:

If a DoFn class returns only one type of output, there is no reason to use TaggedOutput, so let's just return a regular iterator.

class ProcessMessage(beam.DoFn):

    def process(self, message):
        try:
            file_path = message.get('file_path')
            yield file_path 
        except Exception as e:
            raise Exception(e)

Next, ReadAllFromText should be connected to the previous step of the pipeline, not to p:

file_content = (
            p 
            | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=p.options.topic, timestamp_attribute='ts')
            | "Parse JSON" >> beam.Map(json.loads)
            | "Process Message" >> beam.ParDo(ProcessMessage())
            | "Read file" >> beam.io.ReadAllFromText()   
        )

Note that the file_content variable will be a PCollection of elements, where each element is a single line of the CSV file as a string. Applying a transformation per column will therefore be more complicated, because the first element will be the column names, while the following elements will just be single rows with no column names attached.
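To illustrate the problem, here is a plain-Python sketch outside Beam: with ReadAllFromText, each CSV line arrives as an independent string element, and a downstream step has no way to tell the header line apart from a data row (the column names and sample values are made up for illustration):

```python
import csv

# Simulated elements as ReadAllFromText would emit them, one line each
lines = ["NAME;SURNAME", "john;DOE"]

for element in lines:
    # Parsing a single line yields a list of fields, but carries no
    # information about whether this line is the header or a data row.
    fields = next(csv.reader([element], delimiter=';'))
    print(fields)
# ['NAME', 'SURNAME']
# ['john', 'DOE']
```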

Your second attempt seems better suited for this:

class ApplyTransforms(beam.DoFn):

    def process(self, element):

        file_path = element.get('file_path')
        transformations = element.get('transformations')

        with beam.io.gcsio.GcsIO().open(file_path) as file:
            reader = csv.DictReader(io.TextIOWrapper(file, encoding="utf-8"), delimiter=';')
            for row in reader:
                for transform in transformations:
                    col_name = transform.get("column_name")
                    transformation = transform.get("transformation")
                    # apply your transform per row 
                yield row

Something like this can work, but it is probably a better idea to split it into two classes - one for reading and one for applying the transformations :)

Thanks to @Pav3k's answer, I was able to solve the problem. My code is now decoupled and looks like this:

class MyMessage(typing.NamedTuple):
    # Simple way to propagate all the needed information from the Pub/Sub message.
    file_path: str
    transformations: dict


class ProcessMessage(beam.DoFn):

    def process(self, message):
        """
        Example of the Pub/Sub message
        {
            "file_path": "gs://my-bucket/file_to_process.csv",
            "transformations": {
                "col_1": "to_upper",
                "col_2": "to_lower"
            }
        }
        """
        yield MyMessage(file_path=message.get('file_path'), 
                        transformations=message.get('transformations'))


class ReadFile(beam.DoFn):

    def process(self, element: MyMessage):
        import csv
        import io as io_file

        from apache_beam import io

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(element.file_path),
            encoding='utf-8'),
            delimiter=';')

        for row in reader:
            # Yields both the row to process and the transformations.
            yield (row, element.transformations)


class Transform(beam.DoFn):

    def to_upper(self, value):
        return value.upper()

    def to_lower(self, value):
        return value.lower()

    def process(self, element):
        """
        Now I know the transformations for each element, and processing may be parallelized.
        """
        row = element[0]
        transformations = element[1]
        transformed_row = {}
        for key in transformations:
            value = row[key]
            transformation = transformations[key]
            transformed_row[key] = getattr(self, transformation)(value)
        yield transformed_row


def main(argv):

    parser = argparse.ArgumentParser()
    parser.add_argument("--topic_name", required=True)
    app_args, pipeline_args = parser.parse_known_args()
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:

        message = (
            p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
                topic=app_args.topic_name,
                timestamp_attribute='ts')
            | "Parse JSON" >> beam.Map(json.loads)
            | "Process message" >> beam.ParDo(ProcessMessage())
        )

        file_content = (
            message
            | beam.ParDo(ReadFile())
            | beam.ParDo(Transform())
            | beam.Map(print)
        )
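The getattr dispatch inside Transform can also be checked outside Beam; a quick sketch of the same per-row logic, with a hypothetical apply helper standing in for process:

```python
# Same dispatch logic as Transform.process, in plain Python
class Transform:
    def to_upper(self, value):
        return value.upper()

    def to_lower(self, value):
        return value.lower()

    def apply(self, row, transformations):
        # Look up the method named by each transformation and apply it
        # to the matching column value.
        return {key: getattr(self, transformations[key])(row[key])
                for key in transformations}

t = Transform()
result = t.apply({"NAME": "john", "SURNAME": "DOE"},
                 {"NAME": "to_upper", "SURNAME": "to_lower"})
print(result)  # {'NAME': 'JOHN', 'SURNAME': 'DOE'}
```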

