
Process a file from a Pub/Sub message in Dataflow streaming

I want to deploy a streaming Dataflow job that listens to a Pub/Sub topic.

The Pub/Sub message content looks like this:

{
   "file_path": "gs://my_bucket_name/my_file.csv",
   "transformations": [
      {
         "column_name": "NAME",
         "transformation": "to_upper"
      },
      {
         "column_name": "SURNAME",
         "transformation": "to_lower"
      }
   ]
}
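For reference, this payload can be parsed outside Beam as well; a minimal sketch (Pub/Sub delivers the payload as bytes, which json.loads accepts directly in Python 3.6+):

```python
import json

# Example Pub/Sub payload, as raw bytes
payload = b'''{
   "file_path": "gs://my_bucket_name/my_file.csv",
   "transformations": [
      {"column_name": "NAME", "transformation": "to_upper"},
      {"column_name": "SURNAME", "transformation": "to_lower"}
   ]
}'''

message = json.loads(payload)
print(message["file_path"])             # gs://my_bucket_name/my_file.csv
print(len(message["transformations"]))  # 2
```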

My problem is that I want to process the file specified in the message (file_path) and apply the given transformation to each column of the CSV file.

I have tried several approaches to achieve this, but none of them worked, and I wonder whether this is simply not possible or whether I am missing something.

  1. First attempt:
class ProcessMessage(beam.DoFn):

    def process(self, message):
        from apache_beam.pvalue import TaggedOutput
        try:
            file_path = message.get('file_path')
            yield TaggedOutput('file_path', file_path)
        except Exception as e:
            raise Exception(e)

with beam.Pipeline(options=pipeline_options) as p:
    file_path = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=input_topic,timestamp_attribute='ts')
          | "Parse JSON" >> beam.Map(json.loads)
          | "Process Message" >> beam.ParDo(ProcessMessage).with_outputs('file_path')
    )
    file_content = (
        p
        | "Read file" >> beam.io.ReadFromText(file_path)
    )

This fails with: file_pattern must be of type string or ValueProvider; got <DoOutputsTuple main_tag=None tags=('file_path',) transform=<ParDo(PTransform) label=[ParDo(ProcessMessage)]> at 0x1441f9550> instead

  2. Second attempt -> read the file with a custom CSV reader and return its contents:
class ReadFile(beam.DoFn):

    def process(self, element):
        import csv
        import io as io_file

        from apache_beam import io

        file_path = element.get('file_path')

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(file_path),
            encoding='utf-8'),
            delimiter=';')

        for row in reader:
            yield row

with beam.Pipeline(options=pipeline_options) as p:

    message = (
        p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
            topic=pipeline_config.get('input_topic'),
            timestamp_attribute='ts')
        | "Parse JSON" >> beam.Map(json.loads)
        | "Process message" >> beam.ParDo(ProcessMessage())
    )

    file_content = (
        message
        | beam.ParDo(ReadFile())
        | beam.Map(print)
    )

This produces no error, but it also does not print any of the file's rows.

I know this post is a bit long, but I hope someone can help me.

Thanks!

The first solution does not work because ReadFromText takes a string as its argument, e.g. a bucket path such as "gs://bucket/file". In your example, you pass it a PCollection (the result of the previous PTransform), so it fails. Instead, you should use ReadAllFromText, which takes a PCollection as input, i.e. the result of the previous PTransform.

In addition, your code needs a few small modifications:

If a DoFn class returns only one type of output, there is no reason to use TaggedOutput, so let's just return a regular iterator.

class ProcessMessage(beam.DoFn):

    def process(self, message):
        try:
            file_path = message.get('file_path')
            yield file_path 
        except Exception as e:
            raise Exception(e)

Next, ReadAllFromText should be connected to the previous step of the pipeline, not to p:

file_content = (
            p 
            | "Read from Pubsub" >> beam.io.ReadFromPubSub(topic=p.options.topic, timestamp_attribute='ts')
            | "Parse JSON" >> beam.Map(json.loads)
            | "Process Message" >> beam.ParDo(ProcessMessage())
            | "Read file" >> beam.io.ReadAllFromText()   
        )

Note that the file_content variable will be a PCollection of elements, where each element is a single line of the CSV file as a string. Applying a transformation per column will therefore be more complicated, because the first element will be the column names, while the following elements will just be single rows with no column names attached.
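To illustrate the problem, here is a plain-Python sketch outside Beam: with ReadAllFromText, each CSV line arrives as an independent string element, and a downstream step has no way to tell the header line apart from a data row (the column names and sample values are made up for illustration):

```python
import csv

# Simulated elements as ReadAllFromText would emit them, one line each
lines = ["NAME;SURNAME", "john;DOE"]

for element in lines:
    # Parsing a single line yields a list of fields, but carries no
    # information about whether this line is the header or a data row.
    fields = next(csv.reader([element], delimiter=';'))
    print(fields)
# ['NAME', 'SURNAME']
# ['john', 'DOE']
```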

Your second attempt seems better suited for this:

class ApplyTransforms(beam.DoFn):

    def process(self, element):

        file_path = element.get('file_path')
        transformations = element.get('transformations')

        with beam.io.gcsio.GcsIO().open(file_path) as file:
            reader = csv.DictReader(io.TextIOWrapper(file, encoding="utf-8"), delimiter=';')
            for row in reader:
                for transform in transformations:
                    col_name = transform.get("column_name")
                    transformation = transform.get("transformation")
                    # apply your transform per row 
                yield row

Something like this can work, but it is probably a better idea to split it into two classes - one for reading and one for applying the transformations :)

Thanks to @Pav3k's answer, I was able to solve the problem. My code is now decoupled and looks like this:

class MyMessage(typing.NamedTuple):
    # Simple way to propagate all the needed information from the Pub/Sub message.
    file_path: str
    transformations: dict


class ProcessMessage(beam.DoFn):

    def process(self, message):
        """
        Example of the Pub/Sub message
        {
            "file_path": "gs://my-bucket/file_to_process.csv",
            "transformations": {
                "col_1": "to_upper",
                "col_2": "to_lower"
            }
        }
        """
        yield MyMessage(file_path=message.get('file_path'), 
                        transformations=message.get('transformations'))


class ReadFile(beam.DoFn):

    def process(self, element: MyMessage):
        import csv
        import io as io_file

        from apache_beam import io

        reader = csv.DictReader(io_file.TextIOWrapper(
            io.filesystems.FileSystems.open(element.file_path),
            encoding='utf-8'),
            delimiter=';')

        for row in reader:
            # Yields both the row to process and the transformations.
            yield (row, element.transformations)


class Transform(beam.DoFn):

    def to_upper(self, value):
        return value.upper()

    def to_lower(self, value):
        return value.lower()

    def process(self, element):
        """
        Now I know the transformations for each element, and processing may be parallelized.
        """
        row = element[0]
        transformations = element[1]
        transformed_row = {}
        for key in transformations:
            value = row[key]
            transformation = transformations[key]
            transformed_row[key] = getattr(self, transformation)(value)
        yield transformed_row


def main(argv):

    parser = argparse.ArgumentParser()
    parser.add_argument("--topic_name", required=True)
    app_args, pipeline_args = parser.parse_known_args()
    pipeline_options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=pipeline_options) as p:

        message = (
            p | "Read from Pubsub" >> beam.io.ReadFromPubSub(
                topic=app_args.topic_name,
                timestamp_attribute='ts')
            | "Parse JSON" >> beam.Map(json.loads)
            | "Process message" >> beam.ParDo(ProcessMessage())
        )

        file_content = (
            message
            | beam.ParDo(ReadFile())
            | beam.ParDo(Transform())
            | beam.Map(print)
        )
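The getattr dispatch inside Transform can also be checked outside Beam; a quick sketch of the same per-row logic, with a hypothetical apply helper standing in for process:

```python
# Same dispatch logic as Transform.process, in plain Python
class Transform:
    def to_upper(self, value):
        return value.upper()

    def to_lower(self, value):
        return value.lower()

    def apply(self, row, transformations):
        # Look up the method named by each transformation and apply it
        # to the matching column value.
        return {key: getattr(self, transformations[key])(row[key])
                for key in transformations}

t = Transform()
result = t.apply({"NAME": "john", "SURNAME": "DOE"},
                 {"NAME": "to_upper", "SURNAME": "to_lower"})
print(result)  # {'NAME': 'JOHN', 'SURNAME': 'DOE'}
```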

