
Find duplicates in a Dataflow job - Python

I want to create a workflow using this example:

https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloud-composer-examples/composer_dataflow_examples

I want to do exactly the same thing, and I have already created all the scripts, but I need to modify the Dataflow job a bit to check whether there are any duplicate values in the CSV that I want to ingest into BigQuery.

This is the Dataflow code:

    """dataflow_job.py is a Dataflow pipeline which reads a delimited file,
    adds some additional metadata fields and loads the contents to a BigQuery table."""


    import argparse
    import logging
    import ntpath
    import re

    import apache_beam as beam
    from apache_beam.options import pipeline_options


    class RowTransformer(object):
        """A helper class that contains utility methods to parse the delimited file
        and convert every record into a format acceptable to BigQuery. It also contains
        utility methods to add a load_dt and a filename fields as demonstration of
        how records can be enriched as part of the load process."""

        def __init__(self, delimiter, header, filename, load_dt):
            self.delimiter = delimiter

            # Extract the field name keys from the comma separated input.
            self.keys = re.split(',', header)

            self.filename = filename
            self.load_dt = load_dt

        def parse(self, row):
            """This method translates a single delimited record into a dictionary
            which can be loaded into BigQuery. It also adds filename and load_dt
            fields to the dictionary."""

            # Strip out the return characters and quote characters.
            values = re.split(self.delimiter, re.sub(r'[\r\n"]', '', row))

            row = dict(list(zip(self.keys, values)))

            # Add an additional filename field.
            row['filename'] = self.filename

            # Add an additional load_dt field.
            row['load_dt'] = self.load_dt

            return row


    def run(argv=None):
        """The main function which creates the pipeline and runs it."""
        parser = argparse.ArgumentParser()

        # Add the arguments needed for this specific Dataflow job.
        parser.add_argument(
            '--input', dest='input', required=True,
            help='Input file to read.  This can be a local file or '
                'a file in a Google Storage Bucket.')

        parser.add_argument('--output', dest='output', required=True,
                            help='Output BQ table to write results to.')

        parser.add_argument('--delimiter', dest='delimiter', required=False,
                            help='Delimiter to split input records.',
                            default=',')

        parser.add_argument('--fields', dest='fields', required=True,
                            help='Comma separated list of field names.')

        parser.add_argument('--load_dt', dest='load_dt', required=True,
                            help='Load date in YYYY-MM-DD format.')

        known_args, pipeline_args = parser.parse_known_args(argv)
        row_transformer = RowTransformer(delimiter=known_args.delimiter,
                                         header=known_args.fields,
                                         filename=ntpath.basename(known_args.input),
                                         load_dt=known_args.load_dt)

        p_opts = pipeline_options.PipelineOptions(pipeline_args)

        # Initiate the pipeline using the pipeline arguments passed in from the
        # command line.  This includes information including where Dataflow should
        # store temp files, and what the project id is.
        with beam.Pipeline(options=p_opts) as pipeline:
            # Read the file.  This is the source of the pipeline.  All further
            # processing starts with lines read from the file.  We use the input
            # argument from the command line.
            rows = pipeline | "Read from text file" >> beam.io.ReadFromText(known_args.input)

            # This stage of the pipeline translates from a delimited single row
            # input to a dictionary object consumable by BigQuery.
            # It refers to a function we have written.  This function will
            # be run in parallel on different workers using input from the
            # previous stage of the pipeline.
            dict_records = rows | "Convert to BigQuery row" >> beam.Map(
                lambda r: row_transformer.parse(r))

            # This stage of the pipeline writes the dictionary records into
            # an existing BigQuery table. The sink is also configured to truncate
            # the table if it contains any existing records.
            dict_records | "Write to BigQuery" >> beam.io.Write(
                beam.io.BigQuerySink(known_args.output,
                                    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


    if __name__ == '__main__':
        logging.getLogger().setLevel(logging.INFO)
        run()

My experience with Apache Beam is very limited. How could I add another function to check whether there are any duplicates in the values of the dictionaries that my RowTransformer creates?

You can use the Distinct transform to remove duplicates. Details can be found here.

rows = (pipeline
        | "Read from text file" >> beam.io.ReadFromText(known_args.input)
        | "Remove Duplicates" >> beam.Distinct())

dict_records = rows | "Convert to BigQuery row" >> beam.Map(
                lambda r: row_transformer.parse(r))

...
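Note that Distinct applied to the raw lines, as above, only removes rows that are completely identical. If you instead need to drop records that share the same value in a single column, you have to deduplicate after parsing, and the parsed rows are dictionaries, which are not hashable, so Distinct cannot be applied to them directly. A minimal sketch of one way to do it inside the question's pipeline, assuming a hypothetical key column named id, is to key each record by that field, group, and keep only the first record per key:

    # Hedged sketch: assumes each parsed row has a (hypothetical) 'id' field
    # that defines what counts as a duplicate; replace it with your own column.
    deduped_records = (
        dict_records
        | "Key by id" >> beam.Map(lambda row: (row['id'], row))
        | "Group by id" >> beam.GroupByKey()
        | "Keep first per id" >> beam.Map(lambda kv: next(iter(kv[1]))))

deduped_records can then be passed to the "Write to BigQuery" step in place of dict_records.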

Beam provides a drop-duplicates method that works like the equivalents in other frameworks (Spark, pandas, ...): Method link

So you can do:

rows = (pipeline
        | "Read from text file" >> beam.io.ReadFromText(known_args.input)
        | "Remove Duplicates" >> beam.RemoveDuplicates())
