
Record Duplication in BigQuery while Running a DataFlow Job

I'm running an hourly Dataflow job that reads records from a source table, processes them, and writes them to a target table. Since some records may repeat in the source table, we compute a hash value from the record fields of interest, append it to each record read from the source table (in memory), and filter out records whose hash already exists in the target table (the hash value is stored in the target table). This way we aim to avoid duplication across different jobs (triggered at different times). To avoid duplication within the same job, we use the Apache Beam GroupByKey method, with the hash value as the key, and pick only the first element in each group.

However, the duplication in BigQuery still persists. My only hunch is that, with multiple workers handling the same job, they might get out of sync and process the same data, but since I'm using pipelines all the way, this assumption sounds unreasonable (at least to me). Does anyone have an idea why the problem still persists?
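
To illustrate the within-job de-duplication step described above in isolation, here is a minimal, self-contained sketch (mock in-memory records and made-up field names, not the actual job) showing that GroupByKey on the hash value followed by taking the first element of each group collapses duplicates within a single pipeline:

    import apache_beam as beam

    # mock records: the first two share the same hash value
    sample_records = [
        {'HashValue': 'a1', 'Payload': 'first'},
        {'HashValue': 'a1', 'Payload': 'first (duplicate)'},
        {'HashValue': 'b2', 'Payload': 'second'},
    ]

    with beam.Pipeline() as p:  # defaults to the DirectRunner
        (
            p
            | 'Create mock records' >> beam.Create(sample_records)
            | 'Key by hash' >> beam.Map(lambda record: (record['HashValue'], record))
            | 'Group same hashes' >> beam.GroupByKey()
            | 'Take first per hash' >> beam.Map(lambda kv: next(iter(kv[1])))
            | 'Print' >> beam.Map(print)
        )

Run locally, this prints one record per hash value (two records in total), which is the behavior the job relies on.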

Here's the job that creates the duplication:

import apache_beam as beam

with beam.Pipeline(options=options) as p:
    # read the fields of interest from the source table
    records = p | 'Read Records from BigQuery' >> beam.io.ReadFromBigQuery(
        query=read_from_source_query, use_standard_sql=True)

    # step 1 - filter out records that already exist in the target table

    # read the existing hashes from the target table
    hashes = (
        p
        | 'read existing hashes from the target table' >> beam.io.ReadFromBigQuery(
            query=select_hash_value_from_target_table, use_standard_sql=True)
        | 'Get vals' >> beam.Map(lambda row: row['HashValue'])
    )

    # add a hash value to each record and filter out the ones
    # that already exist in the target table
    hashed_records = (
        records
        | 'Add Hash Column in Memory to Each source table Record' >> beam.Map(lambda record: add_hash_field(record))
        | 'Filter Existing Hashes' >> beam.Filter(
            lambda record, hashes: record['HashValue'] not in hashes,
            hashes=beam.pvalue.AsIter(hashes))
    )

    # step 2 - filter duplicated hashes created within the same job
    key_val_records = (
        hashed_records
        | 'Create a Key Value Pair' >> beam.Map(lambda record: (record['HashValue'], record))
    )

    # group elements with the same key and keep only the first one
    unique_hashed_records = (
        key_val_records
        | 'Combine the Same Hashes' >> beam.GroupByKey()
        | 'Get First Element in Collection' >> beam.Map(lambda element: element[1][0])
    )

    records_to_store = unique_hashed_records | 'Create Records to Store' >> beam.ParDo(
        CreateTargetTableRecord(gal_options))

    records_to_store | 'Write to target table' >> beam.io.WriteToBigQuery(target_table)
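
The add_hash_field helper is not shown in the question; a hypothetical sketch of what such a helper might look like, assuming (purely for illustration) that the fields of interest are made-up columns named Id and Payload:

    import hashlib

    def add_hash_field(record):
        # 'Id' and 'Payload' are placeholder column names; the real fields of
        # interest are not shown in the question
        key_string = '|'.join(str(record.get(field)) for field in ('Id', 'Payload'))
        # store a stable hash of those fields alongside the original record
        record['HashValue'] = hashlib.sha256(key_string.encode('utf-8')).hexdigest()
        return record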

As the code above suggests, I expected to have no duplicates in the target table, but I'm still getting duplicate records.

Just to elaborate a possible solution on how to de-duplicate data using RANK(), as @Bruno Volpato suggests.

In the mock data set below, each id may have different versions, denoted by the version column. The insert_time field is when the record was inserted into BigQuery.

Records #2 and #3 are duplicates of each other, so only one of them should be included in the result.
Record #4 is an outdated version of record #2 and should not be included in the result. The query below filters out the redundant duplicate and the outdated version, returning a result with 3 records.

The inner-query approach is used because the main table may be quite complex.

DECLARE id_1 default GENERATE_UUID();
DECLARE id_2 default GENERATE_UUID();
DECLARE id_3 default GENERATE_UUID();

WITH my_data AS (
    SELECT id_1 as id, TIMESTAMP '2016-10-18 2:51:45' as version, TIMESTAMP '2016-10-18 2:51:45.001' as insert_time, 'Some other data #1' as some_other_data
    UNION ALL SELECT id_2, TIMESTAMP '2016-10-18 2:54:11', TIMESTAMP '2016-10-18 2:56:11.002', 'Some other data #2'
    UNION ALL SELECT id_2, TIMESTAMP '2016-10-18 2:54:11', TIMESTAMP '2016-10-18 2:55:11.003', 'Some other data #2'
    UNION ALL SELECT id_2, TIMESTAMP '2016-10-18 1:54:11', TIMESTAMP '2016-10-18 1:56:11.004', 'Some outdated data #2'
    UNION ALL SELECT id_3, TIMESTAMP '2016-10-18 2:59:01', TIMESTAMP '2016-10-18 2:59:01.005', 'Some other data #3'
),
ranked_data AS (
  SELECT
    id,
    version,
    insert_time,
    RANK() OVER (PARTITION BY id ORDER BY version DESC, insert_time ASC) AS record_rank
  FROM
    my_data
)

SELECT
  my_data.*
FROM
  my_data
  JOIN ranked_data ON
    my_data.id = ranked_data.id
    AND my_data.version = ranked_data.version
    AND my_data.insert_time = ranked_data.insert_time
WHERE
  ranked_data.record_rank = 1

If there's no versioning:

SELECT
  my_data.*
FROM
  my_data
  JOIN (
    SELECT
      id,
      MIN(insert_time) AS min_insert_time
    FROM
      my_data
    GROUP BY
      id
  ) AS join_table ON
    join_table.id = my_data.id
    AND join_table.min_insert_time = my_data.insert_time
