
Apache Beam: How to solve "ParDo requires a deterministic key coder in order to use state and timers" while using the Deduplicate function

I'm trying to deduplicate input messages from Google Cloud Pub/Sub using the Deduplicate transform of Apache Beam. However, I run into an error after creating a KV<String, MyModel> pair and passing it to the Deduplicate transform.

Error:

ParDo requires a deterministic key coder in order to use state and timers

Code:

PCollection<KV<String, MyModel>> deduplicatedEvents =
    messages
        .apply(
            "CreateKVPairs",
            ParDo.of(
                new DoFn<MyModel, KV<String, MyModel>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    c.output(KV.of(c.element().getUniqueKey(), c.element()));
                  }
                }))
        .apply(
            "Deduplicate",
            Deduplicate.<KV<String, MyModel>>values());

How should I create a deterministic coder that can encode/decode a String as the key, to make this work?

Any input would be really helpful.

The Deduplicate transform works by putting the whole element into the key and then doing a key grouping operation (in this case a stateful ParDo). Because Beam is language-independent, grouping by key is done using the encoded form of elements. Two elements that encode to the same bytes are "equal", while two elements that encode to different bytes are "unequal".
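
To make the "equality on encoded bytes" idea concrete, here is a small plain-Java sketch. It is only a conceptual model and not Beam's actual implementation: the class name EncodedDedupSketch and the method dedupByEncoding are made up for illustration, and the encoder function stands in for a Beam coder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration class; not part of Apache Beam.
final class EncodedDedupSketch {

  /** Keeps the first element seen for each distinct encoded form. */
  static <T> List<T> dedupByEncoding(List<T> input, Function<T, byte[]> encoder) {
    Set<ByteBuffer> seen = new HashSet<>();
    List<T> out = new ArrayList<>();
    for (T element : input) {
      // ByteBuffer.wrap gives equals/hashCode based on the byte content,
      // so two elements count as "equal" exactly when their bytes match.
      if (seen.add(ByteBuffer.wrap(encoder.apply(element)))) {
        out.add(element);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> ids = List.of("a", "b", "a", "c", "b");
    List<String> unique =
        dedupByEncoding(ids, s -> s.getBytes(StandardCharsets.UTF_8));
    System.out.println(unique); // prints [a, b, c]
  }
}
```

If the encoder produced different bytes for two elements that Java considers equal, both would survive this loop, which is the failure mode a deterministic coder rules out.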

A deterministic coder is a concept about how equality in a language (like Java) relates to Beam equality. It means that if two Java objects are equal according to Java equals(), then they must have the same encoded bytes. For simple data like strings, numbers, and arrays, this is easy. It is helpful to think about what makes a coder non-deterministic. For example, two Map instances may be equals() at the Java level, but their key-value pairs may be encoded in a different order, making them unequal for Beam.
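
The Map case can be reproduced without Beam. The sketch below uses a hand-written encoding (an assumption for illustration, not the byte format of any Beam coder) to show two maps that are equals() in Java yet encode to different bytes, and how fixing the entry order removes the problem. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Apache Beam.
final class MapEncodingSketch {

  /** Writes entries in whatever order the map iterates them. */
  static byte[] encodeInIterationOrder(Map<String, Integer> map) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(map.size());
      for (Map.Entry<String, Integer> entry : map.entrySet()) {
        out.writeUTF(entry.getKey());
        out.writeInt(entry.getValue());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Sorts entries by key first, so equal maps always yield equal bytes. */
  static byte[] encodeSorted(Map<String, Integer> map) {
    return encodeInIterationOrder(new TreeMap<>(map));
  }

  public static void main(String[] args) {
    Map<String, Integer> first = new LinkedHashMap<>();
    first.put("x", 1);
    first.put("y", 2);
    Map<String, Integer> second = new LinkedHashMap<>();
    second.put("y", 2);
    second.put("x", 1);

    // Same contents, so Java considers the maps equal.
    System.out.println(first.equals(second)); // true
    // Different insertion order, so the naive encodings differ.
    System.out.println(Arrays.equals(
        encodeInIterationOrder(first), encodeInIterationOrder(second))); // false
    // Sorting by key makes the encoding deterministic.
    System.out.println(Arrays.equals(
        encodeSorted(first), encodeSorted(second))); // true
  }
}
```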

If you have a nondeterministic coder for MyModel, then Deduplicate will not work correctly and you will end up with duplicates, because Beam considers the differently encoded objects to be unequal.

Probably the easiest way to automatically get a high-quality deterministic coder is to leverage Beam's schema inference (see https://beam.apache.org/documentation/programming-guide/#schemas-for-pl-types). You will need to ensure that all the fields can also be encoded deterministically.
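
As a rough sketch of what opting into schema inference might look like: the question does not show MyModel, so every field below is hypothetical, and only the uniqueKey accessor is taken from the question's code. It assumes the Beam Java SDK is on the classpath.

```java
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// Hypothetical shape for MyModel; the real fields are not shown in the question.
// JavaFieldSchema infers the schema from the public fields of the class.
@DefaultSchema(JavaFieldSchema.class)
public class MyModel {
  public String uniqueKey;      // assumed field behind getUniqueKey()
  public String payload;        // assumed field
  public long eventTimeMillis;  // assumed field

  // No-argument constructor so Beam can instantiate the class.
  public MyModel() {}

  public String getUniqueKey() {
    return uniqueKey;
  }
}
```

Whether the inferred coder is picked up automatically for the value inside a KV may depend on the Beam version and on how the pipeline is built, so setting the coder explicitly on the PCollection may still be needed. To check any coder, you can call its verifyDeterministic() method, which throws Coder.NonDeterministicException when the coder is not deterministic.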
