
How do I set the coder for a PCollection<List<String>> in Apache Beam?

I'm teaching myself Apache Beam, specifically for parsing JSON. I was able to create a simple example that parsed JSON to a POJO and the POJO to CSV. It required that I use .setCoder() for my simple POJO class.

        pipeline
            .apply("Read source JSON file.", TextIO.read().from(options.getInput()))
            .apply("Parse to POJO matching schema", ParseJsons.of(Person.class))
            .setCoder(SerializableCoder.of(Person.class))
            .apply("Create comma delimited string", new PersonToCsvRow())
            .apply("Write out to file", TextIO.write().to(options.getOutput())
                .withoutSharding());

The problem

Now I am trying to skip the POJO step and parse with some custom transforms instead. My pipeline looks like this:

        pipeline
            .apply("Read Json", TextIO.read().from("src/main/resources/family_tree.json"))
            .apply("Traverse Json tree", new JSONTreeToPaths())
            .apply("Format tree paths", new PathsToCSV())
            .apply("Write to CSV", TextIO.write().to("src/main/resources/paths.csv")
                .withoutSharding());

This pipeline is supposed to take a heavily nested JSON structure and print each individual path through the tree. I'm getting the same error I did in the POJO example above:

Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Traverse Json tree/MapElements/Map/ParMultiDo(Anonymous).output [PCollection@331122245]. Correct one of the following root causes:
  No Coder has been manually specified;  you may do so using .setCoder().

What I tried

So I tried to add a coder in a few different ways:

.setCoder(SerializableCoder.of(List<String>.class))

This results in "Cannot select from parameterized type". I found another instance of this error generated by a different use case here, but the accepted answer seemed to apply only to that use case.
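For context on that compiler message: Java erases generic type arguments at runtime, so class literals exist only for raw types and `List<String>.class` is not valid syntax. That is why a `Class`-based factory cannot be handed a parameterized type directly. A minimal stdlib-only sketch of the effect follows; the class and method names are hypothetical and nothing here uses Beam.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Beam or of the pipeline above.
final class ErasureDemo {

    // True when both lists are backed by the same runtime class.
    static boolean sameRuntimeClass(List<?> a, List<?> b) {
        return a.getClass() == b.getClass();
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();

        // Class<?> c = List<String>.class;  // does not compile: no class literal for a parameterized type

        // The element type is gone at runtime, so both lists share one Class object.
        System.out.println(sameRuntimeClass(strings, numbers)); // prints "true"
    }
}
```

Because the element type is not recoverable from the `Class` object, APIs that need it usually take the element information separately, for example by composing a list coder from an element coder.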

So then I started perusing the Beam docs and found ListCoder.of(), which has (literally) no description. But it looked promising, so I tried it:

.setCoder(ListCoder.of(SerializableCoder.of(String.class)))

But this takes me back to the initial error of not having manually set a coder.

The question

How do I satisfy this requirement to set a coder for a List<String> object?

Code

The transform that is causing the setCoder error is this one:

package transforms;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

import java.util.ArrayList;
import java.util.List;

public class JSONTreeToPaths extends PTransform<PCollection<String>, PCollection<List<String>>> {

    public static class ExtractPathsFromTree extends SimpleFunction<JsonNode, List<String>> {
        public List<String> apply(JsonNode root) {
            List<String> pathContainer = new ArrayList<>();
            getPaths(root, "", pathContainer);
            return pathContainer;
        }
    }

    public static class GetRootNode extends SimpleFunction<String, JsonNode> {
        public JsonNode apply(String jsonString) {
            try {
                return getRoot(jsonString);
            } catch (JsonProcessingException e) {
                e.printStackTrace();
                return null;
            }
        }
    }

    @Override
    public PCollection<List<String>> expand(PCollection<String> input) {
        return input
            .apply(MapElements.via(new GetRootNode()))
            .apply(MapElements.via(new ExtractPathsFromTree()));
    }

    private static JsonNode getRoot(String jsonString) throws JsonProcessingException {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.readTree(jsonString);
    }

    private static void getPaths(JsonNode node, String currentPath, List<String> paths) {
        //check if leaf:
        if (node.path("children").isMissingNode()) {
            currentPath += node.get("Id");
            paths.add(currentPath);
            System.out.println(currentPath);
            return;
        }

        // recursively iterate over children
        currentPath += (node.get("Id") + ",");
        for (JsonNode child : node.get("children")) {
            getPaths(child, currentPath, paths);
        }
    }
}



While the error message seems to imply that the list of strings is what needs encoding, it is actually the JsonNode. I just had to read a little further down in the error message, as the opening statement is a bit misleading about where the issue is:

Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Traverse Json tree/MapElements/Map/ParMultiDo(Anonymous).output [PCollection@1324829744]. 
...
...
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder 
for com.fasterxml.jackson.databind.JsonNode.
Building a Coder using a registered CoderProvider failed.

Once I discovered this, I solved the problem by extending Beam's CustomCoder class. This abstract class is nice because you only have to write the code to serialize and deserialize the object:

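import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class JsonNodeCoder extends CustomCoder<JsonNode> {

    // ObjectMapper is thread-safe once configured, so one shared instance is enough.
    private static final ObjectMapper MAPPER = new ObjectMapper();
    // Delegating to StringUtf8Coder writes length-prefixed UTF-8, so decode()
    // consumes only this element's bytes instead of reading to end of stream.
    private static final StringUtf8Coder STRING_CODER = StringUtf8Coder.of();

    @Override
    public void encode(JsonNode node, OutputStream outStream) throws IOException {
        // Serialize the tree to a JSON string, then let the string coder frame it.
        STRING_CODER.encode(MAPPER.writeValueAsString(node), outStream);
    }

    @Override
    public JsonNode decode(InputStream inStream) throws IOException {
        // Read back exactly one JSON string and parse it into a tree.
        return MAPPER.readTree(STRING_CODER.decode(inStream));
    }
}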
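One caveat worth keeping in mind for coders of this kind: Beam may write several encoded values back to back on the same stream (for example inside a KV or a list), so decode should consume only the bytes that its own encode wrote. Reading until end of stream is only safe when the value is the last thing on the stream. A stdlib-only sketch of length-prefixed framing that illustrates the idea is shown here; the class and method names are hypothetical, and the exact wire format Beam's own coders use is not assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Beam class: frames each string with its byte length.
final class LengthPrefixedText {

    // Write a 4-byte length followed by the UTF-8 bytes of the text.
    static void encode(String text, OutputStream out) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(bytes.length);
        data.write(bytes);
        data.flush();
    }

    // Read exactly one framed value and leave the rest of the stream untouched.
    static String decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] bytes = new byte[data.readInt()];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Two JSON documents written one after the other on a single stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode("{\"Id\":1}", out);
        encode("{\"Id\":2}", out);

        InputStream in = new ByteArrayInputStream(out.toByteArray());
        System.out.println(decode(in)); // prints {"Id":1}
        System.out.println(decode(in)); // prints {"Id":2}
    }
}
```

With framing like this, two values on one stream decode independently, whereas a decode that reads to end of stream would swallow both documents as a single input.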

Hope this helps some other Beam newbie out there.
