How do I set the coder for a PCollection<List<String>> in Apache Beam?
I'm teaching myself Apache Beam, specifically for use in parsing JSON. I was able to create a simple example that parsed JSON to a POJO and the POJO to CSV. It required that I use .setCoder() for my simple POJO class.
pipeline
    .apply("Read source JSON file.", TextIO.read().from(options.getInput()))
    .apply("Parse to POJO matching schema", ParseJsons.of(Person.class))
    .setCoder(SerializableCoder.of(Person.class))
    .apply("Create comma delimited string", new PersonToCsvRow())
    .apply("Write out to file", TextIO.write().to(options.getOutput())
        .withoutSharding());
Now I am trying to skip the POJO step of parsing by using some custom transforms. My pipeline looks like this:
pipeline
    .apply("Read Json", TextIO.read().from("src/main/resources/family_tree.json"))
    .apply("Traverse Json tree", new JSONTreeToPaths())
    .apply("Format tree paths", new PathsToCSV())
    .apply("Write to CSV", TextIO.write().to("src/main/resources/paths.csv")
        .withoutSharding());
This pipeline is supposed to take a heavily nested JSON structure and print each individual path through the tree. I'm getting the same error I did in the POJO example above:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Traverse Json tree/MapElements/Map/ParMultiDo(Anonymous).output [PCollection@331122245]. Correct one of the following root causes:
No Coder has been manually specified; you may do so using .setCoder().
So I tried to add a coder in a few different ways:
.setCoder(SerializableCoder.of(List<String>.class))
This results in "Cannot select from parameterized type". I found another question where a different use case produced this error, but the accepted answer seemed to apply only to that use case.
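The compiler rejects `List<String>.class` because a class literal cannot carry type arguments: generic parameters are erased, so only one `List` class exists at runtime. The sketch below uses only the standard library to illustrate this. The class name `ErasureDemo` is made up for the illustration and is not part of Beam.

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    // Returns true when two differently parameterized lists share one runtime class.
    public static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Integer> numbers = new ArrayList<>();
        return strings.getClass() == numbers.getClass();
    }

    public static void main(String[] args) {
        // Writing List<String>.class here would not compile;
        // only the raw class literal is available.
        Class<?> raw = List.class;
        System.out.println(raw.getName());      // prints java.util.List
        System.out.println(sameRuntimeClass()); // prints true
    }
}
```

This is why a coder for a generic type is built by composing coders, as in the ListCoder.of(...) attempt that follows, rather than from a class literal.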
So then I started perusing the Beam docs and found ListCoder.of(), which has (literally) no description. But it looked promising, so I tried it:
.setCoder(ListCoder.of(SerializableCoder.of(String.class)))
But this takes me back to the initial error of not having manually set a coder.
How do I satisfy this requirement to set a coder for a List<String> object?
The transform that is causing the setCoder error is this one:
package transforms;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

import java.util.ArrayList;
import java.util.List;

public class JSONTreeToPaths extends PTransform<PCollection<String>, PCollection<List<String>>> {

    public static class ExtractPathsFromTree extends SimpleFunction<JsonNode, List<String>> {
        public List<String> apply(JsonNode root) {
            List<String> pathContainer = new ArrayList<>();
            getPaths(root, "", pathContainer);
            return pathContainer;
        }
    }

    public static class GetRootNode extends SimpleFunction<String, JsonNode> {
        public JsonNode apply(String jsonString) {
            try {
                return getRoot(jsonString);
            } catch (JsonProcessingException e) {
                e.printStackTrace();
                return null;
            }
        }
    }

    @Override
    public PCollection<List<String>> expand(PCollection<String> input) {
        return input
            .apply(MapElements.via(new GetRootNode()))
            .apply(MapElements.via(new ExtractPathsFromTree()));
    }

    private static JsonNode getRoot(String jsonString) throws JsonProcessingException {
        ObjectMapper mapper = new ObjectMapper();
        return mapper.readTree(jsonString);
    }

    private static void getPaths(JsonNode node, String currentPath, List<String> paths) {
        // check if leaf:
        if (node.path("children").isMissingNode()) {
            currentPath += node.get("Id");
            paths.add(currentPath);
            System.out.println(currentPath);
            return;
        }
        // recursively iterate over children
        currentPath += (node.get("Id") + ",");
        for (JsonNode child : node.get("children")) {
            getPaths(child, currentPath, paths);
        }
    }
}
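The recursion in getPaths can be tried out without Beam or Jackson. The following is a minimal sketch of the same root-to-leaf traversal using only the standard library. `TreePaths` and its `Node` class are hypothetical stand-ins for a JSON object with an "Id" and optional "children". One difference from the transform above is an assumption of the sketch: a node with an empty child list is treated as a leaf, whereas the original checks whether the "children" field is missing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TreePaths {
    // Hypothetical stand-in for a JSON node with an "Id" and optional "children".
    public static final class Node {
        final String id;
        final List<Node> children;

        public Node(String id, Node... children) {
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    // Collects one comma-separated path per leaf, in depth-first order.
    public static List<String> paths(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        if (node.children.isEmpty()) {
            // Leaf: the accumulated prefix plus this id is a complete path.
            out.add(prefix + node.id);
            return;
        }
        for (Node child : node.children) {
            collect(child, prefix + node.id + ",", out);
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("1", new Node("2", new Node("4")), new Node("3"));
        System.out.println(paths(tree)); // prints [1,2,4, 1,3]
    }
}
```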
While the error message seems to imply that the list of strings is what needs encoding, it is actually the JsonNode, the output type of the intermediate GetRootNode step. I just had to read a little further down in the error message, as the opening statement is a bit misleading about where the issue is:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Traverse Json tree/MapElements/Map/ParMultiDo(Anonymous).output [PCollection@1324829744].
...
...
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder
for com.fasterxml.jackson.databind.JsonNode.
Building a Coder using a registered CoderProvider failed.
Once I discovered this, I solved the problem by extending Beam's CustomCoder class. This abstract class is nice because you only have to write the code to serialize and deserialize the object:
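import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.commons.io.IOUtils; // assumption: Apache Commons IO; the original does not say which IOUtils
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class JsonNodeCoder extends CustomCoder<JsonNode> {
    // Serialize the node to its JSON text and write it as UTF-8 bytes.
    @Override
    public void encode(JsonNode node, OutputStream outStream) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        String nodeString = mapper.writeValueAsString(node);
        outStream.write(nodeString.getBytes(StandardCharsets.UTF_8));
    }

    // Read the remaining bytes of the stream and parse them back into a node.
    @Override
    public JsonNode decode(InputStream inStream) throws IOException {
        byte[] bytes = IOUtils.toByteArray(inStream);
        ObjectMapper mapper = new ObjectMapper();
        String json = new String(bytes, StandardCharsets.UTF_8);
        return mapper.readTree(json);
    }
}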
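The decode method above reads until the end of the stream, which is fine when each encoded value arrives in its own stream. If several values are ever written back to back into one stream (for example, if this coder were nested inside another coder), the decoder needs a way to know where its value ends. Whether that situation arises depends on the runner and the enclosing coders, which is not verified here. A common way to handle it is a length prefix. The sketch below shows the idea with the standard library only. `LengthPrefixedText` is a hypothetical name and not a Beam class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class LengthPrefixedText {
    // Writes a 4-byte length followed by the UTF-8 bytes of the text.
    public static void encode(String text, OutputStream out) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(bytes.length);
        data.write(bytes);
        data.flush();
    }

    // Reads exactly one value and leaves the rest of the stream untouched.
    public static String decode(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        byte[] bytes = new byte[length];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        encode("{\"Id\":1}", out);
        encode("{\"Id\":2}", out);
        InputStream in = new ByteArrayInputStream(out.toByteArray());
        System.out.println(decode(in)); // prints {"Id":1}
        System.out.println(decode(in)); // prints {"Id":2}
    }
}
```

The same framing could be applied to the JSON text inside JsonNodeCoder before it is parsed. That would be a change to the coder shown above, not something the original answer does.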
Hope this helps some other Beam newbie out there.