
How do you write a list of struct vectors using Apache Arrow's Java Libs?

I've been struggling for the past few days trying to write a list of struct vectors with Apache Arrow.

I'm basically trying to build something along these lines:

[[{"key1", "value1"},{"key1", "value1"},...],[{"key1", "value1"}, {"key1", "value1"}...]...]

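I've tried many variations, but below is the version I think should work for a list of struct vectors, where each struct vector contains several varchar and dateday fields plus one int field: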

ListVector listVector = (ListVector) root.getVector("units");
listVector.allocateNew();

UnionListWriter listWriter = listVector.getWriter();

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct("unit");
    StructVector structVector = 
          (StructVector) structWriter.getField()
                                     .createVector(allocator);
    structVector.allocateNew();

    // using this alternative below, I can see the StructVector filling up, but still nothing in the ListVector
    // StructVector structVector = 
    //       (StructVector)listVector.getChildrenFromFields().get(0);
    // structVector.allocateNew();
    // BaseWriter.StructWriter structWriter = structVector.getWriter();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type" -> field names of 'varchar' type
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        structWriter.setPosition(x);

        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            String varCharVal = unitNode.get(varCharFieldName).asText();
            byte[] bytes = varCharVal.getBytes();
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes, 0, bytes.length);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed" -> field names of 'dateday' type
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = 
                  LocalDate.parse(unitNode.get(dateFieldName).asText(), ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName)
                        .writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }

        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asInt());
        structVector.setIndexDefined(x);
        structWriter.end();
    }

    structVector.setValueCount(units.size()); 
    listWriter.endList();

}

listVector.setValueCount(allUnits.size());

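I can see the structVector filling up with data for a given "unit" struct vector, but the writes don't propagate to the list of "unit" struct vectors, i.e. the "units" list field itself.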

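Below is a gist of a Google Colab notebook that will more or less run the example. It's probably best to take that code and run it in the IDE of your choice, using the Maven dependencies specified in the notebook.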

https://gist.github.com/gmsharpe/52aee837db9ebcacaf87a7ac07667bac

I removed the structVector and structWriter.close() from your listWriter loop, so that it now populates the units array. Maybe you can continue from here:

ListVector listVector = (ListVector) root.getVector("units"); 
UnionListWriter listWriter = listVector.getWriter();
listWriter.allocate();
listVector.allocateNew();

List<ArrayNode> allUnits = nodes.stream()
                                .map(n -> (ArrayNode)(n.get("units").get("USD")))
                                .collect(Collectors.toList());

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type"
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            byte[] bytes;
            if (varCharFieldName.equals("type")) {
                bytes = "USD".getBytes(StandardCharsets.UTF_8);
            } else {
                String varCharVal = unitNode.get(varCharFieldName).asText();
                bytes = varCharVal.getBytes(StandardCharsets.UTF_8);
            }
            // writeVarChar copies the bytes into the vector, so the temporary buffer can be released right away
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed"
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = LocalDate.parse(unitNode.get(dateFieldName).asText(),
                                              DateTimeFormatter.ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName).writeDateDay(Math.toIntExact(date.toEpochDay()));
        }

        // bigInt is a 64-bit field, so read the value as a long to avoid truncating large amounts
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
        structWriter.end();
    }
    listWriter.setValueCount(units.size());
    listWriter.endList();
}
listVector.setValueCount(allUnits.size());
System.out.println(root.contentToTSVString());
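
For intuition about why filling a separately created StructVector leaves the list empty, here is a minimal plain-Java sketch of how a list-of-struct column is laid out: one offsets array plus a single flattened child shared by all lists. The class ListOfStructLayout and its members are hypothetical stand-ins, not Arrow APIs, and strings stand in for structs. Under that model, only values written into the list's own child are visible through the list, which is, as far as I understand it, what the writer returned by listWriter.struct() does.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java model of a list<struct> column (not an Arrow class). */
class ListOfStructLayout {
    // offsets.get(i) .. offsets.get(i + 1) delimit the child rows that belong to list i
    final List<Integer> offsets = new ArrayList<>(List.of(0));
    // single flattened child shared by all lists; each String stands in for one struct
    final List<String> child = new ArrayList<>();

    /** Appends one list: write its structs into the shared child, then record the end offset. */
    void appendList(List<String> structs) {
        child.addAll(structs);
        offsets.add(child.size());
    }

    /** Reads list i back by slicing the shared child between two consecutive offsets. */
    List<String> getList(int i) {
        return child.subList(offsets.get(i), offsets.get(i + 1));
    }

    public static void main(String[] args) {
        ListOfStructLayout units = new ListOfStructLayout();
        units.appendList(List.of("{accn=a1}", "{accn=a2}"));
        units.appendList(List.of("{accn=b1}"));

        // A detached container, analogous to a vector created separately from the list,
        // can be filled without the list ever seeing its contents.
        List<String> detached = new ArrayList<>(List.of("{accn=lost}"));

        System.out.println(units.offsets);                          // [0, 2, 3]
        System.out.println(units.getList(1));                       // [{accn=b1}]
        System.out.println(units.child.contains(detached.get(0)));  // false
    }
}
```

In the question's version, the vector obtained from structWriter.getField().createVector(allocator) plays the role of the detached container here: it fills up, but the list's offsets and child never change.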

