
How do you write a list of struct vectors using Apache Arrow's Java libraries?

I've been struggling for a few days now, trying to write a list of struct vectors using Apache Arrow.

I'm basically trying to construct something along the lines of the following:

[[{"key1": "value1"}, {"key1": "value1"}, ...], [{"key1": "value1"}, {"key1": "value1"}, ...], ...]

I've tried numerous variations. Below is one version that I think should work, for a list of struct vectors where each struct vector contains several varchar and dateday fields, as well as one int field:

ListVector listVector = (ListVector) root.getVector("units");
listVector.allocateNew();

UnionListWriter listWriter = listVector.getWriter();

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct("unit");
    StructVector structVector = 
          (StructVector) structWriter.getField()
                                     .createVector(allocator);
    structVector.allocateNew();

    // using this alternative below, I can see the StructVector filling up, but still nothing in the ListVector
    // StructVector structVector = 
    //       (StructVector)listVector.getChildrenFromFields().get(0);
    // structVector.allocateNew();
    // BaseWriter.StructWriter structWriter = structVector.getWriter();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type" -> field names of 'varchar' type
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        structWriter.setPosition(x);

        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            String varCharVal = unitNode.get(varCharFieldName).asText();
            byte[] bytes = varCharVal.getBytes();
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes, 0, bytes.length);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed" -> field names of 'dateday' type
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = 
                  LocalDate.parse(unitNode.get(dateFieldName).asText(), ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName)
                        .writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }

        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asInt());
        structVector.setIndexDefined(x);
        structWriter.end();
    }

    structVector.setValueCount(units.size()); 
    listWriter.endList();

}

listVector.setValueCount(allUnits.size());

I can see that the structVector is populating with data for a given 'unit' struct vector, but the writes do not propagate to the list of 'unit' struct vectors, that is, to the 'units' list field itself.

Below is a gist of a Google Colab notebook that will run the example, more or less. It's better to take that code and run it in an IDE of your choice, using the Maven dependencies that are also specified in the notebook.

https://gist.github.com/gmsharpe/52aee837db9ebcacaf87a7ac07667bac

I removed structVector and structWriter.close() from your listWriter loop, so that it now populates the units array. The struct writer returned by listWriter.struct() already writes into the list's own child vector, so a separately created StructVector is not needed. Maybe you can continue from here:

ListVector listVector = (ListVector) root.getVector("units"); 
UnionListWriter listWriter = listVector.getWriter();
listWriter.allocate();
listVector.allocateNew();

List<ArrayNode> allUnits = nodes.stream()
                                .map(n -> (ArrayNode)(n.get("units").get("USD")))
                                .collect(Collectors.toList());

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type"
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            byte[] bytes;
            if (varCharFieldName.equals("type")) {
                bytes = "USD".getBytes(StandardCharsets.UTF_8);
            } else {
                String varCharVal = unitNode.get(varCharFieldName).asText();
                bytes = varCharVal.getBytes(StandardCharsets.UTF_8);
            }
            // writeVarChar copies the bytes into the vector, so the temporary
            // buffer can be released as soon as the write returns
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed"
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = LocalDate.parse(unitNode.get(dateFieldName).asText(),
                                              DateTimeFormatter.ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName).writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }

        // bigInt is a 64-bit field, so read the JSON value as a long to avoid truncating large amounts
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
        structWriter.end();
    }
    listWriter.setValueCount(units.size());
    listWriter.endList();
}
listVector.setValueCount(allUnits.size());
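// contentToTSVString() only returns the text, so print it to inspect the result
System.out.println(root.contentToTSVString());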
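
To check the pattern in isolation, here is a minimal, self-contained sketch of the same list-of-structs write, followed by a read-back of each list's length. It is not taken from the gist: the class name ListOfStructsSketch, the method writeAndReadSizes, the sample data, and the single "val" field are illustrative assumptions, and it uses a standalone ListVector instead of one obtained from a VectorSchemaRoot. It assumes arrow-vector and an Arrow memory implementation are on the classpath.

```java
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.complex.impl.UnionListWriter;
import org.apache.arrow.vector.complex.writer.BaseWriter;

public class ListOfStructsSketch {

    // Hypothetical sample data: two lists, holding two structs and one struct.
    static final long[][] SAMPLE = {{10L, 20L}, {30L}};

    /** Writes SAMPLE as list<struct<val: bigint>> and returns the length of each list. */
    static int[] writeAndReadSizes() {
        try (BufferAllocator allocator = new RootAllocator();
             ListVector listVector = ListVector.empty("units", allocator)) {

            UnionListWriter listWriter = listVector.getWriter();
            listWriter.allocate();

            for (int i = 0; i < SAMPLE.length; i++) {
                listWriter.setPosition(i);
                listWriter.startList();
                for (long val : SAMPLE[i]) {
                    // The struct writer comes from the list writer, so every
                    // start()/end() pair appends one struct to the current list.
                    BaseWriter.StructWriter structWriter = listWriter.struct();
                    structWriter.start();
                    structWriter.bigInt("val").writeBigInt(val);
                    structWriter.end();
                }
                listWriter.endList();
            }
            listVector.setValueCount(SAMPLE.length);

            // Read back: each list element is materialized as a java.util.List.
            int[] sizes = new int[SAMPLE.length];
            for (int i = 0; i < sizes.length; i++) {
                sizes[i] = ((List<?>) listVector.getObject(i)).size();
            }
            return sizes;
        }
    }

    public static void main(String[] args) {
        for (int size : writeAndReadSizes()) {
            System.out.println(size);
        }
    }
}
```

If the list sizes read back match the sample data, the structs reached the list vector's child rather than a detached StructVector, which is the problem described in the question.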
