
How do you write a list of struct vectors using Apache Arrow's Java Libs?

I've been struggling for a few days now, trying to write a list of struct vectors using Apache Arrow.

I'm basically trying to construct something along the lines of the following:

[[{"key1": "value1"}, {"key1": "value1"}, ...], [{"key1": "value1"}, {"key1": "value1"}, ...], ...]

I've tried numerous variations; below is one version that I think should work. It targets a list of struct vectors in which each struct holds several varchar and dateday fields, as well as one int field:

ListVector listVector = (ListVector) root.getVector("units");
listVector.allocateNew();

UnionListWriter listWriter = listVector.getWriter();

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct("unit");
    StructVector structVector = 
          (StructVector) structWriter.getField()
                                     .createVector(allocator);
    structVector.allocateNew();

    // using this alternative below, I can see the StructVector filling up, but still nothing in the ListVector
    // StructVector structVector = 
    //       (StructVector)listVector.getChildrenFromFields().get(0);
    // structVector.allocateNew();
    // BaseWriter.StructWriter structWriter = structVector.getWriter();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type" -> field names of 'varchar' type
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        structWriter.setPosition(x);

        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            String varCharVal = unitNode.get(varCharFieldName).asText();
            byte[] bytes = varCharVal.getBytes();
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes, 0, bytes.length);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed" -> field names of 'dateday' type
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = 
                  LocalDate.parse(unitNode.get(dateFieldName).asText(), ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName)
                        .writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }

        // "val" is a 64-bit bigint field, so read it as a long to avoid truncation
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
        structVector.setIndexDefined(x);
        structWriter.end();
    }

    structVector.setValueCount(units.size()); 
    listWriter.endList();

}

listVector.setValueCount(allUnits.size());

I can see that the structVector for a given 'unit' is populated with data, but the writes do not propagate to the 'units' list field itself, i.e. the list of 'unit' struct vectors.

Below is a gist of a Google Colab notebook that runs the example, more or less. It is easier to copy that code into an IDE of your choice and run it there, using the Maven dependencies that are also listed in the notebook.

https://gist.github.com/gmsharpe/52aee837db9ebcacaf87a7ac07667bac

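I removed the separate structVector, and the structWriter.close() call, from your listWriter loop, so that it now populates the units array. The struct writer returned by listWriter.struct() writes into the list's own child vector, whereas a vector built with getField().createVector(allocator) is a new, detached vector that the list never sees. Maybe you can continue from here: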

ListVector listVector = (ListVector) root.getVector("units"); 
UnionListWriter listWriter = listVector.getWriter();
listWriter.allocate();
listVector.allocateNew();

List<ArrayNode> allUnits = nodes.stream()
                                .map(n -> (ArrayNode)(n.get("units").get("USD")))
                                .collect(Collectors.toList());

for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();

    BaseWriter.StructWriter structWriter = listWriter.struct();

    ArrayNode units = allUnits.get(i);

    // "accn", "form", "fp", "fy", "type"
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        JsonNode unitNode = units.get(x);

        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            byte[] bytes;
            if (varCharFieldName.equals("type")) {
                bytes = "USD".getBytes();
            } else {
                String varCharVal = unitNode.get(varCharFieldName).asText();
                bytes = varCharVal.getBytes();
            }
            // Copy the bytes into a temporary buffer and release it once the value is written
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }

        // "end", "filed"
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = LocalDate.parse(unitNode.get(dateFieldName).asText(),
                                              DateTimeFormatter.ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName).writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }

        // "val" is a 64-bit bigint field, so read it as a long to avoid truncation
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
        structWriter.end();
    }
    listWriter.setValueCount(units.size());
    listWriter.endList();
}
listVector.setValueCount(allUnits.size());
// Print the populated root to check the result
System.out.println(root.contentToTSVString());
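
For reference, here is the same pattern reduced to a minimal, standalone sketch with no schema root and no JSON involved. It is not the code from the gist. The class name, the Unit holder and the three field names are made up for illustration. It assumes a recent Arrow Java release (ArrowBuf in org.apache.arrow.memory), with arrow-vector plus one memory implementation such as arrow-memory-netty on the classpath. On newer JDKs you may also need --add-opens=java.base/java.nio=ALL-UNNAMED.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.arrow.memory.ArrowBuf;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.complex.impl.UnionListWriter;
import org.apache.arrow.vector.complex.writer.BaseWriter;

public class ListOfStructsSketch {

    /** Hypothetical stand-in for one JSON "unit" entry. */
    static final class Unit {
        final String form;
        final LocalDate end;
        final long val;

        Unit(String form, LocalDate end, long val) {
            this.form = form;
            this.end = end;
            this.val = val;
        }
    }

    /** Writes one list of structs per outer entry and returns the size of each list read back. */
    static List<Integer> writeAndCount(List<List<Unit>> allUnits) {
        List<Integer> sizes = new ArrayList<>();
        try (BufferAllocator allocator = new RootAllocator();
             ListVector listVector = ListVector.empty("units", allocator)) {
            UnionListWriter listWriter = listVector.getWriter();
            listWriter.allocate();

            for (int i = 0; i < allUnits.size(); i++) {
                listWriter.setPosition(i);
                listWriter.startList();
                for (Unit unit : allUnits.get(i)) {
                    // The struct writer comes from the list writer, so every write
                    // lands in the list's own child vector; no separate StructVector.
                    BaseWriter.StructWriter structWriter = listWriter.struct();
                    structWriter.start();

                    byte[] bytes = unit.form.getBytes(StandardCharsets.UTF_8);
                    try (ArrowBuf buf = allocator.buffer(bytes.length)) {
                        buf.setBytes(0, bytes);
                        structWriter.varChar("form").writeVarChar(0, bytes.length, buf);
                    }
                    // dateday stores the number of days since 1970-01-01 as an int
                    structWriter.dateDay("end").writeDateDay(Math.toIntExact(unit.end.toEpochDay()));
                    structWriter.bigInt("val").writeBigInt(unit.val);

                    structWriter.end();
                }
                listWriter.endList();
            }
            listVector.setValueCount(allUnits.size());

            // Read each list back to confirm the structs reached the list vector
            for (int i = 0; i < listVector.getValueCount(); i++) {
                sizes.add(((List<?>) listVector.getObject(i)).size());
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<List<Unit>> data = Arrays.asList(
                Arrays.asList(new Unit("10-K", LocalDate.parse("2020-12-31"), 100L),
                              new Unit("10-Q", LocalDate.parse("2021-03-31"), 200L)),
                Arrays.asList(new Unit("10-K", LocalDate.parse("2021-12-31"), 300L)));
        System.out.println(writeAndCount(data));
    }
}
```

If the sizes printed match the sizes of the input lists, the writes reached the list vector. The only structural difference from the code in the question is that no second StructVector is created or allocated.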
