How do you write a list of struct vectors using Apache Arrow's Java Libs?
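I have been struggling for the past few days trying to write a list of struct vectors with Apache Arrow.
I am basically trying to build something along these lines: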
[[{"key1", "value1"},{"key1", "value1"},...],[{"key1", "value1"}, {"key1", "value1"}...]...]
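I have tried many variations, but below is one version that I thought should work for a list of struct vectors, where each struct vector contains a set of several varchar and dateday fields plus one int field: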
ListVector listVector = (ListVector) root.getVector("units");
listVector.allocateNew();
UnionListWriter listWriter = listVector.getWriter();
for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();
    BaseWriter.StructWriter structWriter = listWriter.struct("unit");
    StructVector structVector =
        (StructVector) structWriter.getField()
            .createVector(allocator);
    structVector.allocateNew();
    // using this alternative below, I can see the StructVector filling up, but still nothing in the ListVector
    // StructVector structVector =
    //     (StructVector) listVector.getChildrenFromFields().get(0);
    // structVector.allocateNew();
    // BaseWriter.StructWriter structWriter = structVector.getWriter();
    ArrayNode units = allUnits.get(i);
    // "accn", "form", "fp", "fy", "type" -> field names of 'varchar' type
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        structWriter.setPosition(x);
        JsonNode unitNode = units.get(x);
        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            String varCharVal = unitNode.get(varCharFieldName).asText();
            byte[] bytes = varCharVal.getBytes();
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes, 0, bytes.length);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }
        // "end", "filed" -> field names of 'dateday' type
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date =
                LocalDate.parse(unitNode.get(dateFieldName).asText(), ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName)
                .writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asInt());
        structVector.setIndexDefined(x);
        structWriter.end();
    }
    structVector.setValueCount(units.size());
    listWriter.endList();
}
listVector.setValueCount(allUnits.size());
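I can see the structVector being filled with the data for a given "unit" struct vector, but the writes do not propagate to the list of "unit" struct vectors, i.e. the "units" list field itself.
Below is a gist of a Google Colab notebook that will more or less run the example. It is probably best to take the code and run it in the IDE of your choice, using the Maven dependencies specified in the notebook.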
https://gist.github.com/gmsharpe/52aee837db9ebcacaf87a7ac07667bac
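As a side note independent of Arrow, the per-field conversions used in the loop above can be exercised with plain JDK classes. The sketch below assumes that the dateday fields hold days since 1970-01-01 as a 32-bit int and that varchar data is UTF-8; the class and method names (UnitFieldConversions, toEpochDayInt, toUtf8) are hypothetical and only serve the illustration.

```java
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper class; plain JDK only, no Arrow calls. */
public class UnitFieldConversions {

    /** Days since 1970-01-01 as an int; throws instead of silently overflowing. */
    static int toEpochDayInt(String isoDate) {
        LocalDate date = LocalDate.parse(isoDate, DateTimeFormatter.ISO_LOCAL_DATE);
        return Math.toIntExact(date.toEpochDay());
    }

    /** Encodes with an explicit charset rather than the platform default. */
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toEpochDayInt("1970-01-02")); // 1
        System.out.println(toEpochDayInt("2020-01-01")); // 18262
        System.out.println(toUtf8("10-K").length);       // 4

        // A value above Integer.MAX_VALUE does not survive a narrowing cast,
        // which matters when a 64-bit field is fed from an int.
        long big = 3_000_000_000L;
        System.out.println((int) big == big);            // false
    }
}
```

Math.toIntExact is used here in place of Long.valueOf(...).intValue() only because it fails loudly on overflow; for realistic dates both give the same result.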
I removed the structVector from your listWriter loop, as well as the structWriter.close(), so that it now fills the units array. Maybe you can continue from here:
ListVector listVector = (ListVector) root.getVector("units");
UnionListWriter listWriter = listVector.getWriter();
listWriter.allocate();
listVector.allocateNew();
List<ArrayNode> allUnits = nodes.stream()
    .map(n -> (ArrayNode) (n.get("units").get("USD")))
    .collect(Collectors.toList());
for (int i = 0; i < allUnits.size(); i++) {
    listWriter.setPosition(i);
    listWriter.startList();
    // Structs written through this writer go into the list's own child vector
    BaseWriter.StructWriter structWriter = listWriter.struct();
    ArrayNode units = allUnits.get(i);
    // "accn", "form", "fp", "fy", "type"
    for (int x = 0; x < units.size(); x++) {
        structWriter.start();
        JsonNode unitNode = units.get(x);
        for (String varCharFieldName : UNIT_VARCHAR_FIELDS) {
            byte[] bytes;
            if (varCharFieldName.equals("type")) {
                bytes = "USD".getBytes();
            } else {
                String varCharVal = unitNode.get(varCharFieldName).asText();
                bytes = varCharVal.getBytes();
            }
            // Release the temporary buffer once its bytes have been written
            try (ArrowBuf tempBuf = allocator.buffer(bytes.length)) {
                tempBuf.setBytes(0, bytes);
                structWriter.varChar(varCharFieldName).writeVarChar(0, bytes.length, tempBuf);
            }
        }
        // "end", "filed"
        for (String dateFieldName : UNIT_DATE_FIELDS) {
            LocalDate date = LocalDate.parse(unitNode.get(dateFieldName).asText(),
                DateTimeFormatter.ISO_LOCAL_DATE);
            structWriter.dateDay(dateFieldName).writeDateDay(Long.valueOf(date.toEpochDay()).intValue());
        }
        // BigInt is 64-bit, so read the value as a long to avoid truncation
        structWriter.bigInt("val").writeBigInt(unitNode.get("val").asLong());
        structWriter.end();
    }
    listWriter.setValueCount(units.size());
    listWriter.endList();
}
listVector.setValueCount(allUnits.size());
System.out.println(root.contentToTSVString());
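
A plain-Java model of the list-of-struct layout may help explain why writing through listWriter.struct() fills the units list, while a StructVector created separately with createVector(allocator) does not. In Arrow's columnar format a list column keeps one shared child array plus an offsets buffer that marks where each list begins and ends, so a struct only counts as part of a list if it lands in that shared child. The sketch below is not Arrow API; the class ListOfStructLayout and its members are hypothetical names used only to illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plain-Java model of a list-of-struct column; not Arrow API. */
public class ListOfStructLayout {
    // Child "struct" columns, shared by every list in the column.
    final List<String> key = new ArrayList<>();
    final List<Long> val = new ArrayList<>();
    // offsets.get(i) and offsets.get(i + 1) delimit the structs of list i.
    final List<Integer> offsets = new ArrayList<>(List.of(0));

    /** Appends one struct to the list that is currently being written. */
    void writeStruct(String k, long v) {
        key.add(k);
        val.add(v);
    }

    /** Closes the current list by recording where the child data ends. */
    void endList() {
        offsets.add(key.size());
    }

    /** Number of structs in list i, derived only from the offsets. */
    int listSize(int i) {
        return offsets.get(i + 1) - offsets.get(i);
    }

    public static void main(String[] args) {
        ListOfStructLayout units = new ListOfStructLayout();
        units.writeStruct("a", 1L);
        units.writeStruct("b", 2L);
        units.endList();                    // list 0 holds two structs
        units.writeStruct("c", 3L);
        units.endList();                    // list 1 holds one struct
        System.out.println(units.offsets);  // [0, 2, 3]
        System.out.println(units.key);      // [a, b, c]
    }
}
```

In this model, data written into some other, unrelated pair of key/val lists would never show up in listSize, which mirrors the symptom described in the question: the detached struct vector fills up, but the list stays empty.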