Apache Arrow getting vectors from Java in Python with zero copy

I use the Apache Arrow libraries in Java (arrow-vector, arrow-memory-unsafe) and Python (pyarrow) in different processes.

I am trying to implement an in-memory, zero-copy DataFrame, but I can't find an appropriate API in the Java libraries to get the memory address of Arrow vectors so that Python can read them. I have found such an API in the pyarrow library, but not in the Java libraries.

What I need:

  1. Create a vector in Java and collect data in memory, using Arrow as a memory map API
  2. Get the memory address or descriptor of the VectorSchemaRoot or of its field vectors in Java
  3. Pass it to the Python library pyarrow
  4. Read the Apache Arrow vector data

I have a problem with point 2.

Do you know how I can do that? Thank you!

There is the pyarrow.jvm module for this. The following code should be sufficient to turn a VectorSchemaRoot into a RecordBatch:

import pyarrow.jvm

vs_root = ...  # placeholder: a Python reference to a Java VectorSchemaRoot (e.g. obtained via jpype)
rb = pyarrow.jvm.record_batch(vs_root)

This works if you have a Python reference to the Java VectorSchemaRoot object, e.g. by using jpype (see also https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html for a full example of that with JDBC).

If you use a different approach, you will need to iterate over the arrays of the VectorSchemaRoot and then over their buffers to get the individual memory addresses of all buffers. These can then be used to construct Buffer objects on the pyarrow side and, in turn, pyarrow.Array instances.
