
Apache Arrow getting vectors from Java in Python with zero copy

I use the Apache Arrow libraries in Java (arrow-vector, arrow-memory-unsafe) and Python (pyarrow) in different processes.

I am trying to implement an in-memory, zero-copy DataFrame, but I can't find an appropriate API in the Java libraries to obtain the memory address of Arrow vectors so that they can be read from Python. I have found such an API in pyarrow, but not in the Java libraries.

What I need:

  1. Create a vector in Java and collect data in memory, using Arrow as a memory-map API
  2. Get the memory address or a descriptor of the VectorSchemaRoot or its field vectors in Java
  3. Pass it to the Python library pyarrow
  4. Read the Apache Arrow vector data

My problem is with point 2.

Do you know how I can do that? Thank you!

There is the pyarrow.jvm module for this. The following code should be sufficient to turn a VectorSchemaRoot into a RecordBatch:

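import pyarrow.jvm

# vs_root must be a Python-side reference to a Java VectorSchemaRoot
# (for example, one obtained through JPype)
vs_root = <VectorSchemaRoot>
# Convert the Java object into a pyarrow RecordBatch
rb = pyarrow.jvm.record_batch(vs_root)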

This works if you have a Python reference to the Java VectorSchemaRoot object, e.g. through JPype (see also https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html for a complete example that applies this to JDBC).

If you use a different approach, you will need to iterate over the vectors of the VectorSchemaRoot and then over the buffers of each vector to get the memory addresses of all individual buffers. These addresses can then be used to construct Buffer objects on the pyarrow side and, from those, pyarrow.Array instances.

