Apache Arrow getting vectors from Java in Python with zero copy

Question

I use the Apache Arrow libraries in Java (arrow-vector, arrow-memory-unsafe) and Python (pyarrow) in different processes.

I am trying to implement an in-memory, zero-copy DataFrame, but I can't find an appropriate API in the Java libraries for exposing the memory address of Arrow vectors so that Python can read them. I have found such an API in the pyarrow library, but not in the Java libraries.

What I need:

1. Create a vector in Java and collect the data in memory, using Arrow as the memory-map API.
2. Get the memory address or descriptor of the VectorSchemaRoot or its field vectors in Java.
3. Pass it to the Python library pyarrow.
4. Read the Apache Arrow vector data.

I have a problem with point 2.

Do you know how I can do that? Thank you!

Answer 1

The pyarrow.jvm module exists for this purpose. The following code should be sufficient to turn a VectorSchemaRoot into a RecordBatch:

import pyarrow.jvm

# vs_root must be a Python-side reference to a Java VectorSchemaRoot object
vs_root = <VectorSchemaRoot>
rb = pyarrow.jvm.record_batch(vs_root)

This works if you hold a Python reference to the Java VectorSchemaRoot object, e.g. by using JPype (see also https://uwekorn.com/2020/12/30/fast-jdbc-revisited.html for a complete example that applies this to JDBC).

If you use a different approach, you will need to iterate over the arrays of the VectorSchemaRoot, and then over the buffers of each array, to get the memory address of every individual buffer. These addresses can then be used to construct Buffer objects on the pyarrow side, and from those buffers, pyarrow.Array instances.

Answer 1 posted 2020-12-30 16:08:14