简体   繁体   English

读取Python之外的numpy数组

[英]Reading numpy arrays outside of Python

In a recent question I asked about the fastest way to convert a large numpy array to a delimited string. 在最近的一个问题中,我询问了将大型numpy数组转换为分隔字符串的最快方法。 My reason for asking was because I wanted to take that plain text string and transmit it (over HTTP for instance) to clients written in other programming languages. 我之所以提出这个问题,是因为我想把这个纯文本字符串转换成(例如通过HTTP)发送给用其他编程语言编写的客户端。 A delimited string of numbers is obviously something that any client program can work with easily. 一个分隔的数字字符串显然是任何客户端程序可以轻松使用的东西。 However, it was suggested that because string conversion is slow, it would be faster on the Python side to do base64 encoding on the array and send it as binary. 但是,有人建议,因为字符串转换很慢,所以在Python端对数组进行base64编码并将其作为二进制发送会更快。 This is indeed faster. 这确实更快。

My question now is, (1) how can I make sure my encoded numpy array will travel well to clients on different operating systems and different hardware, and (2) how do I decode the binary data on the client side. 我现在的问题是,(1)如何确保我的编码numpy阵列能够很好地运行到不同操作系统和不同硬件上的客户端,以及(2)如何在客户端解码二进制数据。

For (1), my inclination is to do something like the following 对于(1),我倾向于做类似以下的事情

import numpy as np
import base64
x = np.arange(100, dtype=np.float64)
base64.b64encode(x.tostring())

Is there anything else I need to do? 还有什么我需要做的吗?

For (2), I would be happy to have an example in any programming language, where the goal is to take the numpy array of floats and turn them into a similar native data structure. 对于(2),我很乐意在任何编程语言中都有一个例子,其目标是获取numpy浮点数并将它们转换为类似的本机数据结构。 Assume we have already done base64 decoding and have a byte array, and that we also know the numpy dtype, dimensions, and any other metadata which will be needed. 假设我们已经完成了base64解码并且有一个字节数组,并且我们也知道numpy dtype,维度以及将需要的任何其他元数据。

Thanks. 谢谢。

You should really look into OPeNDAP to simplify all aspects of scientific data networking. 您应该真正研究OPeNDAP以简化科学数据网络的所有方面。 For Python, check out Pydap . 对于Python,请查看Pydap

You can directly store your NumPy arrays into HDF5 format via h5py (or NetCDF), then stream the data to clients over HTTP using OPeNDAP. 您可以通过h5py (或NetCDF)将NumPy阵列直接存储为HDF5格式,然后使用OPeNDAP通过HTTP将数据流传输到客户端。

For something a little lighter-weight than HDF (though admittedly also more ad-hoc), you could also use JSON: 对于比HDF轻一点的东西(虽然不可否认也是特别的),你也可以使用JSON:

import json
import numpy as np

x = np.arange(100, dtype=np.float64)

print json.dumps(dict(data=x.tostring(),
                      shape=x.shape,
                      dtype=str(x.dtype)))

This would free your clients from needing to install HDF wrappers, at the expense of having to deal with a nonstandard protocol for data exchange (and possibly also needing to install JSON bindings !). 这将使您的客户无需安装HDF包装器,代价是必须处理非标准协议进行数据交换(并且可能还需要安装JSON绑定!)。

The tradeoff would be up to you to evaluate for your situation. 权衡取决于您根据自己的情况进行评估。

I'd recommend using an existing data format for interchange of scientific data/arrays, such as NetCDF or HDF . 我建议使用现有的数据格式来交换科学数据/数组,例如NetCDFHDF In Python, you can use the PyNIO library which has numpy bindings, and there are several libraries for other languages. 在Python中,您可以使用具有numpy绑定的PyNIO库,并且有几个用于其他语言的库。 Both formats are built for handling large data and take care of language, machine representation problems, etc. They also work well with message passing, for example in parallel computing, so I suspect your use case is satisfied. 这两种格式都是为处理大数据和处理语言,机器表示问题等而构建的。它们也适用于消息传递,例如在并行计算中,所以我怀疑你的用例是满意的。

What the tostring method of numpy arrays does is basically give you a dump of the memory used by the array's data (not the object wrapper for Python, but just the data of the array.) This is similar to the struct stdlib module. numpy数组的tostring方法的作用基本上是为了转储数组数据所使用的内存(不是Python的对象包装器,而只是数组的数据。)这类似于struct stdlib模块。 Base64-encoding that string and sending it across should be quite good enough, although you may also need to send across the actual datatype used, as well as the dimensions if it's a multidimensional array, as you won't be able to tell those just from the data. Base64编码该字符串并将其发送到应该是非常好的,尽管您可能还需要发送所使用的实际数据类型,以及尺寸,如果它是一个多维数组,因为您将无法告诉那些只是从数据。

On the other side, how to read the data depends a little on the language. 另一方面,如何阅读数据取决于语言。 Most languages have a way of addressing such a block of memory as a particular type of array. 大多数语言都有一种方法可以将这样的内存块作为特定类型的数组来处理。 For example, in C, you could simply base64-decode the string, assign it to (in the case of your example) a float64 * and index away. 例如,在C中,您可以简单地对字符串进行base64解码,将其分配给(在您的示例中) float64 *并将其索引。 This doesn't give you any of the built-in safeguards and functions and other operations that numpy arrays have in Python, but that's because C is quite a different language in that respect. 这并没有给你任何内置的安全措施和函数以及numpy数组在Python中的其他操作,但那是因为C在这方面是一种完全不同的语言。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM