Read large JSON file in Lua

I am trying to read some data from a JSON file in order to use it from Lua. The data are sound files that have been preprocessed in Python and stored in JSON for easier access.

The file is roughly 800 MB. When I try to read the entire file with file:read("*all"), I get a "not enough memory" error. The libraries I have looked at are lua-json, lua-cjson and luajson. The first two don't provide a method to access files directly. The third one does, but it is just a wrapper that calls f:read().

My ultimate goal is to use Torch to train some models on audio data, but I want to keep the processing of the raw signals in Python. I chose JSON over other formats for convenience, so if you think another format would work better, I am open to ideas.

I'm not sure JSON is the best format for storing audio data, but in this situation it seems you'll need to write your own JSON parser that reads the file incrementally, parses the data, and passes it through your training process without holding the entire data set in memory.

Since the JSON format is fairly simple and you can limit the processing to just your own layout, it should be relatively straightforward to write a SAX-like parser that generates the events you need. This SO answer may be a good starting point (or at least give you ideas on what keywords to search for).
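To make the streaming idea concrete, here is a minimal sketch in Python. A Lua version would follow the same structure. It assumes the file is one top-level JSON array of records, which the question does not actually state, and the function name iter_array_items is made up for illustration. It reads the text in chunks and hands back one element at a time, so only the current chunk plus one element is held in memory. It is not tuned for speed.

```python
import io
import json

_WS = " \t\r\n"


def iter_array_items(stream, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    `stream` is any text-mode file-like object. Hypothetical helper,
    written for illustration only.
    """
    decoder = json.JSONDecoder()
    buf = ""          # unparsed text read so far
    eof = False       # True once the stream is exhausted
    started = False   # True once the opening '[' has been consumed
    while True:
        buf = buf.lstrip(_WS)
        if buf:
            if not started:
                if buf[0] != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                buf = buf[1:]
                continue
            if buf[0] == "]":
                return
            try:
                item, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                if eof:
                    raise  # genuinely malformed, not just incomplete
            else:
                rest = buf[end:].lstrip(_WS)
                # Accept the element only when a separator follows it;
                # otherwise a number such as 1.5e3 could be cut at "1.5".
                if rest[:1] == ",":
                    buf = rest[1:]
                    yield item
                    continue
                if rest[:1] == "]":
                    buf = rest
                    yield item
                    continue
                if eof:
                    raise ValueError("malformed or truncated JSON array")
        elif eof:
            raise ValueError("unexpected end of input")
        # The element is incomplete: read more text and try again.
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


# Tiny in-memory stand-in for the real file (made-up records).
demo = io.StringIO(
    '[{"name": "a", "samples": [0.1, -0.2]}, {"name": "b", "samples": [1.5e-3]}]'
)
for record in iter_array_items(demo, chunk_size=8):
    print(record["name"], len(record["samples"]))
```

With a real file you would pass open(path, "r") instead of the StringIO object and feed each record to the training step as it arrives.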

You have two options:

Option 1: Install Torch with Lua 5.2 instead of LuaJIT. Nothing else changes and everything works as expected, but you can now load your JSON file and decode it without memory issues (the limit you are hitting comes from LuaJIT, which caps the memory available to Lua objects, not from your machine). To do this:

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh

Option 2: Use HDF5 to save your Python pre-processed files, and use torch-hdf5 to load them. HDF5 is much better suited to your data than JSON anyway.
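For the Python side of Option 2, a minimal sketch using h5py could look like this. The clip names, array sizes, and the helper names save_clips and load_clip are made up for illustration, and it assumes each clip is a 1-D array of samples. Reading one dataset at a time means the whole file never has to sit in memory.

```python
import os
import tempfile

import h5py
import numpy as np


def save_clips(path, clips):
    """Write each clip to its own HDF5 dataset (hypothetical helper)."""
    with h5py.File(path, "w") as f:
        for name, samples in clips.items():
            f.create_dataset(name, data=np.asarray(samples, dtype=np.float32))


def load_clip(path, name):
    """Read a single dataset back without touching the others."""
    with h5py.File(path, "r") as f:
        return f[name][...]


# Made-up stand-ins for the preprocessed audio.
clips = {
    "clip_0": np.zeros(16000),
    "clip_1": np.linspace(-1.0, 1.0, 8000),
}
path = os.path.join(tempfile.mkdtemp(), "audio.h5")
save_clips(path, clips)
first = load_clip(path, "clip_0")
print(first.shape, first.dtype)
```

On the Torch side, torch-hdf5 should then be able to read a dataset with something along the lines of hdf5.open(path, 'r'):read('/clip_0'):all(), but check its documentation for the exact usage.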

Instead of using JSON, you could also try npy4th and save your data as an "npz" file.
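Writing the "npz" file from Python only needs numpy. Here is a minimal sketch, with made-up clip names and sizes, that saves several arrays into one archive and reads one of them back to check the round trip. The Lua-side loading call is not shown, see the npy4th README for that.

```python
import os
import tempfile

import numpy as np

# Made-up stand-ins for the preprocessed audio.
clips = {
    "clip_0": np.zeros(16000, dtype=np.float32),
    "clip_1": np.linspace(-1.0, 1.0, 8000).astype(np.float32),
}

path = os.path.join(tempfile.mkdtemp(), "audio.npz")
np.savez(path, **clips)  # one named array per clip

# Arrays are loaded from the archive lazily, one name at a time.
with np.load(path) as archive:
    names = sorted(archive.files)
    restored = archive["clip_1"]

print(names, restored.shape)
```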

Another option is lutorpy, a library that lets you run Lua/Torch from Python and provides convenient utilities for converting between numpy arrays and torch tensors. The advantage is that no memory copy or disk copy is necessary: the two share the underlying memory, so the conversion is very fast. Check the website for more information.

A basic example:

import lutorpy as lua
import numpy as np

## use require("MODULE") to import lua modules
require("nn")

## run lua code in python with minimal modification: replace ":" with "._"
t = torch.DoubleTensor(10,3)
print(t._size()) # the corresponding lua version is t:size()

## or, you can use numpy array
xn = np.random.randn(100)
## convert the numpy array into torch tensor
xt = torch.fromNumpyArray(xn)

## convert torch tensor to numpy array
### Note: both objects share the same underlying memory, so the conversion is instant
arr = xt.asNumpyArray()
print(arr.shape)
