简体   繁体   中英

Database or Table Solution for Temporary Numpy Arrays

I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.

In an attempt to be efficient, I would only like to make this fit and the subsequent drawing of numbers one time for a specified region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. No SQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume building perspective.

Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like below, where instead of Float32Col I can use createTimeSeriesTable() from the scikits time series class or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below to hold the data. Or is there a better way to be going about this problem?

class Yield(IsDescription):
    geography_id = UInt16Col()
    data = Float32Col(shape=(50, 1)) # for 50 years of data

Any help on the matter would be greatly appreciated.

What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?

If so, just save the array to a temporary file (eg with numpy.save , or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.

On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.

Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (eg joins). If you don't need to join or relate tables, you just need fast simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM