Real-time access to simple but large data set with Python

I need to access a large but simple data set frequently and in real time on a smallish (700 MHz) device. The data set contains around 400,000 mappings from abbreviations to the words they abbreviate, e.g. "frgm" to "fragment". Reads will happen frequently while the device is in use and should take no more than 15-20 ms each.

My first attempt was to use SQLite to create a simple database containing a single table, in which each row consists of two strings:

CREATE TABLE WordMappings (key text, word text)

This table is created once and although alterations are possible, only read-access is time critical.

Following this guide, my SELECT statement looks as follows:

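def databaseQuery(self, query_string):
    # Bind the key as a parameter instead of concatenating it into the SQL text;
    # this avoids quoting errors and SQL injection.
    self.cursor.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", (query_string,))
    result = self.cursor.fetchone()

    # fetchone() returns None when no row matches the key.
    return result[0] if result is not None else None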

However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data in less than ~60 ms, which is far too slow.

Any suggestions on how to improve performance using SQLite or would another approach yield more promising results?

You can speed up lookups on the key column by creating an index for it:

CREATE INDEX key_index ON WordMappings(key);

To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
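
As a rough sketch of that check from Python's sqlite3 module: the in-memory database, the single sample row, and the index name idx_wordmappings_key are assumptions made for illustration, not part of the original setup. The exact wording of the plan differs between SQLite versions, so the sketch only prints it.

```python
import sqlite3

# Assumption: an in-memory database with one sample row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordMappings (key text, word text)")
conn.execute("INSERT INTO WordMappings VALUES (?, ?)", ("frgm", "fragment"))

query = "SELECT word FROM WordMappings WHERE key = ? LIMIT 1"

def plan(connection):
    """Return the query plan for the lookup as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + query, ("frgm",)).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " ".join(row[-1] for row in rows)

plan_before = plan(conn)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_wordmappings_key ON WordMappings(key)")
plan_after = plan(conn)   # with the index: a search that names the index

print("before:", plan_before)
print("after: ", plan_after)
```

If the second plan still reports a scan rather than a search using the index, the lookup is not benefiting from it.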

A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.

I have not used it personally, but a friend uses PyTables for large time-series data; it may be worth looking into.

It turns out that defining a primary key speeds up individual queries by an order of magnitude.

Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5 ms, which satisfies the requirements.

The table is now created as follows:

CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
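
A minimal sketch of how such a test could be reproduced with Python's sqlite3 module. The in-memory database, the row count of 20,000, and the helper names random_text and lookup are assumptions for illustration; timings measured this way on a desktop machine say little about a 700 MHz device with on-disk storage.

```python
import random
import sqlite3
import string
import time

def random_text(length):
    """Return a random lowercase string of the given length."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

# Assumption: in-memory database; the original test used a file on the device.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordMappings (key text PRIMARY KEY, word text)")

# Assumption: 20,000 rows with 10-character keys and 20-character words.
# Collecting them in a dict first guarantees that the keys are unique.
entries = {}
while len(entries) < 20000:
    entries[random_text(10)] = random_text(20)
conn.executemany("INSERT INTO WordMappings VALUES (?, ?)", entries.items())
conn.commit()

def lookup(key):
    """Return the word stored for key, or None if the key is unknown."""
    row = conn.execute(
        "SELECT word FROM WordMappings WHERE key = ? LIMIT 1", (key,)
    ).fetchone()
    return row[0] if row is not None else None

# Time a single lookup of a key that is known to exist.
sample_key = next(iter(entries))
start = time.perf_counter()
word = lookup(sample_key)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{sample_key} -> {word} in {elapsed_ms:.3f} ms")
```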

A primary key is used because

  • It is implicitly unique, which is a property of the abbreviations stored
  • In standard SQL it cannot be NULL, and a row without an abbreviation would mean the data is corrupt. Note that SQLite itself still accepts NULL in a text primary key unless the column is also declared NOT NULL
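
The two properties can be checked with a short sketch like the one below. It adds NOT NULL to the key column, which is not in the table definition above; that is an assumption made because SQLite only rejects NULL in a text primary key when NOT NULL is declared explicitly. The helper try_insert is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: NOT NULL is added here; the original definition omits it.
conn.execute("CREATE TABLE WordMappings (key text PRIMARY KEY NOT NULL, word text)")
conn.execute("INSERT INTO WordMappings VALUES (?, ?)", ("frgm", "fragment"))

def try_insert(key, word):
    """Insert a row and report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO WordMappings VALUES (?, ?)", (key, word))
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

duplicate_result = try_insert("frgm", "figment")    # same key again
null_result = try_insert(None, "nothing")           # NULL key
new_result = try_insert("abbr", "abbreviation")     # fresh key

print(duplicate_result)
print(null_result)
print(new_result)
```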

Other users have suggested using an index. However, indexes are not necessarily unique, and according to the accepted answer to this question, they unnecessarily slow down update/insert/delete performance. Nevertheless, using an index may well improve read performance too, although this has not been tested by the original author.
