简体   繁体   中英

Python text file instead of dictionary

I'm working on a project where I crawl and re-organize huge number of data into a result text file. Previously I used dictionary to store temporary data, but as the data volume increased the process slowed down because of memory usage and dictionary got useless.

Since process speed is not so important in my case, I'm trying to replace dictionary to file but I'm not sure how can I easily move file pointer to appropriate position and read up required data. In dictionary I can easily refer to any data. I would like to achieve the same but in file.

I'm thinking to use mmap and write my own functions to move file pointer where I want. Does Python have a built-in or 3rd party module for such operations?

Any other theoretical approach is welcome.

I think you are now trying to reinvent a key-value database.

Maybe the easiest thing would be to check if sqlite3 module would offer you what you need. Using a readymade database is easier than rolling your own!

Of course, sqlite3 is not a key-value DB (on the surface), so if you need something even simpler, have a look at LMDB and its Python bindings: http://lmdb.readthedocs.org/en/release/

It is as lightweight and fast as it gets. It is probably close to the fastest way to achieve what you want.


It should be noted that there is no such thing as an optimal key-value database. There are several aspects to consider. At least:

  • Do you read a lot or write a lot?
  • What are the key and value sizes?
  • Do you need transactions/crash-proof?
  • Do you have duplicate keys (one key, several values)?
  • Do you want to have sorted keys?
  • Do you want to read the data out in the same order it is inserted?
  • What is your database size (MB, GB, TB, PB)?
  • Are you constrained on IO or CPU?

For example, LMDB I suggested above is very good in read-intensive tasks, not so much in write-intensive tasks. It offers transactions, keeps keys in sorted order and is crash-proof (limited by the underlying file system). However, if you need to write your database often, LMDB may not be the best choice.

On the other hand, SQLite is not the perfect choice to this task - theoretically speking. In practice, it is built in into the standard Python distribution and is thus easy to use. It may provide adequate performance, and it may thus be the best choice.

There are numerous high-quality databases out there. By not mentioning them I do not try to give the impression that the DBs mentioned in this answer are the only good alternatives. Most database managers have a very good reason for their existence. While there are some that are a bit outdated, most have their own sweet spots in the application area.

The field is constantly changing. There are both completely new databases available and old database systems are updated. This should be kept in mind when reading old benchmarks. Also, the type of HW used has its impact; a computer with a SSD disk, a cloud computing instance, and a traditional computer with a HDD behave quite differently performance-wise.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM