
Python losing performance writing to disk over time

I have written some code that takes input from a very large data file, performs some simple processing on it, and then stores it in a shelve dictionary. I have 41 million entries to process. However, after I write about 35 million entries to the shelve dict, performance suddenly drops and eventually grinds to a complete halt. Any idea what I can do to avoid this?

My data is from Twitter, and it maps user screen names to their IDs. Like so:

Jack 12
Mary 13
Bob 15

I need to access each of these by name very quickly. Like: when I look up my_dict['Jack'] it returns 12 (see the sketch below).
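
A minimal sketch of the setup described above, under assumptions: the input file name ("names.txt") and its whitespace-separated "name id" layout are illustrative, not the asker's actual code:

import shelve

# Hypothetical input file: one "screen_name id" pair per line.
db = shelve.open("users_shelf")
with open("names.txt") as f:
    for line in f:
        name, user_id = line.split()
        db[name] = int(user_id)   # e.g. db["Jack"] = 12
db.close()

# Later, look a user up by screen name:
db = shelve.open("users_shelf")
print(db["Jack"])  # 12
db.close()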

Consider using something more low-level. Shelve performance can unfortunately be quite poor. That doesn't explain the slow-down you are seeing, though.

For many disk-based indexes it helps if you can initialize them with an expected size, so they do not need to reorganize themselves on the fly. I've seen this make a huge performance difference for on-disk hash tables in various libraries.

As for your actual goal, have a look at:

http://docs.python.org/library/persistence.html

in particular the gdbm, dbhash, bsddb, dumbdbm and sqlite3 modules.

sqlite3 is probably not the fastest, but it is the easiest one to use. After all, it has a command-line SQL client. bsddb is probably faster, in particular if you tune nelem and similar parameters for your data size. And it does have a lot of language bindings, too; likely even more than sqlite.
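
For illustration, here is a minimal sketch of using sqlite3 as an on-disk key/value store for this mapping; the file and table names are assumptions, and the PRIMARY KEY gives an index for fast lookups by name:

import sqlite3

# Illustrative file and table names.
conn = sqlite3.connect("users.db")
conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY, id INTEGER)")

with conn:  # commits the inserts as one transaction
    conn.executemany(
        "INSERT OR REPLACE INTO users VALUES (?, ?)",
        [("Jack", 12), ("Mary", 13), ("Bob", 15)],
    )

# Lookup by name uses the PRIMARY KEY index.
row = conn.execute("SELECT id FROM users WHERE name = ?", ("Jack",)).fetchone()
print(row[0])  # 12
conn.close()

Batching the inserts into large transactions (as the with-block does here, or an explicit commit every N rows) is what keeps a bulk load of this size fast, since committing each row individually is very slow.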

Try to create your database with an initial size of 41 million entries, so it can optimize for that size!
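
With the (Python 2 era) bsddb module, pre-sizing would look roughly like the sketch below; nelem is the expected number of elements in the hash table, the file name is illustrative, and on newer Pythons the third-party bsddb3 package exposes a similar interface:

import bsddb

# Pre-size the on-disk hash table for ~41 million entries so it does not
# have to reorganize itself on the fly during the bulk load.
db = bsddb.hashopen("users.bdb", "c", nelem=41000000)
db["Jack"] = "12"    # bsddb stores strings for both keys and values
print(db["Jack"])    # '12'
db.close()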
