
Python: How do I write a list to file and then pull it back into memory (dict represented as a string convert to dict) later?

More specific dupe of 875228 (Simple data storing in Python).

I have a rather large dict (6 GB) and I need to do some processing on it. I'm trying out several document clustering methods, so I need to have the whole thing in memory at once. I have other functions to run on this data, but the contents will not change.

Currently, every time I think of new functions I have to write them, and then re-generate the dict. I'm looking for a way to write this dict to a file, so that I can load it into memory instead of recalculating all its values.

To oversimplify things, it looks something like: {((('word','list'),(1,2),(1,3)),(...)):0.0, ....}

I feel that Python must have a better way than me looping through some string, looking for : and ( and trying to parse it into a dictionary.

Why not use Python pickle? Python has a great serializing module called pickle, and it is very easy to use.

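import pickle  # on Python 2: import cPickle as pickle
with open('save.p', 'wb') as f:  # write obj to disk in binary mode
    pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
with open('save.p', 'rb') as f:  # later: load it back into memory
    obj = pickle.load(f)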

There are two disadvantages with pickle:

  • It's not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
  • The format is not human readable.
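For a dict shaped like the one in the question (nested tuples as keys, floats as values), a full round trip might look like the sketch below. The two-entry dict and the temporary file path are stand-ins chosen for illustration, not the asker's real data.

```python
import os
import pickle
import tempfile

# Toy stand-in for the 6 GB dict: nested tuples as keys, floats as values.
data = {
    (("word", "list"), (1, 2), (1, 3)): 0.0,
    (("other", "words"), (4, 5)): 0.5,
}

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "save.p")

# Write once, after the expensive computation.
with open(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# In later sessions, load instead of recomputing.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```

Tuple keys survive unchanged, so nothing needs converting on either side.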

If you are using Python 2.6 or later, there is a built-in module called json. It is as easy to use as pickle:

import json
encoded = json.dumps(obj)
obj = json.loads(encoded)

The JSON format is human readable and is very similar to the dictionary string representation in Python. It doesn't have pickle's security issues, but it might be slower than cPickle.
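One caveat for the dict in the question: JSON object keys must be strings, so json.dumps raises TypeError when the keys are tuples. A possible workaround, sketched here, is to store a list of [key, value] pairs and rebuild the tuples after loading. The helper to_tuple is a name made up for this example, and it assumes the keys contain no genuine lists, since every list coming back from JSON is turned into a tuple.

```python
import json


def to_tuple(x):
    """Recursively turn lists (which JSON hands back) into tuples."""
    if isinstance(x, list):
        return tuple(to_tuple(item) for item in x)
    return x


data = {(("word", "list"), (1, 2), (1, 3)): 0.0}

# Dumping the dict directly fails because of the tuple keys.
try:
    json.dumps(data)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Workaround: serialize (key, value) pairs, then rebuild the dict.
encoded = json.dumps(list(data.items()))
restored = {to_tuple(key): value for key, value in json.loads(encoded)}

print(tuple_keys_ok)     # False
print(restored == data)  # True
```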

I'd use shelve, json, yaml, or whatever, as suggested by other answers.

shelve is especially cool because you can keep the dict on disk and still use it. Values are loaded on demand.
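A minimal sketch of that idea, assuming Python 3.4 or later (where a shelf works as a context manager). shelve keys must be strings, so the tuple keys are stored as their repr here; that encoding is a choice made for this example, not something shelve requires. The file path is a temporary stand-in.

```python
import ast
import os
import shelve
import tempfile

data = {(("word", "list"), (1, 2), (1, 3)): 0.0}

# Hypothetical location; shelve may add its own file extension(s).
path = os.path.join(tempfile.mkdtemp(), "cache")

# Build the shelf once.
with shelve.open(path) as db:
    for key, value in data.items():
        db[repr(key)] = value  # shelf keys must be str

# Later: open the shelf and read only what is needed.
with shelve.open(path) as db:
    wanted = (("word", "list"), (1, 2), (1, 3))
    value = db[repr(wanted)]                  # only this entry is unpickled
    keys = [ast.literal_eval(k) for k in db]  # recover the tuple keys

print(value)  # 0.0
```

Since clustering needs the whole dict in memory at once, on-demand loading may matter less here than for lookups of a few keys.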

But if you really want to parse the text of the dict, and it contains only strings, ints and tuples like you've shown, you can use ast.literal_eval to parse it. It is a lot safer than eval, since you can't evaluate full expressions with it. It only works with strings, numbers, tuples, lists, dicts, booleans, and None:

>>> import ast
>>> print(ast.literal_eval("{12: 'mydict', 14: (1, 2, 3)}"))
{12: 'mydict', 14: (1, 2, 3)}
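Applied to the question, that could mean writing repr(d) to a text file and parsing it back later. The sketch below uses a toy dict and a temporary path as stand-ins, and also shows that anything other than a literal is rejected. For a 6 GB dict this is probably much slower and more memory-hungry than pickle, since the whole text has to be parsed.

```python
import ast
import os
import tempfile

data = {(("word", "list"), (1, 2), (1, 3)): 0.0}

# Hypothetical location for the text dump.
path = os.path.join(tempfile.mkdtemp(), "dict.txt")

# Save the dict as its string representation.
with open(path, "w") as f:
    f.write(repr(data))

# Parse the string back into a dict.
with open(path) as f:
    restored = ast.literal_eval(f.read())

# Anything that is not a plain literal raises ValueError instead of running.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(restored == data)  # True
print(rejected)          # True
```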

I would suggest that you use YAML for your file format, so you can tinker with it on disk.

How does it look:
  - It is indent based
  - It can represent dictionaries and lists
  - It is easy for humans to understand
An example: This block of code is an example of YAML (a dict holding a list and a string)
Full syntax: http://www.yaml.org/refcard.html

To get it in Python, just easy_install pyyaml (or pip install pyyaml). See http://pyyaml.org/

It comes with easy file save/load functions that I can't remember right this minute.

Here are a few alternatives depending on your requirements:

  • numpy stores your plain data in a compact form and performs group/mass operations well

  • shelve is like a large dict backed by a file

  • some 3rd party storage module, e.g. stash, stores arbitrary plain data

  • a proper database, e.g. mongodb for hairy data, or mysql or sqlite for plain data and faster retrieval
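As an illustration of the database option, here is a rough sketch using the standard library's sqlite3 module. The table name scores, its two columns, and the choice to store each tuple key as its repr are all assumptions made for this example. It uses an in-memory database so it runs anywhere; a file path in place of ":memory:" would make it persistent.

```python
import ast
import sqlite3

data = {
    (("word", "list"), (1, 2), (1, 3)): 0.0,
    (("other", "words"), (4, 5)): 0.5,
}

# ":memory:" keeps the example self-contained; use a filename to persist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (key TEXT PRIMARY KEY, value REAL)")

# Store each tuple key as text (its repr) next to its value.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(repr(k), v) for k, v in data.items()],
)
conn.commit()

# Look up a single entry without loading everything.
wanted = (("other", "words"), (4, 5))
row = conn.execute(
    "SELECT value FROM scores WHERE key = ?", (repr(wanted),)
).fetchone()
value = row[0]

# Or rebuild the whole dict in memory.
restored = {
    ast.literal_eval(k): v
    for k, v in conn.execute("SELECT key, value FROM scores")
}
conn.close()

print(value)             # 0.5
print(restored == data)  # True
```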

This solution at SourceForge uses only standard Python modules:

y_serial.py module :: warehouse Python objects with SQLite

"Serialization + persistance :: in a few lines of code, compress and annotate Python objects into SQLite; then later retrieve them chronologically by keywords without any SQL. Most useful "standard" module for a database to store schema-less data."

http://yserial.sourceforge.net

The compression bonus will probably reduce your 6GB dictionary to 1GB. If you do not want to store a series of dictionaries, the module also contains a file.gz solution, which might be more suitable given your dictionary size.
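The sketch below is not y_serial's API; it only shows the general idea of a compressed single-file dump using the standard gzip and pickle modules. The toy dict and the temporary path are stand-ins, and how much a real 6 GB dict shrinks depends entirely on its contents.

```python
import gzip
import os
import pickle
import tempfile

data = {(("word", "list"), (1, 2), (1, 3)): 0.0}

# Hypothetical location for the compressed dump.
path = os.path.join(tempfile.mkdtemp(), "save.p.gz")

# gzip.open returns a file object, so pickle can write straight into it.
with gzip.open(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading decompresses transparently.
with gzip.open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == data)  # True
```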

For Unicode characters use:

import json

data = [{'key': 1, 'text': 'some text'}]
# ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
with open(path_to_file, 'w', encoding='utf8') as f:
    json.dump(data, f, ensure_ascii=False)

with open(path_to_file, encoding='utf8') as f:
    data = json.load(f)

print(data)

[{'key': 1, 'text': 'some text'}]

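Write it out in a serialized format, such as pickle (a Python standard library module for serialization), or perhaps use JSON (a representation that can be loaded back to produce the in-memory representation again).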
