
How to save a dictionary in pickle

I am trying to save a dictionary to a file using pickle. The code that saves the dictionary runs without any problem, but when I try to retrieve the dictionary from the file in the Python shell, I get an EOF error:

>>> import pprint
>>> pkl_file = open('data.pkl', 'rb')
>>> data1 = pickle.load(pkl_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1378, in load
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 880, in load_eof
    raise EOFError
EOFError

My code is below.

It counts the frequency of each word and the date of the data (the date is the file name), then saves the words as keys of a dictionary and the tuple of (freq, date) as the value for each key. Now I want to use this dictionary as the input for another part of my work:

def pathFilesList():
    source='StemmedDataset'
    retList = []
    for r,d,f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

def parsing():
    fileList = pathFilesList()
    for f in fileList:
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        fw=codecs.open(f,'r', encoding='utf-8')
        fLines = fw.readlines()
        for line in fLines:
            sWord = line.strip()
            fileWordList.append(sWord)
            if sWord not in fileWordSet:
                fileWordSet.add(sWord)
        for stemWord in fileWordSet:
            stemFreq = fileWordList.count(stemWord)
            if stemWord not in wordDict:
                wordDict[stemWord] = [(f[15:-4], stemFreq)]
            else:
                wordDict[stemWord].append((f[15:-4], stemFreq))
        fw.close()

if __name__ == "__main__":
    parsing()
    output = open('data.pkl', 'wb')
    pickle.dump(wordDict, output)
    output.close()

What do you think the problem is?

Since this is Python 2, you usually have to be more explicit about how your source code is encoded. The referenced PEP-0263 explains this in detail. My suggestion is that you try adding the following as the first two lines of unpickle.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The rest of your code....

By the way, if you are going to do a lot of work with non-ASCII characters, you would be better off using Python 3 instead.
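For reference, a minimal unpickle.py along these lines might look like the sketch below; it assumes the dictionary was pickled to data.pkl as in the question, and the variable names simply mirror the shell session above:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Minimal sketch of unpickle.py: read back the dictionary written to data.pkl.

import pickle
import pprint

pkl_file = open('data.pkl', 'rb')
data1 = pickle.load(pkl_file)
pkl_file.close()

pprint.pprint(data1)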

# Added some code and comments.  To make the code more complete.
# Using collections.Counter to count words.

import os.path
import codecs
import pickle
from collections import Counter

wordDict = {}

def pathFilesList():
    source='StemmedDataset'
    retList = []
    for r, d, f in os.walk(source):
        for files in f:
            retList.append(os.path.join(r, files))
    return retList

# Starts to parse a corpus, it counts the frequency of each word and
# the date of the data (the date is the file name.) then saves words
# as keys of dictionary and the tuple of (freq,date) as values of each
# key.
def parsing():
    fileList = pathFilesList()
    for f in fileList:
        date_stamp = f[15:-4]
        print "Processing file: " + str(f)
        fileWordList = []
        fileWordSet = set()
        # One word per line, strip space. No empty lines.
        fw = codecs.open(f, mode='r', encoding='utf-8')
        fileWords = Counter(w for w in fw.read().split())
        # For each unique word, count occurrences and store in dict.
        for stemWord, stemFreq in fileWords.items():
            if stemWord not in wordDict:
                wordDict[stemWord] = [(date_stamp, stemFreq)]
            else:
                wordDict[stemWord].append((date_stamp, stemFreq))
        # Close file and do next.
        fw.close()


if __name__ == "__main__":
    # Parse all files and store in wordDict.
    parsing()

    output = open('data.pkl', 'wb')

    # Assume wordDict is global.
    print "Dumping wordDict of size {0}".format(len(wordDict))
    pickle.dump(wordDict, output)

    output.close()

If you are looking for a tool that can save large dictionaries of data to disk or to a database, and can take advantage of pickling and encoding (codecs and hashmaps), then you might want to look at klepto.

klepto provides a dictionary abstraction for writing to a database, including treating your filesystem as a database (i.e. writing the entire dictionary to a single file, or writing each entry to its own file). For large data, I often choose to represent the dictionary as a directory on the filesystem, with each entry as its own file. klepto also offers caching algorithms, so if you use a filesystem backend for the dictionary, you can avoid some speed penalty by using memory caching.

>>> from klepto.archives import dir_archive
>>> d = {'a':1, 'b':2, 'c':map, 'd':None}
>>> # map a dict to a filesystem directory
>>> demo = dir_archive('demo', d, serialized=True) 
>>> demo['a']
1
>>> demo['c']
<built-in function map>
>>> demo          
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> # is set to cache to memory, so use 'dump' to dump to the filesystem 
>>> demo.dump()
>>> del demo
>>> 
>>> demo = dir_archive('demo', {}, serialized=True)
>>> demo
dir_archive('demo', {}, cached=True)
>>> # demo is empty, load from disk
>>> demo.load()
>>> demo
dir_archive('demo', {'a': 1, 'c': <built-in function map>, 'b': 2, 'd': None}, cached=True)
>>> demo['c']
<built-in function map>
>>> 

klepto also has other flags, such as compression and memmode, that can be used to customize how your data is stored (e.g. compression level, memory-map mode, etc.). It is equally easy (exactly the same interface) to use a database (MySQL, etc.) as the backend instead of the filesystem. You can also turn off memory caching, so that every read/write goes directly to the archive, simply by setting cached=False.
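As a rough sketch of the cached=False behaviour described above (flag names other than cached and serialized may vary between klepto versions, so treat this as illustrative rather than the definitive API):

>>> from klepto.archives import dir_archive
>>> # cached=False: no in-memory cache, reads/writes go straight to the 'demo' directory
>>> direct = dir_archive('demo', {}, serialized=True, cached=False)
>>> direct['e'] = 5
>>> direct['e']
5
>>> # flags such as compression (and memmode, where supported) are passed the same way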

klepto provides access to custom encodings by building a custom keymap:

>>> from klepto.keymaps import *
>>> 
>>> s = stringmap(encoding='hex_codec')
>>> x = [1,2,'3',min]
>>> s(x)
'285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c29'
>>> p = picklemap(serializer='dill')
>>> p(x)
'\x80\x02]q\x00(K\x01K\x02U\x013q\x01c__builtin__\nmin\nq\x02e\x85q\x03.'
>>> sp = s+p
>>> sp(x)
'\x80\x02UT28285b312c20322c202733272c203c6275696c742d696e2066756e6374696f6e206d696e3e5d2c292c29q\x00.' 

Get klepto here: https://github.com/uqfoundation
