How do I write a dictionary containing a large amount of data to a CSV file using Python?

Question

I am trying to write a large amount of data from a dict to a csv file, but it stops after writing about a million rows. Here is the code:

import os
from nltk import ngrams

with open('four_grams.csv', 'w') as f:
    for i in os.listdir(r'C:\Users\rocki\Downloads\Compressed\train'):
        if i.endswith('.bytes'):
            with open(i) as file:
                content=file.read()
                new_content = ' '.join([w for w in content.split() if len(w)<3])
                four_grams=ngrams(new_content.split(), 4)
                grams_dict={}
                for grams in four_grams:
                    gram=' '.join(grams)
                    if gram not in grams_dict:
                        grams_dict[gram]=1
                    else:
                        grams_dict[gram]=grams_dict[gram]+1
                    for key in grams_dict.keys():
                        f.write("%s,%s\n"%(key,grams_dict[key]))

Any suggestions on how to achieve this?

Answer 1

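I think you will want to use Pandas to write the csv. This code assumes that every grams_dict has the same structure. I have not yet seen pandas choke on a large csv write. Hopefully it works well for you!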

import os

import pandas as pd
from nltk import ngrams

train_dir = r'C:\Users\rocki\Downloads\Compressed\train'
saved_dfs = [] # Create an empty list where we will save each new dataframe (grams_dict) created.

for i in os.listdir(train_dir):
    if i.endswith('.bytes'):
        # os.listdir returns bare file names, so join them with the directory
        with open(os.path.join(train_dir, i)) as file:
            content=file.read()
        new_content = ' '.join([w for w in content.split() if len(w)<3])
        four_grams=ngrams(new_content.split(), 4)
        grams_dict={}
        for grams in four_grams:
            gram=' '.join(grams)
            if gram not in grams_dict:
                grams_dict[gram]=1
            else:
                grams_dict[gram]=grams_dict[gram]+1
        # A dict of scalar counts cannot be passed to DataFrame directly,
        # so build one row per gram with explicit column names
        df = pd.DataFrame(list(grams_dict.items()), columns=['gram', 'count'])
        saved_dfs.append(df)

final_grams_dict = pd.concat(saved_dfs, ignore_index=True) # Combine all of the saved grams_dict's into one DataFrame Object

final_grams_dict.to_csv('path.csv', index=False)

Good luck!

Answer 2

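Are you sure you know where the code gets stuck (or whether it is the file viewer)? You are talking about millions of lines, and your code may well be choking on the list produced by .split(). Lists get very slow as they grow. Without any hint of your actual data, there is no way to know.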

Anyway, here is a version that limits the size of the list. To make it a runnable example, your actual io is replaced with some fake lines.

import os
from io import StringIO
from itertools import islice
from collections import defaultdict

from nltk import ngrams

string_file = """
1 2 3 a b c ab cd ef
4 5 6 g h i gh ij kl
abcde fghijkl
"""

read_lines = 2 # choose something that does not make too long lists for .split()
csvf = StringIO()
#with open('four_grams.csv', 'w') as csvf:
if True: # just for indention from with...
#    for i in os.listdir(r'C:\Users\rocki\Downloads\Compressed\train'):
    for i in range(1): # for the indention
#        if i.endswith('.bytes'):
#            with open(i) as bfile:
                bfile = StringIO(string_file)
                memory_line = ''
                grams_dict = defaultdict(int)
                while True:
                    # islice reads at most read_lines lines per chunk;
                    # readlines(n) would treat n as a size hint, not a line count
                    tmp = list(islice(bfile, read_lines))
                    if not tmp:
                        break
                    # prepend the last line of the previous chunk so that
                    # grams spanning the chunk boundary are not lost
                    content = ' '.join([memory_line] + tmp)
                    memory_line = tmp[-1]
                    new_content = ' '.join([w for w in content.split() if len(w)<3])
                    four_grams = ngrams(new_content.split(), 4)
                    for grams in four_grams:
                        #print(grams, len(grams_dict))
                        gram=' '.join(grams)
                        grams_dict[gram] += 1
                for k, v in grams_dict.items():
                    # assuming that it's enough to write the dict
                    # when it's filled rather than duplicating info
                    # in the resulting csv
                    csvf.write("%s\t%s\n"%(k, v))
                csvf.flush() # writes buffer if anything there
#print(grams_dict)

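If it really is your dict that is too large, you should split it up as well. One way to do that is to make a 2-level dict and use string.ascii_letters as the first-level keys; at the second level you keep a grams_dict that holds only the keys starting with the corresponding single character.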
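The two-level dict described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `shard_key` and `count_sharded` and the catch-all bucket `'_other'` (for grams that start with a digit or other non-letter) are made up here and are not part of the original answer.

```python
import string
from collections import defaultdict

def shard_key(gram):
    """Return the first-level key for a gram: its first character if that
    is an ASCII letter, otherwise the hypothetical catch-all '_other'."""
    first = gram[:1]
    if first and first in string.ascii_letters:
        return first
    return '_other'

def count_sharded(grams):
    """Count grams into a 2-level dict: first character -> gram -> count."""
    shards = defaultdict(lambda: defaultdict(int))
    for gram in grams:
        shards[shard_key(gram)][gram] += 1
    return shards

# Small made-up input standing in for the real 4-grams
shards = count_sharded(['ab cd ef gh', 'ab cd ef gh', 'zz 1 2 3', '1 2 3 a'])
for key in sorted(shards):
    print(key, dict(shards[key]))
```

Each inner dict can then be written out and discarded on its own, which keeps any single dict smaller than one combined grams_dict would be.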

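Finally, the use of memory_line could be skipped. While it is there, it double-counts whatever it holds, but if your read_lines is a fairly large number I would not bother about that.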

Answer 3

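It turned out that the program was not failing to write; Excel was simply unable to load such a huge amount of data completely. I used a delimiter-based test to check that the data had been written exactly as required.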
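One way to check the file without opening it in Excel is to count its rows from Python. The sketch below is an assumption about how such a check might look, not the check the poster actually ran; `count_csv_rows` and the demo file are made up for illustration. The row limit of 1,048,576 is the worksheet limit of modern Excel, which fits the "about a million rows" described in the question.

```python
import csv
import os
import tempfile

def count_csv_rows(path):
    """Count rows by streaming the file, so it is never loaded all at once."""
    with open(path, newline='') as f:
        return sum(1 for _ in csv.reader(f))

# Demo on a small throwaway file standing in for four_grams.csv
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(demo_path, 'w', newline='') as f:
    csv.writer(f).writerows([('a b c d', 3), ('b c d e', 1)])

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in modern Excel (.xlsx)
row_count = count_csv_rows(demo_path)
print(row_count, 'rows; more than Excel can show:', row_count > EXCEL_MAX_ROWS)
```

If the count is larger than the limit, the rows beyond it exist in the file even though Excel does not display them.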

Answer 4

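It looks like you are writing one line at a time. That might cause I/O problems.

Try writing several lines at a time instead of one line at a time. Try writing 2 lines each time, and add one more line if it stops.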
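Writing several rows per call might look like the sketch below. It is a minimal illustration, assuming the rows are (gram, count) pairs; `write_in_batches` and `batch_size` are hypothetical names, and the in-memory buffer stands in for the real output file. Note that Python file objects already buffer writes by default, so the gain over line-by-line writes may be small.

```python
import csv
import io
from itertools import islice

def write_in_batches(rows, out, batch_size=10_000):
    """Write rows to `out` in groups of `batch_size`; return the row count."""
    writer = csv.writer(out)
    it = iter(rows)
    written = 0
    while True:
        batch = list(islice(it, batch_size))  # at most batch_size rows
        if not batch:
            break
        writer.writerows(batch)
        written += len(batch)
    return written

# Made-up counts standing in for the real grams_dict
grams_dict = {'a b c d': 3, 'b c d e': 1, 'c d e f': 2}
buf = io.StringIO()  # stands in for open('four_grams.csv', 'w', newline='')
n = write_in_batches(grams_dict.items(), buf, batch_size=2)
print(n)
```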

4 answers

Answer 1: 0, 2019-01-11 22:05:10
Answer 2: 0, 2019-01-12 01:47:19
Answer 3: 0, accepted, 2019-01-12 07:29:51
Answer 4: -1, 2019-01-11 18:24:48
