Writing to files with dynamic file names?

Question

I need to split a large txt file (about 100 GB, 1 billion lines) by DATE. The file looks like this

ID*DATE*company
1111*201101*geico
1234*201402*travelers
3214*201003*statefarm
...

There are basically 60 months, so I should end up with 60 sub-files. My Python script is

with open("myBigFile.txt") as f:
    for line in f: 
        claim = line.split("*")
        with open("DATE-"+str(claim[1])+".txt", "a") as fy:
            fy.write(claim[0]+"*"+claim[2]+"\n")

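Right now it runs far too slowly given the huge number of records, because it has to open and close a file for every line. So I'm thinking of opening the 60 sub-files first, then scanning the file and writing each line to the corresponding sub-file. The sub-files would not be closed until all lines have been scanned. However, since Python automatically closes a file once its reference is dropped ( http://blog.lerner.co.il/dont-use-python-close-files-answer-depends/ ), I would have to use some kind of dynamic file name, something like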

claim[1].write(claim[0]+"*"+claim[2]+"\n")

Note that you can't just name the file fy and call fy.write(claim[0]+"*"+claim[2]+"\n"), because the file is closed as soon as fy is changed. Is this doable in Python? Thanks!

Answer 1

How about something like this:

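subfiles = {}
with open("myBigFile.txt") as f:
    for line in f:
        # Strip the newline so it is not written twice.
        claim = line.rstrip("\n").split("*")
        date = claim[1]
        if date not in subfiles:
            # Open each output file once and keep the handle in the dict.
            subfiles[date] = open("DATE-" + date + ".txt", "a")
        subfiles[date].write(claim[0] + "*" + claim[2] + "\n")
# Close every sub-file once the whole input has been scanned.
for subfile in subfiles.values():
    subfile.close()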

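I believe this should do it.

Just to mention, I'm not currently limiting the number of files open at any given moment. To implement that, just check the size of the dictionary with len() and close all or some of the files.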
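The cap on open files mentioned above might look like the following sketch. The class name CappedWriters, the max_open parameter and the least-recently-used eviction policy are assumptions made for illustration, not part of the answer. A file that has been evicted is simply reopened in append mode the next time it is needed.

```python
from collections import OrderedDict


class CappedWriters:
    """Keep at most max_open append-mode handles (hypothetical helper)."""

    def __init__(self, max_open=50):
        self.max_open = max_open
        self.handles = OrderedDict()  # path -> open file, oldest use first

    def write(self, path, text):
        handle = self.handles.get(path)
        if handle is None:
            if len(self.handles) >= self.max_open:
                # Too many files open: close the least recently used one.
                _, oldest = self.handles.popitem(last=False)
                oldest.close()
            # Append mode, so a reopened file keeps its earlier contents.
            handle = open(path, "a")
            self.handles[path] = handle
        else:
            # Mark this handle as the most recently used.
            self.handles.move_to_end(path)
        handle.write(text)

    def close(self):
        for handle in self.handles.values():
            handle.close()
        self.handles.clear()
```

With only about 60 distinct dates the cap will probably never be hit under typical operating system limits, so this matters mainly if the number of distinct keys grows much larger.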

Answer 2

You can simplify things a bit with the csv module, and use a dictionary to store the file objects:

import csv

with open("myBigFile.txt") as big_file:
    reader = csv.reader(big_file, delimiter='*')
    subfiles = {}

    for id, date, company in reader:
        try:
            subfile = subfiles[date]
        except KeyError:
            subfile = open('DATE-{}.txt'.format(date), 'a')
            subfiles[date] = subfile

        subfile.write('{}*{}\n'.format(id, company))

    for subfile in subfiles.values():
        subfile.close()

Answer 3

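Here is a solution that closes the file handles as part of a context manager. Unlike the other answers, it also closes the sub-files when an error occurs :-)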

from contextlib import contextmanager

@contextmanager
def file_writer():
    fp = {}  # maps each date to its open sub-file

    def write(line):
        # Strip the trailing newline so it is not written twice.
        id, date, company = line.rstrip('\n').split('*')
        outdata = "{}*{}\n".format(id, company)
        try:
            fp[date].write(outdata)
        except KeyError:
            # First line for this date: open its sub-file and keep the handle.
            fname = 'DATE-{}.txt'.format(date)
            fp[date] = open(fname, 'a')    # should it be a+?
            fp[date].write(outdata)

    try:
        yield write
    finally:
        # Runs even if the with-body raises, so the sub-files always get closed.
        for f in fp.values():
            f.close()


def process():
    with open("myBigFile.txt") as f:
        with file_writer() as write:
            for i, line in enumerate(f):
                try:
                    write(line)
                except Exception:
                    print('the error happened on line %d [%s]' % (i, line))

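I don't know whether there is anything else that would make this faster on a single processor/disk. You could always split the file into n chunks and use n processes to handle one chunk each (where n is the number of independent disks you have available).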
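One way to cut the file into n chunks without breaking a line in half is to pick byte offsets and move each one forward to the next line start. The sketch below is only an illustration of that idea: the function names chunk_offsets and lines_in_range are made up here, and it assumes the file uses "\n" line endings and an encoding that decode() can handle with its default (UTF-8).

```python
import os


def chunk_offsets(path, n):
    """Return n (start, end) byte ranges, each beginning at a line start."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, n):
            f.seek(size * i // n)
            f.readline()  # move forward to the start of the next line
            starts.append(min(f.tell(), size))
    ends = starts[1:] + [size]
    return list(zip(starts, ends))


def lines_in_range(path, start, end):
    """Yield the decoded lines whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            yield f.readline().decode()
```

Each worker process would then call lines_in_range with its own (start, end) pair. Some ranges can come out empty on small files, and if several workers append to the same DATE file their output may interleave, so having each worker write to its own set of output files is probably the safer choice.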
