A fast way in Python to split a large text file, using the number of lines as an input variable

Question

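I am splitting a text file using the number of lines as a variable. I wrote this function to save the split files in a temporary directory. Each file has 4 million lines, except the last one.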

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

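The main problem is the speed of this function. Splitting one file of 8 million lines into two files of 4 million lines takes more than 30 minutes on my Windows OS with Python 2.7.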

Answer 1

            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)

This opens the file and writes a single line, once for every line in the group. That is slow.

Instead, write once per group.

            with open(output_name, 'a') as outfile:
                outfile.write(''.join(group))
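
For reference, here is a sketch of the whole function with that change applied. It is written for Python 3, which is an assumption (the question uses Python 2.7), and the name split_by_lines is made up for this example. It uses writelines(group) rather than ''.join(group), so a 4-million-line chunk does not have to be assembled into one string first.

```python
import os
import tempfile
from itertools import count, groupby


def split_by_lines(filename, temp_dir, chunk=4000000):
    """Split filename into files of at most `chunk` lines; return their paths."""
    counter = count()
    paths = []
    with open(filename, "r") as datafile:
        # Consecutive lines share a key until `chunk` lines have been seen.
        groups = groupby(datafile, key=lambda _line: next(counter) // chunk)
        for k, group in groups:
            path = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # Open each output file once per group, not once per line.
            with open(path, "w") as outfile:
                outfile.writelines(group)
            paths.append(path)
    return paths
```

Calling split_by_lines(filename, tempfile.mkdtemp()) should produce tempfile_0.tmp, tempfile_1.tmp and so on in the new directory.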

Answer 2

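Just did a quick test with an 8-million-line file (uptime lines) to run the length of the file and split it in half. Basically, one pass to get the line count, and a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds in total.

I did not implement the lambda function from the OP's post. The code used is below: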

#!/usr/bin/env python

import sys
import math

infile = open("input","r")

linecount=0

for line in infile:
    linecount=linecount+1

splitpoint=linecount/2

infile.close()

infile = open("input","r")
outfile1 = open("output1","w")
outfile2 = open("output2","w")

print linecount , splitpoint

linecount=0

for line in infile:
    linecount=linecount+1
    if ( linecount <= splitpoint ):
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()

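No, it's not going to win any performance or code elegance tests. :) But short of something else being the performance bottleneck, the lambda function causing the file to be cached in memory and forcing swapping, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read and split an 8-million-line file.

EDIT:

My environment: Mac OS X, with storage on a single hard drive attached over FW800. The file was freshly created to avoid any benefit from filesystem caching.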
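
The two-pass approach above fixes the number of output files at two. When the chunk size is known up front, as in the question, the counting pass can probably be dropped. The following is a minimal Python 3 sketch of that idea. The function name split_every and the output naming scheme are assumptions made for illustration, and this is not the code that was timed above.

```python
from itertools import islice


def split_every(in_path, out_prefix, chunk=4000000):
    """Write consecutive runs of `chunk` lines to out_prefix0, out_prefix1, ..."""
    out_paths = []
    with open(in_path, "r") as infile:
        while True:
            # Read one line first so no empty file is created at end of input.
            first = next(infile, None)
            if first is None:
                break
            out_path = "%s%d" % (out_prefix, len(out_paths))
            with open(out_path, "w") as outfile:
                outfile.write(first)
                # islice stops after chunk - 1 more lines or at end of file.
                outfile.writelines(islice(infile, chunk - 1))
            out_paths.append(out_path)
    return out_paths
```

With the file names used above, split_every("input", "output") would write output0, output1 and so on, each opened exactly once.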

Answer 3

You can use tempfile.NamedTemporaryFile directly in a context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns={}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False,
                           dir=temp_dir,prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k]=outfile.name   
    return fns                     

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))

    return fn.name        

fn=make_test()
t0=time.time()
print tempfile_split(fn,tempfile.mkdtemp()),time.time()-t0

On my computer, the tempfile_split part runs in 3.6 seconds. It is OS X.
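
The code above is Python 2 (print statement, xrange). Under Python 3, NamedTemporaryFile opens in binary mode ('w+b') by default, so writing str lines needs mode='w'. Below is a sketch of the same splitter under that assumption. The name tempfile_split_py3 is chosen here for illustration, and no timing is claimed for it.

```python
import tempfile
from itertools import count, groupby


def tempfile_split_py3(filename, temp_dir, chunk=4 * 10**6):
    """Return {chunk index: path} for the temporary files written to temp_dir."""
    fns = {}
    counter = count()
    with open(filename, "r") as datafile:
        groups = groupby(datafile, key=lambda _line: next(counter) // chunk)
        for k, group in groups:
            # mode="w" gives a text-mode file; the Python 3 default is "w+b".
            with tempfile.NamedTemporaryFile(
                mode="w", delete=False, dir=temp_dir, prefix="{}_".format(k)
            ) as outfile:
                outfile.writelines(group)
                fns[k] = outfile.name
    return fns
```

Because delete=False is kept, the caller is still responsible for removing the files when done.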

Answer 4

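If you are in a Linux or Unix environment, you can cheat a little and use the split command from inside Python. It does the trick for me, and it is very fast too: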

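import os
import subprocess

def split_file(file_path, chunk=4000):
    # Run the system `split` tool; pieces land in the input file's directory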
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
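
Note that the function above removes the original file and returns True even if split fails. A more defensive variant might look like the sketch below, which assumes Python 3.5+ (for subprocess.run) and a POSIX split on the PATH. The name split_file_checked, the ".part_" prefix and the remove_original parameter are made up for this example. With -a 2 the suffixes run from aa to zz, so at most 676 pieces can be named.

```python
import glob
import os
import subprocess


def split_file_checked(file_path, chunk=4000, remove_original=False):
    """Split file_path with the POSIX `split` utility; return the piece paths."""
    # Pieces are named <file_path>.part_aa, <file_path>.part_ab, ...
    prefix = file_path + ".part_"
    # check=True raises CalledProcessError if split exits with a non-zero status.
    subprocess.run(
        ["split", "-a", "2", "-l", str(chunk), file_path, prefix],
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if remove_original:
        # Reached only when split succeeded.
        os.remove(file_path)
    return sorted(glob.glob(glob.escape(prefix) + "*"))
```

Returning the list of pieces, rather than True, lets the caller process or clean up the files without guessing their names.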


Solution 1
6 accepted 2013-03-26 19:53:02

Solution 2
1 2013-03-26 20:18:48

Solution 3
1 2013-03-26 20:24:06

Solution 4
0 2018-02-02 07:44:31
