
Fast method in Python to split a large text file using number of lines as input variable

I am splitting a text file using the number of lines as a variable. I wrote this function in order to save the split files in a temporary directory. Each file has 4 million lines, except the last one.

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # the shared count() numbers the lines, so // chunk groups them in blocks of `chunk` lines
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
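The key keeps one shared count() iterator as a default argument, so each incoming line gets a running index, and integer division by chunk puts consecutive blocks of chunk lines into the same group. A minimal sketch of that grouping, with a hypothetical chunk size of 3:

from itertools import groupby, count

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n', 'g\n']
chunk = 3  # hypothetical tiny chunk size, for illustration only
groups = groupby(lines, key=lambda k, line=count(): next(line) // chunk)
for k, group in groups:
    print k, list(group)
# 0 ['a\n', 'b\n', 'c\n']
# 1 ['d\n', 'e\n', 'f\n']
# 2 ['g\n']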

The main problem is the speed of this function. To split one file of 8 million lines into two files of 4 million lines each takes more than 30 minutes on my Windows OS with Python 2.7.

for line in group:
    with open(output_name, 'a') as outfile:
        outfile.write(line)

is opening the file and writing one line, for each line in the group. This is slow.

Instead, write once per group.

with open(output_name, 'a') as outfile:
    outfile.write(''.join(group))
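Applied to the original function, that change looks roughly like this; a sketch that keeps the OP's groupby key and only hoists the open/write out of the per-line loop:

import os
import tempfile
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            output_name = os.path.join(temp_dir, "tempfile_%s.tmp" % k)
            # one open and one write per chunk instead of one per line
            with open(output_name, 'w') as outfile:
                outfile.write(''.join(group))

Note that ''.join(group) materialises an entire chunk (about 4 million lines here) in memory before writing; outfile.writelines(group) streams the lines through the buffered file object instead, if that memory cost matters.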

Just did a quick test with an 8 million line file (uptime lines) to run the length of the file and split the file in half. Basically, one pass to get the line count, and a second pass to do the split write.

On my system, the first pass took about 2-3 seconds. Completing the run and writing the split files took under 21 seconds in total.

Did not implement the lambda functions in the OP's post. Code used below:

#!/usr/bin/env python

# First pass: count the lines to find the midpoint.
infile = open("input", "r")

linecount = 0

for line in infile:
    linecount = linecount + 1

splitpoint = linecount / 2

infile.close()

# Second pass: write the first half to output1 and the rest to output2.
infile = open("input", "r")
outfile1 = open("output1", "w")
outfile2 = open("output2", "w")

print linecount, splitpoint

linecount = 0

for line in infile:
    linecount = linecount + 1
    if linecount <= splitpoint:
        outfile1.write(line)
    else:
        outfile2.write(line)

infile.close()
outfile1.close()
outfile2.close()

No, it's not going to win any performance or code elegance tests. :) But short of something else being the performance bottleneck, the lambda functions causing the file to be cached in memory and forcing swapping, or the lines in the file being extremely long, I don't see why it would take 30 minutes to read and split the 8 million line file.

EDIT:

My environment: Mac OS X, storage was a single FW800-connected hard drive. The file was created fresh to avoid filesystem caching benefits.

You can use tempfile.NamedTemporaryFile directly in the context manager:

import tempfile
import time
from itertools import groupby, count

def tempfile_split(filename, temp_dir, chunk=4*10**6):
    fns = {}
    with open(filename, 'r') as datafile:
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            with tempfile.NamedTemporaryFile(delete=False, dir=temp_dir,
                                             prefix='{}_'.format(str(k))) as outfile:
                outfile.write(''.join(group))
                fns[k] = outfile.name
    return fns

def make_test(size=8*10**6+1000):
    with tempfile.NamedTemporaryFile(delete=False) as fn:
        for i in xrange(size):
            fn.write('Line {}\n'.format(i))
    return fn.name

fn = make_test()
t0 = time.time()
print tempfile_split(fn, tempfile.mkdtemp()), time.time() - t0

On my computer, the tempfile_split part runs in 3.6 seconds. This is on OS X.

If you're in a Linux or Unix environment you could cheat a little and use the split command from inside Python. It does the trick for me, and is very fast too:

import os
import subprocess

def split_file(file_path, chunk=4000):
    # split writes chunk-sized pieces (named aa, ab, ...) into the input file's directory
    p = subprocess.Popen(['split', '-a', '2', '-l', str(chunk), file_path,
                          os.path.dirname(file_path) + '/'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    # Remove the original file if required
    try:
        os.remove(file_path)
    except OSError:
        pass
    return True
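A hypothetical usage sketch, assuming an input file at /tmp/work/bigfile.txt: with -a 2 and the directory prefix above, split names the pieces aa, ab, ..., so they can be collected with a glob afterwards:

import glob

split_file('/tmp/work/bigfile.txt', chunk=4000)   # hypothetical path, for illustration only
# collect the two-letter chunk names produced by `split -a 2`
chunks = sorted(glob.glob('/tmp/work/??'))
print chunks   # e.g. ['/tmp/work/aa', '/tmp/work/ab', ...]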
