What's the quickest way to merge files, and the quickest way to split an array?

Question

What's the quickest way to take a list of files and the name of an output file and merge them into a single file while removing duplicate lines? Something like

cat file1 file2 file3 | sort -u > out.file

in Python.

I'd prefer not to use system calls.

And:

What's the quickest way to split a list in Python into X chunks (a list of lists) that are as equal as possible, given a list and X?

Answer 1

First:

lines = set()
for filename in filenames:
    with open(filename) as inF:
        lines.update(inF)
with open(outfile, 'w') as outF:
    outF.write(''.join(lines))
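
One caveat with the snippet above: if a file's last line has no trailing newline, that line is treated as different from its newline-terminated twin and gets glued to the next line on output. The result is also unordered. A minimal sketch that comes closer to `sort -u`, assuming text files that fit in memory (the name `merge_unique` is made up for illustration, and Python sorts by code point rather than by locale as `sort` may):

```python
def merge_unique(filenames, outfile):
    """Merge the given text files into outfile, dropping duplicate lines."""
    lines = set()
    for filename in filenames:
        with open(filename) as in_f:
            # Strip the line ending so "abc" and "abc\n" count as the same line.
            lines.update(line.rstrip("\n") for line in in_f)
    with open(outfile, "w") as out_f:
        # sorted() mimics the ordering side effect of `sort -u`.
        out_f.writelines(line + "\n" for line in sorted(lines))
```

Usage would look like `merge_unique(["file1", "file2", "file3"], "out.file")`.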

Second:

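def chunk(bigList, x):
    # Integer division; max() avoids a zero step when len(bigList) < x.
    chunklen = max(1, len(bigList) // x)
    # Note: yields more than x chunks when len(bigList) is not divisible by x.
    for i in range(0, len(bigList), chunklen):
        yield bigList[i:i+chunklen]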

listOfLists = list(chunk(bigList, x))
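
Because the generator above slices in fixed-size steps, a list whose length is not divisible by x comes back as more than x chunks (10 items with x = 3 gives four chunks of sizes 3, 3, 3, 1). If exactly x chunks "as equal as possible" are needed, one possible approach is to spread the remainder over the first chunks. This is a sketch, not the answer's original method, and `split_evenly` is a hypothetical name:

```python
def split_evenly(items, x):
    """Split items into exactly x contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), x)
    chunks = []
    start = 0
    for i in range(x):
        # The first `extra` chunks each take one leftover element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


print(split_evenly(list(range(10)), 3))
# prints [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

When the list is shorter than x, the trailing chunks are simply empty lists.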

Answer 2

For the first:

lines = []
for filename in filenames:
    with open(filename) as f:
        lines.extend(f.read().splitlines())
lines = list(set(lines))  # remove duplicates
with open(outfile_name, 'w') as f:
    # splitlines() dropped the newlines, so put them back when writing
    f.write('\n'.join(lines) + '\n')

This assumes the files are of a reasonable length, since all the data from the files is held in memory at once. If you want to preserve sort's side effect of ordering the lines, just add lines.sort() before the file is written.
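
If the original line order matters more than sorted order, a variant can write each line the first time it is seen. That keeps only the set of distinct lines in memory rather than every line read. A rough sketch, assuming plain text files; `merge_keep_order` is a hypothetical name:

```python
def merge_keep_order(filenames, outfile_name):
    """Write each distinct line once, in order of first appearance."""
    seen = set()
    with open(outfile_name, "w") as out_f:
        for filename in filenames:
            with open(filename) as in_f:
                for line in in_f:
                    # Normalize the line ending so a missing final newline
                    # does not create a spurious "new" line.
                    line = line.rstrip("\n")
                    if line not in seen:
                        seen.add(line)
                        out_f.write(line + "\n")
```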

And the second:

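# Integer division; max() avoids a zero step when the list is shorter than num_chunks.
# Note: gives more than num_chunks pieces when the length is not evenly divisible.
step_size = max(1, len(orig_list) // num_chunks)
split_list = [orig_list[i:i+step_size] for i in range(0, len(orig_list), step_size)]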
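
If the chunks do not have to be contiguous, extended slicing is one way to get exactly num_chunks pieces whose lengths differ by at most one. A small sketch reusing the names above, with sample values assumed for illustration:

```python
orig_list = list(range(10))  # sample data, assumed for illustration
num_chunks = 3

# Chunk i takes every num_chunks-th element starting at index i.
split_list = [orig_list[i::num_chunks] for i in range(num_chunks)]
print(split_list)
# prints [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

The trade-off is that neighbouring elements end up in different chunks, so this only fits cases where element order within a chunk is not important.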

2 answers

Answer 1
2 2010-10-08 19:39:51

Answer 2
-1 2010-10-08 19:33:32
