Group lines of a text file in files per category - Most efficient way

I have a huge text file (large enough to fill my computer's memory) that I need to split into smaller files.

The file contains CSV lines, where the first value of each line is an ID:

ID1, val11, val12, val13
ID2, val21, val22, val23
ID1, val31, val32, val33
ID3, val41, val42, val43

I want to read each line, or groups of lines, from the source file and create smaller files that group the lines per ID:

File1:
    val11, val12, val13
    val31, val32, val33
File2:
    val21, val22, val23
File3:
    val41, val42, val43

So far I can do it with the following code, but it is taking really long (I don't have 10 days to do this).

def groupIDs(fileName,folder):
    loadedFile = open(fileName, 'r')
    firstLine = loadedFile.readline() #skip titles

    folder += "/"

    count = 0

    for line in loadedFile:

        elems = line.split(',')
        id = elems[0]

        rest = ""
        for elem in elems[1:]:
            rest+=elem + ","

        # reopen the matching output file and append one line at a time
        with open(folder+id,'a') as f:
            f.write(rest[:-1])

        #printing progress
        count+=1
        if count % 50000 == 0:
            print(count)

    loadedFile.close()

The bottleneck seems to be hard-disk performance, as shown by the resource monitor (CPU usage is below 20%, memory is barely touched).

How can I improve this for best performance?

You could keep the data in RAM and only flush it out every few thousand lines, or whenever RAM usage reaches a threshold you choose.

You should also use context managers with files and use the os.path or pathlib modules from the std lib instead of manually using strings as paths.
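For example, here is a minimal sketch of the same save step written with pathlib; save_results_pathlib is a hypothetical variant of the save_results function used in the code below, not part of the original answer:

from pathlib import Path


def save_results_pathlib(categories, outputdir):
    # hypothetical variant of save_results below, using pathlib instead of string paths
    outputdir = Path(outputdir)
    outputdir.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
    for category, data in categories.items():
        target = outputdir / (category.decode() + '.csv')  # builds the path without string concatenation
        with target.open('ab') as f:
            f.write(data)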

Here is a solution that saves every 10000 lines; adjust it to fit your problem:

import os
from glob import iglob
from collections import defaultdict


def split_files_into_categories(inputfiles, outputdir):
    count = 0
    categories = defaultdict(bytearray)

    for inputfile in inputfiles:
        with open(inputfile, 'rb') as f:
            next(f)  # skip the header line

            for line in f:

                if count % 10000 == 0:
                    # flush what has accumulated so far and free the memory
                    save_results(categories, outputdir)
                    categories.clear()

                category, _, rest = line.partition(b',')

                categories[category] += rest
                count += 1

    save_results(categories, outputdir)


def save_results(categories, outputdir):
    for category, data in categories.items():
        # open in append mode so data from earlier flushes is preserved
        with open(os.path.join(outputdir, category.decode() + '.csv'), 'ab') as f:
            f.write(data)


if __name__ == '__main__':
    # run on all csvs in the data folder
    split_files_into_categories(iglob('data/*.csv'), 'by_category')

A few explanations:

  • I open the files in binary mode and use a bytearray; this prevents copying the data. In Python, strings are immutable, so += creates a new string and reassigns it, whereas a bytearray is extended in place.

  • defaultdict(bytearray) will create an empty bytearray for every new category the first time that category is accessed.

  • You could replace the if count % 10000 == 0 check with a check of the process's memory consumption, like this (a fuller sketch of how this fits into the loop follows after this list):

    import os
    import psutil

    process = psutil.Process(os.getpid())

    and then check

    # save results if the process uses more than 1 GB of RAM
    if process.memory_info().rss > 1e9:
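Putting the pieces together, here is a minimal sketch of the memory-based flush wired into the loop above. It reuses the save_results function and the other names from the answer's code; the function name split_files_by_memory and the default 1 GB threshold are only illustrative assumptions:

import os
from collections import defaultdict

import psutil


def split_files_by_memory(inputfiles, outputdir, max_rss=1e9):
    # flushes to disk whenever the process's resident memory exceeds max_rss bytes
    process = psutil.Process(os.getpid())
    categories = defaultdict(bytearray)

    for inputfile in inputfiles:
        with open(inputfile, 'rb') as f:
            next(f)  # skip the header line

            for line in f:
                # save results if the process uses more than max_rss bytes of RAM
                if process.memory_info().rss > max_rss:
                    save_results(categories, outputdir)
                    categories.clear()

                category, _, rest = line.partition(b',')
                categories[category] += rest

    save_results(categories, outputdir)

If the per-line rss call ever shows up in profiling, it can be combined with the original modulo counter so the memory check only runs every few thousand lines.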
