Reduce runtime, file reading, string manipulation of every line and file writing

I'm writing a script that reads all the lines from multiple files, reads the number at the beginning of each block, and puts that number in front of every line of the block until the next number, and so on. Afterwards it writes all the read lines into a single .csv file.

The files I am reading look like this:

13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25

And the output file should look like this:

13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25

Currently my script is this:

from asyncio import Semaphore, ensure_future, gather, run
import time

limit = 8

async def read(file_list):
    tasks = list()
    result = None

    sem = Semaphore(limit)

    for file in file_list:
        task = ensure_future(read_bounded(file,sem))
        tasks.append(task)

        result = await gather(*tasks)

    return result

async def read_bounded(file,sem):
    async with sem:
        return await read_one(file)

async def read_one(filename):
    result = list()
    with open(filename) as file:
        dataList=[]
        content = file.read().split(":")
        file.close()
        j=1
        filmid=content[0]
        append=result.append
        while j<len(content):
            for entry in content[j].split("\n"):
                if len(entry)>10:
                    append("%s%s%s%s" % (filmid,",",entry,"\n"))
                else:
                    if len(entry)>0:
                        filmid=entry
            j+=1
    return result

if __name__ == '__main__':
    start=time.time()
    write_append="w"
    files = ['combined_data_1.txt', 'combined_data_2.txt', 'combined_data_3.txt', 'combined_data_4.txt']

    res = run(read(files))

    with open("output.csv",write_append) as outputFile:
        for result in res:
            outputFile.write(''.join(result))
            outputFile.flush()
    outputFile.close()
    end=time.time()
    print(end-start)

It has a runtime of about 135 seconds (the 4 files that are read are 500 MB each and the output file is 2.3 GB). Running the script takes about 10 GB of RAM, which I think might be a problem. I think the biggest amount of time is needed to create the list of all lines. I would like to reduce the runtime of this program, but I am new to Python and not sure how to do this. Can you give me some advice?

Thanks

Edit:

I measured the times for the following commands in cmd (I only have Windows installed on my computer, so I used what I hope are equivalent cmd commands):

sequential writing to NUL

timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt  > NUL"

combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt

command took 0:1:25.87 (85.87s total)

sequential writing to file

timecmd "type combined_data_1.txt combined_data_2.txt combined_data_3.txt combined_data_4.txt  > test.csv"

combined_data_1.txt
combined_data_2.txt
combined_data_3.txt
combined_data_4.txt

command took 0:2:42.93 (162.93s total)

parallel

timecmd "type combined_data_1.txt > NUL & type combined_data_2.txt > NUL & type combined_data_3.txt >NUL & type combined_data_4.txt  > NUL"
command took 0:1:25.51 (85.51s total)

In this case you're not gaining anything by using asyncio for two reasons:

  • asyncio is single-threaded and doesn't parallelize processing (and, in Python, neither can threads)
  • the IO calls access the file system, which asyncio doesn't cover - it is primarily about network IO

The giveaway that you're not using asyncio correctly is the fact that your read_one coroutine doesn't contain a single await. That means that it never suspends execution, and that it will run to completion before ever yielding to another coroutine. Making it an ordinary function (and dropping asyncio altogether) would have the exact same result.

Here is a rewritten version of the script with the following changes:

  • byte IO throughout, for efficiency
  • iterates through the file rather than loading all at once
  • sequential code
import sys

def process(in_filename, outfile):
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            outfile.write(b'%s,%s' % (prefix, line))

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with open(out_file, 'wb') as out:
        for fn in in_files:
            process(fn, out)

if __name__ == '__main__':
    main()

On my machine and Python 3.7, this version performs at approximately 22 s/GiB, tested on four randomly generated files, of 550 MiB each. It has a negligible memory footprint because it never loads the whole file into memory.

The script runs on Python 2.7 unchanged, where it clocks at 27 s/GiB. PyPy (6.0.0) runs it much faster, taking only 11 s/GiB.

Using concurrent.futures in theory ought to allow processing in one thread while another is waiting for IO, but the result ends up being significantly slower than the simplest sequential approach.
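
For reference, the kind of concurrent.futures variant meant here might look roughly like the sketch below (an illustration, not the exact benchmark code): each thread turns one input file into a single bytes blob, and the main thread writes the blobs in input order. Buffering whole files in memory like this is one reason it can end up slower than the plain sequential loop.

import sys
from concurrent.futures import ThreadPoolExecutor

def transform(in_filename):
    # Same block-prefix logic as process() above, but the result is
    # accumulated in memory and returned instead of being written directly.
    chunks = []
    prefix = b''
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            chunks.append(b'%s,%s' % (prefix, line))
    return b''.join(chunks)

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]
    with ThreadPoolExecutor(max_workers=len(in_files) or 1) as pool:
        with open(out_file, 'wb') as out:
            # map() yields results in submission order, so the output
            # ordering matches the sequential version.
            for blob in pool.map(transform, in_files):
                out.write(blob)

if __name__ == '__main__':
    main()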

You want to read 2 GiB and write 2 GiB with low elapsed time and low memory consumption. Parallelism, for core and for spindle, matters. Ideally you would tend to keep all of them busy. I assume you have at least four cores available. Chunking your I/O matters, to avoid excessive malloc'ing.

Start with the simplest possible thing. Please make some measurements and update your question to include them.

sequential

Please make sequential timing measurements of

$ cat combined_data_[1234].csv > /dev/null

and

$ cat combined_data_[1234].csv > big.csv

I assume you will have low CPU utilization, and thus will be measuring read & write I/O rates.

parallel

Please make parallel I/O measurements:

cat combined_data_1.csv > /dev/null &
cat combined_data_2.csv > /dev/null &
cat combined_data_3.csv > /dev/null &
cat combined_data_4.csv > /dev/null &
wait

This will let you know if overlapping reads offers a possibility for speedup. For example, putting the files on four different physical filesystems might allow this -- you'd be keeping four spindles busy.

async

Based on these timings, you may choose to ditch async I/O, and instead fork off four separate python interpreters.
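
A hedged sketch of that idea using the standard multiprocessing module (rather than forking interpreters by hand): each worker process converts one input file into its own partial output file, and the parent concatenates the parts afterwards. The .part naming and the hard-coded file list are only illustrative.

import shutil
from multiprocessing import Pool

def convert(in_filename):
    # Convert one input file into a partial CSV written next to it.
    part_name = in_filename + '.part'
    prefix = b''
    with open(in_filename, 'rb') as r, open(part_name, 'wb') as out:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]            # remember the current block id
                continue
            out.write(b'%s,%s' % (prefix, line))
    return part_name

def main():
    in_files = ['combined_data_1.txt', 'combined_data_2.txt',
                'combined_data_3.txt', 'combined_data_4.txt']
    with Pool(processes=len(in_files)) as pool:
        parts = pool.map(convert, in_files)
    with open('output.csv', 'wb') as out:
        for part in parts:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, out)  # stitch the parts together in order

if __name__ == '__main__':    # the guard is required for multiprocessing on Windows
    main()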

logic

        content = file.read().split(":")

This is where much of your large memory footprint comes from. Rather than slurping in the whole file at once, consider reading by lines, or in chunks. A generator might offer you a convenient API for that.
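
As a sketch of the generator idea, something along these lines yields one finished CSV line at a time instead of building the whole list, so memory use stays flat regardless of file size:

def csv_lines(filename):
    # Yield "<film_id>,<entry>" lines one at a time; nothing beyond the
    # current line is kept in memory.
    film_id = ''
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line.endswith(':'):
                film_id = line[:-1]          # start of a new block
            elif line:
                yield '%s,%s\n' % (film_id, line)

# Possible usage: stream the generator straight into the output file.
# with open('output.csv', 'w') as out:
#     for name in file_list:
#         out.writelines(csv_lines(name))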

EDIT:

compression

It appears that you are I/O bound -- you have idle cycles while waiting on the disk. If the final consumer of your output file is willing to do decompression, then consider using gzip, xz/lzma, or snappy. The idea is that most of the elapsed time is spent on I/O, so you want to manipulate smaller files to do less I/O. This benefits your script when writing 2 GiB of output, and may also benefit the code that consumes that output.

As a separate item, you might possibly arrange for the code that produces the four input files to produce compressed versions of them.
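
If the consumer of output.csv can read gzip, the change to a line-by-line script is small. Here is a minimal sketch using the standard gzip module; the parsing mirrors the block-prefix logic used earlier in this thread, and compresslevel=1 is an arbitrary choice that favours speed over compression ratio.

import gzip
import sys

def process(in_filename, outfile):
    # Lines ending in ':' carry the id; all other lines get it as a prefix.
    prefix = b''
    with open(in_filename, 'rb') as r:
        for line in r:
            if line.endswith(b':\n'):
                prefix = line[:-2]
                continue
            outfile.write(b'%s,%s' % (prefix, line))

def main():
    in_files = sys.argv[1:-1]
    out_file = sys.argv[-1]                   # e.g. output.csv.gz
    with gzip.open(out_file, 'wb', compresslevel=1) as out:
        for fn in in_files:
            process(fn, out)

if __name__ == '__main__':
    main()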

I have tried to solve your problem. I think this is a very easy and simple way if you don't have any prior knowledge of any special library.

I just took 2 input files named input.txt & input2.txt with the following contents.

Note: All files are in the same directory.

input.txt

13368:
2385003,4,2004-07-08
659432,3,2005-03-16
13369:
751812,2,2002-12-16
2625420,2,2004-05-25

input2.txt

13364:
2385001,5,2004-06-08
659435,1,2005-03-16
13370:
751811,2,2023-12-16
2625220,2,2015-05-26

I have written the code in a modular way so that you can easily import and use it in your project. Once you run the code below from the terminal using python3 csv_writer.py, it will read all the files provided in the list file_names and generate output.csv with the result that you're looking for.

csv_writer.py

# https://stackoverflow.com/questions/55226823/reduce-runtime-file-reading-string-manipulation-of-every-line-and-file-writing
import re

def read_file_and_get_output_lines(file_names):
    output_lines = []

    for file_name in file_names:
        with open(file_name) as f:
            lines = f.readlines()
            for new_line in lines:
                new_line = new_line.strip()

                # skip empty lines so they don't produce bogus rows
                if not new_line:
                    continue

                if not re.match(r'^\d+:$', new_line):
                    output_line = [old_line]
                    output_line.extend(new_line.split(","))
                    output_lines.append(output_line)
                else:
                    old_line = new_line.rstrip(":")

    return output_lines

def write_lines_to_csv(output_lines, file_name):
    with open(file_name, "w+") as f:
        for arr in output_lines:
            line = ",".join(arr)
            f.write(line + '\n')

if __name__ == "__main__":
    file_names = [
        "input.txt",
        "input2.txt"
    ]

    output_lines = read_file_and_get_output_lines(file_names)
    print(output_lines)
    # [['13368', '2385003', '4', '2004-07-08'], ['13368', '659432', '3', '2005-03-16'], ['13369', '751812', '2', '2002-12-16'], ['13369', '2625420', '2', '2004-05-25'], ['13364', '2385001', '5', '2004-06-08'], ['13364', '659435', '1', '2005-03-16'], ['13370', '751811', '2', '2023-12-16'], ['13370', '2625220', '2', '2015-05-26']]

    write_lines_to_csv(output_lines, "output.csv")

output.csv

13368,2385003,4,2004-07-08
13368,659432,3,2005-03-16
13369,751812,2,2002-12-16
13369,2625420,2,2004-05-25
13364,2385001,5,2004-06-08
13364,659435,1,2005-03-16
13370,751811,2,2023-12-16
13370,2625220,2,2015-05-26
