
Python subprocess - saving output in a new file

I use the following command to reformat a file; it creates a new file:

sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' toto> toto.json

It works fine on the command line.

I am trying to run it from a Python script, but it does not create the new file.

I tried:

subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/ ",sys.argv[1], " > ",sys.argv[2]]) 

The problem is that it prints the output to stdout and raises an error:

sed: can't read >: No such file or directory
Traceback (most recent call last):
File "test.py", line 14, in <module>
subprocess.call(["sed", "-e","1s/^/[/","-e", "s/$/,/","-e","$s/,$/]/", 
sys.argv[1], ">",sys.argv[2])
File "C:\Users\Anaconda3\lib\subprocess.py", line 291, in 
check_call raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sed', '-e', '1s/^/[/', '-e', 
's/$/,/', '-e', '$s/,$/]/', 'toto.txt, '>', 'toto.json']' returned non-zero 
exit status 2.

I read the other questions about subprocess and tried other commands with the option shell=True, but that did not work either. I use Python 3.6.

For information, the command adds an opening bracket at the start of the first line and a closing bracket at the end of the last line, and adds a comma at the end of every line except the last one. So it goes:

from
a
b
c

to:

[a,
b,
c]

On Linux and other Unix systems, redirection characters are not part of the command; they are interpreted by the shell, so it makes no sense to pass them as arguments to a subprocess.
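A minimal sketch of what that means in practice: when the arguments are given as a list, `>` reaches the child process as an ordinary argument. The child here is a small Python one-liner standing in for sed (an assumption made only so the example runs without sed installed), and `in.txt` and `out.txt` are hypothetical file names.

```python
import subprocess
import sys

# Stand-in child process: it just prints the arguments it received.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# No shell is involved, so ">" and "out.txt" are passed through verbatim.
result = subprocess.run(
    child + ["in.txt", ">", "out.txt"],
    stdout=subprocess.PIPE,
    universal_newlines=True,  # decode the output as text (works on Python 3.6)
)
received = result.stdout.strip()
print(received)
```

sed treats those extra arguments the same way, as input file names, which is where "can't read >" comes from.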

Fortunately, subprocess.call accepts a file object for its stdout parameter. So you should do:

with open(sys.argv[2], "w") as out:  # the file is closed once sed has finished
    subprocess.call(["sed", "-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/", sys.argv[1]],
                    stdout=out)
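For completeness, the shell can still perform the redirection if the whole command is passed as a single string together with shell=True. The following is only a sketch under assumptions: the helper names `build_shell_command` and `convert_via_shell` are made up for illustration, and `shlex.quote` targets POSIX shells, so this is not expected to work as-is under cmd.exe on Windows.

```python
import shlex
import subprocess

SED_ARGS = ["-e", "1s/^/[/", "-e", "s/$/,/", "-e", "$s/,$/]/"]


def build_shell_command(src, dst):
    """Return one shell command string, redirection included (hypothetical helper)."""
    parts = ["sed"] + SED_ARGS + [src]
    # Quote every argument so the shell does not expand "$" or "[".
    quoted = " ".join(shlex.quote(part) for part in parts)
    return quoted + " > " + shlex.quote(dst)


def convert_via_shell(src, dst):
    """Run sed through the shell, which interprets ">" (hypothetical helper)."""
    subprocess.run(build_shell_command(src, dst), shell=True, check=True)


print(build_shell_command("toto", "toto.json"))
```

Passing a file object as stdout, as above, avoids the quoting step entirely, which is why it is usually the simpler choice.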

Don't do that. Don't make OS calls if you can avoid them.

If you are using Python, just write a Pythonic script.

Something like:

input_filename = 'toto'
output_filename = 'toto.json'

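with open(input_filename, 'r') as inputf:
    # strip only the line ending so the content of each line is preserved
    lines = [line.rstrip('\n') for line in inputf]

with open(output_filename, 'w') as outputf:
    # joining with ",\n" leaves the last line without a trailing comma
    outputf.write('[' + ',\n'.join(lines) + ']\n')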

It basically does the same as your command line.

Trust me, this piece of code is kind of dirty and for example purposes only. I advise you to write your own and avoid one-liners like I did.
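For a large input, a variant that does not hold every line in memory may be preferable. This is a sketch only: `bracket_lines` is a hypothetical name, and producing `[]` for an empty input is a choice made here, not something the sed command does. The temporary files at the end exist just to show the function on the a/b/c example from the question.

```python
import os
import tempfile


def bracket_lines(src_path, dst_path):
    """Copy src_path to dst_path, wrapping it in [ ] with a comma after each line but the last."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        dst.write("[")
        previous = None
        for line in src:
            if previous is not None:
                # every line except the last one gets a trailing comma
                dst.write(previous + ",\n")
            previous = line.rstrip("\n")
        if previous is not None:
            dst.write(previous)  # last line: no comma
        dst.write("]\n")


# Try it on the three-line example from the question.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toto")
    dst = os.path.join(tmp, "toto.json")
    with open(src, "w") as f:
        f.write("a\nb\nc\n")
    bracket_lines(src, dst)
    with open(dst) as f:
        result = f.read()
print(result)
```

Only one line is kept in memory at a time, at the cost of one write call per line.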

I had a hunch that Python could do this much faster than sed, but I didn't have time to check until now, so... Based on your comment on Arount's answer:

my real file is actually quite big, the command line is way faster than a python script

That's not necessarily true. In fact, in your case I suspected that Python could do it many, many times faster than sed, because with Python you are not limited to iterating over your file through a line buffer, nor do you need a full-blown regex engine just to find the line separators.

I'm not sure how big your file is, but I generated my test example with:

with open("example.txt", "w") as f:
    for i in range(10**8):  # I would consider 100M lines as "big" enough for testing
        print(i, file=f)

This creates a 100M-line (888.9 MB) file with a different number on each line.

Now, timing your sed command alone, running at the highest priority (chrt -f 99), results in:
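The file size can be checked without generating the file. The sketch below counts digits by width and adds one byte per newline; `expected_size` is a hypothetical helper, and it assumes one-byte "\n" line endings as on Linux.

```python
def expected_size(exponent):
    """Bytes written by print(i, file=f) for i in range(10**exponent)."""
    digits = 1  # the single digit of "0"
    for width in range(1, exponent + 1):
        # there are 9 * 10**(width - 1) positive integers with `width` digits
        digits += width * 9 * 10 ** (width - 1)
    newlines = 10 ** exponent  # one newline per line, assumed to be one byte
    return digits + newlines


size = expected_size(8)
print(size, "bytes, about", round(size / 1e6, 1), "MB")
```

For 10**8 lines this gives 888,888,890 bytes, in line with the 888.9 MB figure.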

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> sed -e '1s/^/[/' -e 's/$/,/' -e '$s/,$/]/' example.txt > output.txt
    Command being timed: "sed -e 1s/^/[/ -e s/$/,/ -e $s/,$/]/ example.txt"
    User time (seconds): 56.89
    System time (seconds): 1.74
    Percent of CPU this job got: 98%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:59.28
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 1044
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 313
    Voluntary context switches: 7
    Involuntary context switches: 29
    Swaps: 0
    File system inputs: 1140560
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

The result would be even worse if you actually called it from Python, as it would also come with the subprocess and STDOUT redirection overhead.

However, if we leave it to Python to do all the work instead of sed:

import sys

CHUNK_SIZE = 1024 * 64  # 64k, tune this to the FS block size / platform for best performance

with open(sys.argv[2], "w") as f_out:  # open the file from second argument for writing
    f_out.write("[")  # start the JSON array
    with open(sys.argv[1], "r") as f_in:  # open the file from the first argument for reading
        chunk = None
        last_chunk = ''  # keep a track of the last chunk so we can remove the trailing comma
        while True:
            chunk = f_in.read(CHUNK_SIZE)  # read the next chunk
            if chunk:
                f_out.write(last_chunk)  # write out the last chunk
                last_chunk = chunk.replace("\n", ",\n")  # process the new chunk
            else:  # EOF
                break
    last_chunk = last_chunk.rstrip()  # clear out the trailing whitespace
    if last_chunk.endswith(","):  # clear out the trailing comma (also safe on empty input)
        last_chunk = last_chunk[:-1]
    f_out.write(last_chunk)  # write the last chunk
    f_out.write("]")  # end the JSON array

without ever touching the shell, results in:

[zwer@testbed ~]$ sudo chrt -f 99 /usr/bin/time --verbose \
> python process_file.py example.txt output.txt
    Command being timed: "python process_file.py example.txt output.txt"
    User time (seconds): 1.75
    System time (seconds): 0.72
    Percent of CPU this job got: 93%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.65
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 4716
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 3
    Minor (reclaiming a frame) page faults: 14835
    Voluntary context switches: 16
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 3120
    File system outputs: 1931424
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And given the CPU utilization, the bottleneck is actually I/O; left to its own devices (or working from very fast storage instead of a virtualized HDD as on my testbed), Python could do it even faster.

So, sed took 32.5 times longer than Python in user CPU time (56.89 s vs 1.75 s) and about 22 times longer in wall-clock time (59.28 s vs 2.65 s) to do the same task. Even if you were to optimize your sed command a bit, Python would still be faster, because sed is limited to a line buffer, so a lot of time is wasted on input I/O (compare the numbers in the benchmarks above), and there's no (easy) way around that.

Conclusion: Python is way faster than sed for this particular task.
