
add header to stdout of a subprocess in python

I am merging several dataframes into one and sorting them using unix sort. Before I write the final sorted data I would like to add a prefix/header to that output.

So, my code is something like:

import subprocess

my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])

my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]

with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')  
    subprocess.run(my_cmd, stdout=sort_data)

But, the code above puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).

I also tried substituting:

sort_data=io.StringIO(my_cols)  

but this gives me an error, as I had expected.


How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.

The problem with your code is a matter of buffering; the tl;dr is that you can fix it like this:

sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
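For completeness, here is a minimal self-contained sketch of the fixed approach. The input file, its contents, and the column names are stand-ins for those in the question, and it assumes a GNU `sort` that supports `-V`:

```python
import os
import subprocess
import tempfile

# Stand-in for the question's final_merged.txt: a few unsorted rows.
tmpdir = tempfile.mkdtemp()
merged = os.path.join(tmpdir, 'final_merged.txt')
with open(merged, 'w') as f:
    f.write('chr2\t100\tA\nchr1\t200\tG\nchr1\t50\tC\n')

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])
out_path = os.path.join(tmpdir, 'mergedAndSorted.txt')

with open(out_path, 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    sort_data.flush()  # push the header to disk before sort writes to the same fd
    subprocess.run(['sort', '-k1,2', '-V', merged],
                   stdout=sort_data, check=True)
```

Because the header is flushed first, the file offset has already advanced past it by the time `sort` starts writing, so the sorted rows land after the header.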

If you want to understand why it happens, and how the fix solves it:

When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream encoding from Unicode to bytes going on, but that doesn't really add a new problem; it just adds two layers where the same thing can happen, so let's ignore it.)

As long as all of your writes go to the buffered file object, that's fine: they get sequenced properly in the buffer, so they get sequenced properly on the disk.

But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
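The reordering is easy to reproduce without subprocess at all; a small demo (filename is arbitrary) that writes once through the buffered file object and once straight to its file descriptor:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('header\n')              # lands in Python's buffer, not on disk yet
    os.write(f.fileno(), b'body\n')  # bypasses the buffer, hits the fd at offset 0
# closing the file flushes the buffer -- after 'body\n' is already on disk

print(open(path).read())  # 'body\n' comes out first, then 'header\n'
```

The unflushed `'header\n'` only reaches the disk when the file is closed, by which point the raw `os.write` has already claimed the start of the file.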

And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but it can be inferred from Frequently Used Arguments:

stdin, stdout and stderr specify the executed program's standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.

This implies pretty strongly (if you know enough about the way piping works on *nix and Windows) that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it calls fileno or msvcrt.get_osfhandle on the file objects.
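If you'd rather sidestep the buffer/descriptor interleaving entirely, one alternative (a sketch, not the answer's method; filenames and columns are stand-ins) is to capture sort's output back into Python and do every write through a single file object:

```python
import os
import subprocess
import tempfile

tmpdir = tempfile.mkdtemp()
merged = os.path.join(tmpdir, 'final_merged.txt')
with open(merged, 'w') as f:
    f.write('chr2\t100\tA\nchr1\t50\tC\n')

# capture_output routes sort's stdout back into Python as a string
result = subprocess.run(['sort', '-k1,2', '-V', merged],
                        capture_output=True, text=True, check=True)

out_path = os.path.join(tmpdir, 'mergedAndSorted.txt')
with open(out_path, 'w') as out:
    out.write('\t'.join(['CHROM', 'POS', 'REF']) + '\n')
    out.write(result.stdout)  # same buffer for both writes, so order is preserved
```

The trade-off is that the whole sorted output is held in memory, which matters for very large merged files; the flush fix above avoids that.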
