add header to stdout of a subprocess in python
I am merging several dataframes into one and sorting them using Unix sort. Before I write the final sorted data, I would like to add a prefix/header to that output.

So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the above code puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).

I also tried substituting:

sort_data=io.StringIO(my_cols)

but this gives me an error, as I had expected.

How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tl;dr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
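Put together as a self-contained sketch (the two-row input file and its contents are invented here so the example can run on its own; the real column list is elided in the question):

```python
import subprocess

# Create a tiny unsorted input so the sketch is runnable on its own.
with open('final_merged.txt', 'w') as fh:
    fh.write('chr2\t100\tG\nchr1\t200\tA\n')

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])  # remaining columns elided in the question
my_cmd = ['sort', '-k1,2', '-V', 'final_merged.txt']

with open('mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    sort_data.flush()  # push the header out of the buffer before sort starts writing
    subprocess.run(my_cmd, stdout=sort_data)
```

With the flush in place, the header line lands on disk before sort's output, so it ends up first in mergedAndSorted.txt.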
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream encoding from Unicode to bytes going on, but that doesn't really add a new problem; it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes go to the buffered file object, that's fine: they get sequenced properly in the buffer, so they get sequenced properly on disk.
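You can see the buffering in action with a small demonstration (the temp-file name here is just for illustration): the bytes only reach the disk once the buffer is flushed.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(path, 'w')
f.write('header\n')                   # goes into the Python-level buffer
size_before = os.path.getsize(path)   # 0: nothing has reached the disk yet
f.flush()
size_after = os.path.getsize(path)    # 7: 'header\n' is now on disk
f.close()
```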
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
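A minimal sketch of that reordering, mixing a buffered write with a direct write to the file descriptor (file name invented for the demonstration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(path, 'w')
f.write('first\n')                  # sits in the Python-level buffer
os.write(f.fileno(), b'second\n')   # bypasses the buffer, lands on disk immediately
f.close()                           # closing flushes the buffer, after 'second'
with open(path) as fh:
    content = fh.read()
# content is 'second\nfirst\n': the fd-level write got ahead of the buffered one
```

This is exactly the race the flush in the fix avoids: flushing empties the Python-level buffer before anything else writes to the same descriptor.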
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program's standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly, if you know enough about the way piping works on *nix and Windows, that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.
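If you'd rather sidestep the descriptor/buffer interaction entirely, one alternative (not in the original answer, just a sketch with an invented file name and inline input) is to capture sort's output in Python and write the header and body through a single file object:

```python
import subprocess

# Capture the sorted output, then write header + body via one buffered file,
# so there is only one writer and no ordering question at all.
result = subprocess.run(['sort'], input='b\na\n',
                        capture_output=True, text=True, check=True)
with open('out.txt', 'w') as fh:
    fh.write('COL\n')
    fh.write(result.stdout)
```

The trade-off is that capture_output holds the entire sorted output in memory, so for very large merged files the flush-before-run approach is the better fit.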