add header to stdout of a subprocess in python
I am merging several dataframes into one and sorting them using Unix sort. Before I write the final sorted data, I would like to add a prefix/header to that output.

So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the above code puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).

I also tried substituting:

sort_data=io.StringIO(my_cols)

but this gives me an error, as I had expected.

How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tl;dr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
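Put together as a self-contained sketch (the two-row input file and its contents are invented here so the example can run on its own; the real column list is elided in the question):

```python
import subprocess

# Create a tiny unsorted input so the sketch is runnable on its own.
with open('final_merged.txt', 'w') as fh:
    fh.write('chr2\t100\tG\nchr1\t200\tA\n')

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])  # remaining columns elided in the question
my_cmd = ['sort', '-k1,2', '-V', 'final_merged.txt']

with open('mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    sort_data.flush()  # push the header out of the buffer before sort starts writing
    subprocess.run(my_cmd, stdout=sort_data)
```

With the flush in place, the header line lands on disk before sort's output, so it ends up first in mergedAndSorted.txt.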
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream encoding from Unicode to bytes going on, but that doesn't really add a new problem; it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes go to the buffered file object, that's fine: they get sequenced properly in the buffer, so they get sequenced properly on disk.
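You can see the buffering in action with a small demonstration (the temp-file name here is just for illustration): the bytes only reach the disk once the buffer is flushed.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(path, 'w')
f.write('header\n')                   # goes into the Python-level buffer
size_before = os.path.getsize(path)   # 0: nothing has reached the disk yet
f.flush()
size_after = os.path.getsize(path)    # 7: 'header\n' is now on disk
f.close()
```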
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
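A minimal sketch of that reordering, mixing a buffered write with a direct write to the file descriptor (file name invented for the demonstration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(path, 'w')
f.write('first\n')                  # sits in the Python-level buffer
os.write(f.fileno(), b'second\n')   # bypasses the buffer, lands on disk immediately
f.close()                           # closing flushes the buffer, after 'second'
with open(path) as fh:
    content = fh.read()
# content is 'second\nfirst\n': the fd-level write got ahead of the buffered one
```

This is exactly the race the flush in the fix avoids: flushing empties the Python-level buffer before anything else writes to the same descriptor.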
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program's standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly, if you know enough about the way piping works on *nix and Windows, that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.
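If you'd rather sidestep the descriptor/buffer interaction entirely, one alternative (not in the original answer, just a sketch with an invented file name and inline input) is to capture sort's output in Python and write the header and body through a single file object:

```python
import subprocess

# Capture the sorted output, then write header + body via one buffered file,
# so there is only one writer and no ordering question at all.
result = subprocess.run(['sort'], input='b\na\n',
                        capture_output=True, text=True, check=True)
with open('out.txt', 'w') as fh:
    fh.write('COL\n')
    fh.write(result.stdout)
```

The trade-off is that capture_output holds the entire sorted output in memory, so for very large merged files the flush-before-run approach is the better fit.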