
Subprocess: Can't convert '_io.BufferedReader' object to str implicitly

I am writing a script that combines Snakemake and Python code to automate the processing of a large number of files that come in pairs. More precisely, I am aligning paired-end reads with BWA MEM ( http://bio-bwa.sourceforge.net/bwa.shtml ). In the first part of the script, I iterate over the names of my files (which are bzip2-compressed fastq files) and sort them into a list. Here's a quick look at some of the files:

['NG-8653_1A_lib95899_4332_7_1', 'NG-8653_1A_lib95899_4332_7_2', 'NG-8653_1B_lib95900_4332_7_1', 'NG-8653_1B_lib95900_4332_7_2', 'NG-8653_1N_lib95898_4332_7_1', 'NG-8653_1N_lib95898_4332_7_2']

As you can see, the reads are sorted two by two (1A_..._1 and 1A_..._2, etc.). Now, using subprocess, I want to align them by decompressing them with bunzip2 and then passing them through stdin to bwa mem. The bwa mem command converts the fastq files to .sam output, which I then have to convert to .bam format with samtools. Here's the script so far:

import re, os, subprocess, bz2

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "/home/alaa/Documents/snakemake/fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"

FILE_FASTQ = glob_wildcards("fastq/{samples}.fastq.bz2")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append(x)

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)

rule fastq_to_bam:
    run:
        for x in range(0, len(LIST_FILE_SAMPLES), 2):
            # get the name of the sample (1A, 1B ...)
            samp = ""
            samp += LIST_FILE_SAMPLES[x].split("_")[1]

            # get the corresponding read (1 or 2)
            r1 = SAMPLESDIR + LIST_FILE_SAMPLES[x] + ".fastq.bz2"
            r2 = SAMPLESDIR + LIST_FILE_SAMPLES[x+1] + ".fastq.bz2"

            # decompress the files with bunzip2 and pipe the output
            p1 = subprocess.Popen(['bunzip2', '-kc', r1], stdout=subprocess.PIPE)
            p2 = subprocess.Popen(['bunzip2', '-kc', r2], stdout=subprocess.PIPE)           


            # now write the output file to .bam format after aligning them
            with open("sam/" + samp + ".bam", "w") as stdout:
                fastq2sam = subprocess.Popen(["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout], stdout=subprocess.PIPE)
                fastq2samOutput = subprocess.Popen(["samtools", "view", "-Sb", "-"], shell = True, stdin=fastq2sam.stdout, stdout=stdout)

I was trying to debug the script line by line. Writing the bunzip2 output to a file worked fine. But when I try to pipe it, I get an error:

Error in job fastq_to_bam while creating output file .
RuleException:
TypeError in line 39 of /home/alaa/Documents/snakemake/Snakefile:
Can't convert '_io.BufferedReader' object to str implicitly
  File "/home/alaa/Documents/snakemake/Snakefile", line 39, in __rule_fastq_to_bam
  File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
  File "/usr/lib/python3.5/subprocess.py", line 1490, in _execute_child
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
 Exiting because a job execution failed. Look above for error message
 Will exit after finishing currently running jobs.
 Exiting because a job execution failed. Look above for error message

Can you please tell me what the problem with the script is? I have been looking for it since this morning and I can't seem to figure it out. Any help is much appreciated. Thanks in advance.

EDIT 1:

After reading more about the feedback from @bli and @Johannes, I have made it this far:

import re, os, subprocess, bz2, multiprocessing
from os.path import join
from contextlib import closing

WDIR = "/home/alaa/Documents/snakemake"
workdir: WDIR
SAMPLESDIR = "fastq/"
REF = "/home/alaa/Documents/inputs/reference/hg19_ref_genome.fa"


FILE_FASTQ = glob_wildcards("fastq/{samples, NG-8653_\d+[a-zA-Z]+_.+}")
LIST_FILE_SAMPLES = []

for x in FILE_FASTQ[0]:
    LIST_FILE_SAMPLES.append("_".join(x.split("_")[0:5]))

LIST_FILE_SAMPLES = sorted(LIST_FILE_SAMPLES)
print(LIST_FILE_SAMPLES)


rule final:
    input:
        expand('bam/' + '{sample}.bam', sample = LIST_FILE_SAMPLES)

rule bunzip_fastq:
    input:
        r1 = SAMPLESDIR + '{sample}_1.fastq.bz2',
        r2 = SAMPLESDIR + '{sample}_2.fastq.bz2'
    output:
        o1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        o2 = SAMPLESDIR + '{sample}_r2.fastq.gz'
    shell:
        """
        bunzip2 -kc < {input.r1} | gzip -c > {output.o1}
        bunzip2 -kc < {input.r2} | gzip -c > {output.o2}
        """

rule fastq_to_bam:
    input:
        r1 = SAMPLESDIR + '{sample}_r1.fastq.gz',
        r2 = SAMPLESDIR + '{sample}_r2.fastq.gz',
        ref = REF
    output:
        'bam/' + '{sample}.bam'
    shell:
        """
        bwa mem {input.ref} {input.r1} {input.r2} | samtools view -b - > {output}
        """

Thanks a lot for your help! I think I can manage from here on.

Best regards, Alaa

Your problem is here:

["bwa", "mem", "-T 1", REF, p1.stdout, p2.stdout]

p1.stdout and p2.stdout are of type BufferedReader, but subprocess.Popen expects a list of strings. You could read the decompressed data with, e.g., p1.stdout.read(), but note that this returns bytes rather than a str, and that bwa mem expects file names as arguments, not file contents.
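To make the distinction concrete, here is a minimal sketch of both cases: putting a pipe object into the argument list (which raises TypeError) versus connecting one process to the next through stdin. It does not call bunzip2, bwa or samtools; the helpers producer_cmd and upper_cmd are hypothetical stand-ins that spawn small Python child processes, so the snippet runs anywhere.

```python
import subprocess
import sys


def producer_cmd(text):
    """Hypothetical stand-in for `bunzip2 -kc file`: a child that writes text to stdout."""
    return [sys.executable, "-c", "import sys; sys.stdout.write({!r})".format(text)]


def upper_cmd():
    """Hypothetical stand-in for a downstream tool that reads its data from stdin."""
    return [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def pipe_in_argv_raises():
    """Return True if putting a pipe object into the argument list raises TypeError."""
    p1 = subprocess.Popen(producer_cmd("x"), stdout=subprocess.PIPE)
    try:
        # Same mistake as in the question: p1.stdout is a BufferedReader, not a str.
        subprocess.Popen([sys.executable, "-c", "pass", p1.stdout])
    except TypeError:
        return True
    finally:
        p1.communicate()  # drain the pipe and reap the child
    return False


def chain_via_stdin(text):
    """Connect two processes like a shell pipe: producer | consumer."""
    p1 = subprocess.Popen(producer_cmd(text), stdout=subprocess.PIPE)
    p2 = subprocess.Popen(upper_cmd(), stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # the parent drops its copy so only p2 holds the read end
    out, _ = p2.communicate()
    p1.wait()
    return out.decode()


if __name__ == "__main__":
    print(pipe_in_argv_raises())
    print(chain_via_stdin("acgt"))
```

Note that stdin can carry only one stream, so this pattern covers a single input; for two paired files the tool still needs two file names (or some other mechanism, such as named pipes).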

However, please be aware that your approach is not the idiomatic way to use Snakemake. In fact, there is currently nothing in the script that really makes use of Snakemake's features.

With Snakemake, you would rather have a rule that processes a single sample with bwa mem, taking fastq as input and storing bam as output. See this example in the official Snakemake tutorial. It does exactly what you are trying to accomplish here, but with much less boilerplate. Simply let Snakemake do the job, don't try to reimplement this yourself.
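A per-sample rule needs one target per sample rather than one per file. As a rough plain-Python sketch (not Snakemake's own API), the sample names could be derived as below, assuming the files are named `<sample>_<read>.fastq.bz2` with read being 1 or 2; PAIR_PATTERN and paired_samples are hypothetical names introduced for illustration.

```python
import re

# Assumed naming scheme: <sample>_<read>.fastq.bz2, where <read> is 1 or 2.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.bz2$")


def paired_samples(filenames):
    """Return the sorted sample prefixes for which both read 1 and read 2 exist."""
    reads_seen = {}
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        reads_seen.setdefault(match.group("sample"), set()).add(match.group("read"))
    return sorted(s for s, reads in reads_seen.items() if reads == {"1", "2"})


files = [
    "NG-8653_1A_lib95899_4332_7_1.fastq.bz2",
    "NG-8653_1A_lib95899_4332_7_2.fastq.bz2",
    "NG-8653_1B_lib95900_4332_7_1.fastq.bz2",  # its mate is missing, so it is dropped
]
samples = paired_samples(files)
print(samples)
```

A list built this way contains each sample once, so it could be handed to the target rule without duplicate entries.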
