STAR error in snakemake pipeline: "EXITING because of FATAL ERROR: could not open genome file"

Question

I'm trying to use a two-pass STAR mapping strategy (also explained here: https://informatics.fas.harvard.edu/rsem-example-on-odyssey.html ), but I'm getting an error.

I've read through this page [https://github.com/alexdobin/STAR/issues/181] and I have a similar issue, but the solutions discussed there don't seem to help. Perhaps this is more of a snakemake issue than a STAR issue, which is why I'm asking here.

I use STAR version 2.7.10a on an HPC cluster. I'm running a snakemake file in which I map human and mouse samples with STAR two-pass mapping, and I get the following error:

['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2] ['GRCh38.106', 'GRCh38.106', 'GRCm39.107', 'GRCm39.107'] ['GRCh38', 'GRCh38', 'GRCm39', 'GRCm39']
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
The flag 'directory' used in rule all is only valid for outputs, not inputs.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 56
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    2   RSEM_calculate_expression
    2   RSEM_prepare_reference
    1   all
    2   filter
    2   genebody_coverage
    2   infer_strandedness
    4   rawFastqc
    2   run_multiqc
    2   samtools_index_bam
    1   starIndex
    2   star_1pass_alignment
    2   star_2pass_alignment
    2   ucsc_gtftobed_Homo_sapiens
    26
 
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
    input: /DATA//resources/Homo_sapiens.GRCh38.dna.primary_assembly.fa, /DATA//resources/Homo_sapiens.GRCh38.106.gtf
    output: /DATA//resources/RSEM_ref/Homo_sapiens_GRCh38.106/Homo_sapiens_GRCh38.106_GRCh38_rsem_ref.seq
    jobid: 19
    wildcards: species=Homo_sapiens, gtf_version=GRCh38.106, fa_version=GRCh38
    threads: 12
 
[Tue Aug 16 13:09:01 2022]
rule RSEM_prepare_reference:
    input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
    output: /DATA//resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39_rsem_ref.seq
    jobid: 20
    wildcards: species=Mus_musculus, gtf_version=GRCm39.107, fa_version=GRCm39
    threads: 12
 
[Tue Aug 16 13:09:01 2022]
rule star_1pass_alignment:
    input: /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz, /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
    output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
    jobid: 10
    wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox, species=Mus_musculus
    threads: 12
 
 
[Tue Aug 16 13:09:01 2022]
rule starIndex:
    input: /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa, /DATA//resources/Mus_musculus.GRCm39.107.gtf
    output: /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107
    jobid: 2
    wildcards: species=Mus_musculus, fa_version=GRCm39, gtf_version=GRCm39.107
    threads: 20
 
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/d88d0970
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
Activating conda environment: /DATA//workflow/.snakemake/conda/b20308a2
rsem-extract-reference-transcripts {config[project_path]}+resources/RSEM_ref/Mus_musculus_GRCm39.107/Mus_musculus_GRCm39.107_GRCm39 0 /DATA//resources/Mus_musculus.GRCm39.107.gtf None 0 /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa
    STAR --runThreadN 20 --runMode genomeGenerate --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeFastaFiles /DATA//resources/Mus_musculus.GRCm39.dna.primary_assembly.fa --sjdbGTFfile /DATA//resources/Mus_musculus.GRCm39.107.gtf
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ... starting to generate Genome files
    STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
    STAR version: 2.7.10a   compiled: 2022-01-14T18:50:00-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 16 13:09:13 ..... started STAR run
Aug 16 13:09:13 ..... loading genome
 
EXITING because of FATAL ERROR: could not open genome file /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107//genomeParameters.txt
SOLUTION: check that the path to genome files, specified in --genomeDir is correct and the files are present, and have user read permsissions
 
Aug 16 13:09:13 ...... FATAL ERROR, exiting
[Tue Aug 16 13:09:13 2022]
Error in rule star_1pass_alignment:
    jobid: 10
    output: /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Aligned.sortedByCoord.out.bam, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_Log.final.out, /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_SJ.out.tab
    conda-env: /DATA//workflow/.snakemake/conda/d88d0970
    shell:
        
        STAR --runMode alignReads --genomeDir /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107 --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz /DATA//resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --runThreadN 12 --readFilesCommand gunzip -c --outFileNamePrefix /DATA//results/PRJNA362883_GSE93946_SRP097621/star_aligned_1pass/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I've attached my log file (log.txt). My Snakefile:

import pandas as pd
#import glob
import os
import fnmatch
import re
 
# --- Importing Configuration Files --- #
configfile: "/DATA//config/config.yaml"
 
#DATASET,SAMPLE,SPECIES,FRR, =glob_wildcards(config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz")
#print(DATASET,SAMPLE,SPECIES,FRR)
#SPECIES,GTF_VERSION,=glob_wildcards(config["project_path"]+"resources/{gtf_species}.{gtf_version}.gtf")
#SPECIES,FA_VERSION,=glob_wildcards(config["project_path"]+"resources/{fa_species}-{fa_version}.dna.primary_assembly.fa")
#print(SPECIES,GTF_VERSION)
 
table_cols = ['dataset','sample','species','frr','gtf_version','fa_version']
table_samples = pd.read_table('/DATA//config/samples.tsv', header=0, sep='\t', names=table_cols)
DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()
GTF_VERSION = table_samples.gtf_version.values.tolist()
FA_VERSION = table_samples.fa_version.values.tolist()
 
print(DATASET,SAMPLE,SPECIES,FRR,GTF_VERSION,FA_VERSION)
 
 
rule all:
        input:
                directory(expand(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}",zip, species=SPECIES, fa_version=FA_VERSION, gtf_version=GTF_VERSION)),
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR),
                expand(config["project_path"]+"results/{dataset}/rawQC/multiqc_report.html", dataset=DATASET),
                expand(config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.sortedByCoord.out.bam.bai", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"resources/{species}.{gtf_version}.bed", zip, species=SPECIES, gtf_version=GTF_VERSION),
                expand(config["project_path"]+"results/{dataset}/star_aligned_2pass/{sample}_{species}_Aligned.toTranscriptome.out.bam", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES),
                expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_{gtf_version}.geneBodyCoverage.txt", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION),
                expand(config["project_path"]+"resources/RSEM_ref/{species}_{gtf_version}/{species}_{gtf_version}_{fa_version}_rsem_ref.seq", zip, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION),
                expand(config["project_path"]+"results/{dataset}/RSEM/{sample}_{species}_{gtf_version}_{fa_version}.genes.results", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, gtf_version=GTF_VERSION, fa_version=FA_VERSION)
 
wildcard_constraints:
        dataset="|".join([re.escape(x) for x in DATASET]),
        sample="|".join([re.escape(x) for x in SAMPLE]),
        species="|".join([re.escape(x) for x in SPECIES]),
        gtf_version="|".join([re.escape(x) for x in GTF_VERSION]),
        fa_version="|".join([re.escape(x) for x in FA_VERSION])
 
## rule starIndex ##  Create star index if it does not exist yet
rule starIndex:
        input:
                fasta=config["project_path"]+"resources/{species}.{fa_version}.dna.primary_assembly.fa",
                gtf=config["project_path"]+"resources/{species}.{gtf_version}.gtf"
        output:
                directory(config["project_path"]+"resources/starIndex_{species}_{fa_version}_{gtf_version}")
        threads:
                20
        conda:
                "envs/DTPedia_bulkRNAseq.yaml"
        shell:
                """
                STAR --runThreadN {threads} --runMode genomeGenerate --genomeDir {output} --genomeFastaFiles {input.fasta} --sjdbGTFfile {input.gtf}
                """
 
# function determine_species_fasta # function for determining fasta input of correct species to rule starIndex
def determine_species(wildcards,input):
        if fnmatch.fnmatch(input.read1, '*Homo_sapiens*'):
                return directory(expand(rules.starIndex.output, species = "Homo_sapiens", fa_version="GRCh38", gtf_version="GRCh38.106"))
        elif fnmatch.fnmatch(input.read1, '*Mus_musculus*'):
                return directory(expand(rules.starIndex.output, species="Mus_musculus", fa_version="GRCm39", gtf_version="GRCm39.107"))
 
 
rule star_1pass_alignment:
        input:
                read1=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_1.fastq.gz",
                read2=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_2.fastq.gz"
        output:
                bam=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Aligned.sortedByCoord.out.bam",
                log=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_Log.final.out",
                sj_1pass=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_SJ.out.tab"
        params:
                index=determine_species,
                prefix=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_"
        threads:
                12
        conda:
                "envs/DTPedia_bulkRNAseq.yaml"
        shell:
                """
                STAR --runMode alignReads --genomeDir {params.index} --genomeLoad LoadAndKeep --outSAMtype BAM SortedByCoordinate --limitBAMsortRAM 10000000000 --limitGenomeGenerateRAM 20000000000 --readFilesIn {input.read1} {input.read2} --runThreadN {threads} --readFilesCommand gunzip -c --outFileNamePrefix {params.prefix}
                """

Output of ulimit -a:

$ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2063155
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2063155
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

It seems that the STAR first-pass rule is being run before the STAR index rule has finished, so STAR cannot find the genome index files in /DATA//resources/starIndex_Mus_musculus_GRCm39_GRCm39.107/ . It's not clear to me how to solve this. Am I doing something wrong in snakemake, in STAR, or in something else that I'm overlooking? Help would be very much appreciated, thanks!

EDIT 1: After modifying the code as @dariober suggested, the initial error disappeared. Now I'm getting the following error:

WorkflowError in line...: Function did not return str or list of str

I've modified the function determine_species according to the suggestions. It's getting closer, but it's not fully there yet:

def determine_species(wildcards):
        read1 = config["project_path"]+"resources/raw_datasets/{wildcards.dataset}/{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz"
        Hs_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Homo_sapiens_GRCh38_GRCh38.106"
        Ms_index = "/DATA/m.venkatesan/DTPedia/resources/starIndex_Mus_musculus_GRCm39_GRCm39.107"
        if fnmatch.fnmatch(read1, '*Homo_sapiens*'):
                return Hs_index
        elif fnmatch.fnmatch(read1, '*Mus_musculus*'):
                return Ms_index

Answer 1

The problem is that the input to your rule star_1pass_alignment is just the two fastq files:

rule star_1pass_alignment:
    input:
        read1=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_1.fastq.gz",   
        read2=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_2.fastq.gz"
    ...

However, STAR also requires the index of the reference genome, which you list as a parameter. Snakemake derives job dependencies only from input and output files, so it doesn't "see" that rule starIndex should finish before star_1pass_alignment starts. From snakemake's point of view, the alignment rule does not depend on starIndex at all, and the two jobs are free to run at the same time.

Therefore, the solution is to make the output of starIndex ( resources/starIndex_{species}_{fa_version}_{gtf_version} ) an input of star_1pass_alignment . Something like this (not tested):

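rule star_1pass_alignment:
    input:
        read1=...,
        read2=...,
        # The index directory is now an input, so starIndex must finish first
        index=determine_species,
    params:
        # The prefix is not a file, so it stays under params
        prefix=config["project_path"]+"results/{dataset}/star_aligned_1pass/{sample}_{species}_",
    ...
    # In the shell command, use --genomeDir {input.index}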

Rule priorities should not matter here: once the index is an input, snakemake orders the jobs by itself.
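To make the point about dependencies concrete, here is a toy model in plain Python. It is only a sketch of the idea that job order is derived from matching input and output files; it is not Snakemake's actual implementation, and the file names and the upstream_rules helper are made up for illustration.

```python
# Toy model: a rule depends on another rule only if one of its inputs
# is an output of that other rule. Values under "params" are ignored.
RULES = {
    "starIndex": {
        "input": ["genome.fa", "genes.gtf"],
        "output": ["starIndex_dir"],
    },
    "star_1pass_alignment": {
        "input": ["reads_1.fastq.gz", "reads_2.fastq.gz"],
        "output": ["sample.bam"],
        "params": {"index": "starIndex_dir"},
    },
}


def upstream_rules(rule_name, rules):
    """Return the names of rules whose outputs appear among this rule's inputs."""
    needed = set(rules[rule_name]["input"])
    return sorted(
        name
        for name, rule in rules.items()
        if name != rule_name and needed & set(rule["output"])
    )


# Index listed only under params: no dependency is detected.
before_fix = upstream_rules("star_1pass_alignment", RULES)

# Index listed as an input: starIndex becomes a prerequisite.
fixed = {name: dict(rule) for name, rule in RULES.items()}
fixed["star_1pass_alignment"]["input"] = (
    RULES["star_1pass_alignment"]["input"] + ["starIndex_dir"]
)
after_fix = upstream_rules("star_1pass_alignment", fixed)

print(before_fix)  # []
print(after_fix)   # ['starIndex']
```

This matches the log in the question, where starIndex and star_1pass_alignment were started in the same second.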

Regarding the error TypeError: determine_species() missing 1 required positional argument: 'input' (this should be a separate question): you defined determine_species to take two parameters, wildcards and input . However, when index=determine_species appears under input: , snakemake calls the function with only the wildcards object, hence the error (I find this a bit cryptic, but that's the way it is...). The solution is to build the read path inside the function itself from the wildcards object. Something like:

def determine_species(wildcards):
    # Fill in the {wildcards.*} placeholders; without .format() they stay literal text
    read1 = (config["project_path"]+"resources/raw_datasets/{wildcards.dataset}/{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz").format(wildcards=wildcards)
    if fnmatch.fnmatch(read1, '*Homo_sapiens*'):
        return expand(rules.starIndex.output, species = "Homo_sapiens", fa_version="GRCh38", gtf_version="GRCh38.106")
    elif fnmatch.fnmatch(read1, '*Mus_musculus*'):
        return expand(rules.starIndex.output, species="Mus_musculus", fa_version="GRCm39", gtf_version="GRCm39.107")
    # Fail loudly instead of silently returning None
    raise ValueError("No STAR index known for " + read1)

(NB: I haven't checked what this function actually does!)
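The "Function did not return str or list of str" error from the question's EDIT is most likely a case of the function returning None: a string containing {wildcards.species} that is never formatted does not contain the text Mus_musculus, so neither fnmatch branch fires. The sketch below shows this in plain Python and also a simpler variant that looks the index up directly from wildcards.species. SimpleNamespace stands in for Snakemake's wildcards object, and PROJECT_PATH and STAR_INDEX are hypothetical names introduced here, not part of the original Snakefile.

```python
import fnmatch
from types import SimpleNamespace

# Stand-in for config["project_path"]; "/DATA/" is an assumption for this sketch.
PROJECT_PATH = "/DATA/"

# Hypothetical lookup table mirroring the two index directories named in the question.
STAR_INDEX = {
    "Homo_sapiens": "resources/starIndex_Homo_sapiens_GRCh38_GRCh38.106",
    "Mus_musculus": "resources/starIndex_Mus_musculus_GRCm39_GRCm39.107",
}


def determine_species(wildcards):
    """Return the STAR index directory for the species wildcard."""
    species = wildcards.species
    if species not in STAR_INDEX:
        # Raise instead of falling through and returning None.
        raise ValueError(f"No STAR index configured for species {species!r}")
    return PROJECT_PATH + STAR_INDEX[species]


# Stand-in for the wildcards object Snakemake would pass in.
wc = SimpleNamespace(
    dataset="PRJNA362883_GSE93946_SRP097621",
    sample="SRR5195524_GSM2465521_KrasT_45649_NoDox",
    species="Mus_musculus",
)

template = (
    "resources/raw_datasets/{wildcards.dataset}/"
    "{wildcards.sample}_{wildcards.species}_RNA-Seq_1.fastq.gz"
)

# Unformatted template: the placeholder is literal text, so nothing matches.
matches_raw = fnmatch.fnmatch(template, "*Mus_musculus*")
# Formatted template: the species name is now part of the path.
matches_filled = fnmatch.fnmatch(template.format(wildcards=wc), "*Mus_musculus*")

index_dir = determine_species(wc)
print(matches_raw, matches_filled)
print(index_dir)
```

Since species is already a wildcard of the alignment rule, the dictionary lookup avoids rebuilding and pattern-matching the file name at all.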

If you need to use a function that takes additional parameters, you can use this construct:

def my_func(wildcards, arg1, arg2):
    # Do something with wildcards, arg1, arg2 and return a path or a list of paths
    ...

input:
    foo = lambda wildcards: my_func(wildcards, arg1, arg2),
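A runnable sketch of the same construct outside a Snakefile follows. The helper index_for and its arguments are hypothetical, and SimpleNamespace again stands in for the wildcards object. functools.partial is shown as a standard-library alternative to the lambda; it produces a callable that takes only wildcards, though it has not been tried inside an actual Snakefile here.

```python
from functools import partial
from types import SimpleNamespace


def index_for(wildcards, fa_version, gtf_version):
    """Hypothetical helper: build an index path from wildcards plus extra arguments."""
    return f"resources/starIndex_{wildcards.species}_{fa_version}_{gtf_version}"


# Both callables accept a single argument, the wildcards object.
via_lambda = lambda wildcards: index_for(wildcards, "GRCm39", "GRCm39.107")
via_partial = partial(index_for, fa_version="GRCm39", gtf_version="GRCm39.107")

wc = SimpleNamespace(species="Mus_musculus")
print(via_lambda(wc))
print(via_partial(wc))
```

One practical difference: partial binds the extra values when it is created, whereas a lambda looks its free variables up when it is called, which matters if the lambdas are built inside a loop.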

Answer 2

For future readers: I've stumbled upon this post: "How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule?". Also see https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#priorities

I should prioritize the rule starIndex within my snakemake pipeline. This wasn't a STAR issue after all.
