
Looping over pairs of files

Hello, I need to iterate over pairs of files and do something with them.

For example, I have 4 files named AA2234_1.fastq.gz, AA2234_2.fastq.gz, AA3945_1.fastq.gz and AA3945_2.fastq.gz.

As you can probably tell, the pairs are AA2234_1.fastq.gz <-> AA2234_2.fastq.gz and AA3945_1.fastq.gz <-> AA3945_2.fastq.gz (they share the name before the _ sign).

I have a command whose syntax looks like this:

initialize_of_command file1 file2 output_a output_b output_c output_d parameteres

I want this script to count the files with the fastq.gz extension in a directory, divide that count by 2 to get the number of pairs, match the pairs together (probably with a regex, maybe into two variables), and execute this command once for each pair.

I have no idea how to pair up those files using a regex, or how to iterate over the pairs so that the script knows which pairs it has already processed.

Here is my unfinished script:

#!/bin/bash
raw_count_of_files=$(ls | grep -c "fastq.gz")
count_of_files=$((raw_count_of_files / 2))

for ((i=1;i<=count_of_files;i++));
do
java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 AA2234_1.fastq.gz AA2234_2.fastq.gz AA2234_forward_paired.fq.gz AA2234_forward_unpaired.fq.gz AA2234_reverse_paired.fq.gz AA2234_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

I would also like the output files to be named after the shared name of the input files, which in this case is AA2234 and AA3945.

The desired output of this script is 8 files, named according to the pairs:

AA2234_forward_paired.fq.gz 
AA2234_forward_unpaired.fq.gz 
AA2234_reverse_paired.fq.gz 
AA2234_reverse_unpaired.fq.gz

and

AA3945_forward_paired.fq.gz 
AA3945_forward_unpaired.fq.gz 
AA3945_reverse_paired.fq.gz 
AA3945_reverse_unpaired.fq.gz

Assuming the filenames do not contain whitespace, would you please try:

#!/bin/bash

declare -A hash                         # associative array to tie basename with files
for f in *fastq.gz; do                  # search the files with the suffix
    base=${f%_*}                        # remove after "_"
    if [[ -z ${hash[$base]} ]]; then    # if the variable is not defined
        hash[$base]=$f                  # then store the filename
    else
        hash[$base]+=" $f"              # else append the filename, delimited by a space
    fi
done

for base in "${!hash[@]}"; do           # loop over the hash keys (basename)
    read -r f1 f2 <<< "${hash[$base]}"  # split into filenames

    echo java -jar /home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar PE -phred33 "$f1" "$f2" "$base"_forward_paired.fq.gz "$base"_forward_unpaired.fq.gz "$base"_reverse_paired.fq.gz "$base"_reverse_unpaired.fq.gz SLIDINGWINDOW:4:20 MINLEN:20;
done

The script prints the java command lines as a dry run. If the output looks good, drop echo and run it again.
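
The grouping in that script hinges on the `${f%_*}` expansion, which removes the shortest suffix matching `_*`, that is, everything from the last underscore onward. As a minimal sketch (the helper name `pair_base` is made up for illustration and is not part of the answer above), this is what it yields for the sample filenames:

```shell
#!/bin/sh

# pair_base is a hypothetical helper used only for this illustration.
# It prints the part of a filename that precedes the last underscore.
pair_base() {
    # "%" strips the shortest suffix matching the pattern "_*"
    printf '%s\n' "${1%_*}"
}

pair_base AA2234_1.fastq.gz    # prints AA2234
pair_base AA3945_2.fastq.gz    # prints AA3945
```

Because only the last underscore matters, a name such as sample_A_1.fastq.gz would be grouped under sample_A; using `%%` instead would cut at the first underscore and give sample.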

One way to iterate over pairs of arguments:

#!/usr/bin/env sh

proc_fastq_pairs() {
  # loop while there are fastq files passed as argument
  while [ $# -gt 0 ]; do
    fq1=$1
    # consume 1 argument as file 1
    shift
    fq2=$1
    # consume 1 argument as file 2
    shift
    initialize_of_command "$fq1" "$fq2" output_a output_b output_c output_d parameteres
  done
}

initialize_of_command() {
  # dummy command to show passed arguments for debug purpose
  printf 'initialize_of_command %s\n' "$*"
}

# Expansion of the globbing pattern ./*.fastq.gz
# is sorted according to the current locale's collation order.
# This keeps similarly named files
# together: fq1 fq2 ...
proc_fastq_pairs ./*.fastq.gz
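
Consuming the arguments two at a time relies on every sample having both files present; a single missing file would shift all later pairs. A possible safeguard, sketched here with a hypothetical helper named `check_fastq_pairs` (my own addition, not part of the script above), is to verify the pairing before processing:

```shell
#!/usr/bin/env sh

# check_fastq_pairs is a hypothetical helper: it returns non-zero when the
# arguments cannot be split into consecutive pairs that share the text
# before the last "_".
check_fastq_pairs() {
  if [ $(( $# % 2 )) -ne 0 ]; then
    printf 'odd number of files: %s\n' "$#" >&2
    return 1
  fi
  while [ $# -gt 0 ]; do
    if [ "${1%_*}" != "${2%_*}" ]; then
      printf 'not a pair: %s %s\n' "$1" "$2" >&2
      return 1
    fi
    # consume both files of the pair
    shift 2
  done
}

check_fastq_pairs ./AA2234_1.fastq.gz ./AA2234_2.fastq.gz && echo 'pairs look consistent'
```

In the script above it could be called as `check_fastq_pairs ./*.fastq.gz || exit 1` just before `proc_fastq_pairs`.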

Alternatively, with xargs:

printf '%s\n' ./*.fastq.gz | xargs -L 2 bash -c 'initialize_of_command "$1" "$2" output_a output_b output_c output_d parameteres' _

#!/bin/bash

declare -A assoc=()
shopt -s nullglob

for f in *_?.fastq.gz; do
    base=${f%_*}
    assoc[$base]=${assoc[$base]-}${assoc[$base]+ }$f
done

set -f

for pair in "${assoc[@]}"; do
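    set -- $pair                        # unquoted on purpose: split the pair into $1 and $2
    if (( $# != 2 )); then              # skip prefixes without exactly two files
        printf 'skipping incomplete pair: %s\n' "$pair" >&2
        continue
    fi
    echo "$1" "$2"                      # dry run: replace with the real command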
done
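
The assignment inside the first loop packs two parameter expansions together: `${var-}` expands to the current value (or to nothing if the variable is unset), and `${var+ }` expands to a single space only once the variable is set. A small sketch with a plain variable instead of the associative array (the variable name `list` is just for illustration) shows the effect:

```shell
#!/bin/sh

unset list
for f in AA2234_1.fastq.gz AA2234_2.fastq.gz; do
    # first pass:  list is unset, so both expansions are empty
    # later passes: old value, then one space, then the new name
    list=${list-}${list+ }$f
done

printf '%s\n' "$list"    # prints: AA2234_1.fastq.gz AA2234_2.fastq.gz
```

This builds a space-separated list without a leading space, which is why it only works for filenames that contain no whitespace.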

If there are exactly two files for each prefix (the prefix is the portion of the filename before the _), then this job can be accomplished with a simple for loop, without resorting to arrays:

#!/bin/bash

jarfile='/home/aa/git/trimmomatic/src/Trimmomatic/trimmomatic-0.39.jar'

prefix=
for file in *_*.fastq.gz; do
    if [[ $prefix ]]; then
        echo java -jar "$jarfile" PE -phred33 \
            "$first" "$file" "$prefix"_{forward,reverse}_{,un}paired.fq.gz \
            'SLIDINGWINDOW:4:20' 'MINLEN:20'
        prefix=
    else
        first=$file
        prefix=${file%%_*}
    fi
done

Drop the echo if the printed command looks good.
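
The four output names in that command come from bash brace expansion. As a quick check of what `"$prefix"_{forward,reverse}_{,un}paired.fq.gz` turns into (the array name `outputs` is only for this illustration, and the snippet needs bash rather than plain sh):

```shell
#!/bin/bash

prefix=AA2234

# Each brace group multiplies the word: 2 directions x 2 states = 4 names.
outputs=( "$prefix"_{forward,reverse}_{,un}paired.fq.gz )

printf '%s\n' "${outputs[@]}"
# AA2234_forward_paired.fq.gz
# AA2234_forward_unpaired.fq.gz
# AA2234_reverse_paired.fq.gz
# AA2234_reverse_unpaired.fq.gz
```

The names come out in the same order as the output files listed in the question.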

