
Using awk to print header name and a substring

I tried using this code to print a gene name as a header and then pull a substring based on its location, but it doesn't work:

>output_file
cat input_file | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' tmp`
        startPos=`awk '{print $2}' tmp`
        endPOs=`awk '{print $3}' tmp`
                for i in temp; do
                echo ">${geneName}" >> genes_fasta ;
                echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
        done
done

input_file

nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548

fasta

atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............

And this is my incorrect output file:

>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta

The output should look like this:

>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

You can do this with a single call to awk, which will be orders of magnitude more efficient than looping in a shell script and calling awk 4 times per iteration. Since you have bash, you can use command substitution to read the contents of fasta into an awk variable, and then output the heading followed by the substring that runs from the beginning position through the ending position of your fasta sequence.

For example:

awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

Or, using getline within the BEGIN rule:

awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
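Note that a single getline in the BEGIN rule reads only the first line of the file, so the command above relies on the sequence sitting on one unwrapped line. As a minimal sketch, assuming a FASTA file that has a ">" header line and a sequence wrapped over several lines, the BEGIN rule could loop over getline instead. The file names, gene names, and coordinates below are hypothetical sample data, not taken from the question:

```shell
#!/usr/bin/env bash
# Hypothetical sample data, created in a temporary directory for illustration.
workdir=$(mktemp -d)
printf '>demo_sequence\natgcatgc\natgcatgc\n' > "$workdir/wrapped.fasta"
printf 'geneA 1 6\ngeneB 7 12\n' > "$workdir/regions.txt"

result=$(awk -v fa="$workdir/wrapped.fasta" '
BEGIN {
    # Read every line of the FASTA file, skip ">" header lines,
    # and join the remaining lines into one sequence string.
    while ((getline line < fa) > 0)
        if (line !~ /^>/) seq = seq line
    close(fa)
}
# For each region: print the header, then the 1-based inclusive range.
{ print ">" $1; print substr(seq, $2, $3 - $2 + 1) }' "$workdir/regions.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

With this sample data the joined sequence is 16 characters long, and the two regions print as atgcat and gcatgc under their headers.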

Example Input Files

Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:

$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79

And the first 129 characters of your example fasta:

$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc

Example Use/Output

$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg

Look things over and let me know if I understood the requirements of your question. Also let me know if you have further questions about the solution.

If I'm understanding correctly, how about:

awk 'NR==FNR {fasta = fasta $0; next}
    {
        # Header on its own line, then the sequence, as in the expected output
        printf(">%s\n%s\n", $1, substr(fasta, $2, $3 - $2 + 1))
    }' fasta input_file > genes_fasta

  • It first reads the fasta file and stores the sequence in the variable fasta.
  • Then it reads input_file line by line and extracts the substring of fasta starting at $2 with length $3 - $2 + 1. (Note that the third argument to the substr function is a length, not an end position.)
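To make the length-versus-end-position point concrete, here is a small sketch on an illustrative ten-letter string (not data from the question). The variable names from and to are placeholders chosen for this example:

```shell
# Illustrative string only; awk string positions are 1-based.
demo=$(awk 'BEGIN {
    s = "abcdefghij"
    from = 3; to = 6
    print substr(s, from, to)             # treats 6 as a length: 6 characters
    print substr(s, from, to - from + 1)  # positions 3 through 6 inclusive
}')
printf '%s\n' "$demo"
```

The first call returns cdefgh because 6 is taken as a length; the second returns cdef, which is the inclusive range 3 through 6.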

Hope this helps.

Made it work! This is the script for pulling substrings from a fasta file:

cat genes_and_bounderies1 | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' temp`
        startPos=`awk '{print $2}' temp`
        endPos=`awk '{print $3}' temp`
        length=$(expr $endPos - $startPos)
                for i in temp; do
                echo ">${geneName}" >> genes_fasta
                awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
        done
done
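For comparison, a tidier variant of the same loop might look like the sketch below. It assumes the end coordinates are inclusive, as both answers above do, so the length is computed as end - start + 1 rather than end - start. It also lets read split the three fields directly instead of going through a temp file. The file names and sample coordinates are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical sample data, created in a temporary directory for illustration.
workdir=$(mktemp -d)
printf 'atgcatgcatgcatgc\n' > "$workdir/unwrapped.fasta"
printf 'geneA 2 5\ngeneB 9 16\n' > "$workdir/genes_and_boundaries"

# read splits each row into its three whitespace-separated fields.
while read -r geneName startPos endPos; do
    # Inclusive 1-based coordinates (an assumption): length = end - start + 1.
    length=$(( endPos - startPos + 1 ))
    printf '>%s\n' "$geneName"
    awk -v S="$startPos" -v L="$length" '{print substr($0, S, L)}' "$workdir/unwrapped.fasta"
done < "$workdir/genes_and_boundaries" > "$workdir/genes_fasta"

loop_output=$(cat "$workdir/genes_fasta")
printf '%s\n' "$loop_output"
rm -rf "$workdir"
```

This still starts one awk process per gene, so for large coordinate lists the single-awk approaches above are likely to be faster.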
