Specific sorting into categories with awk and bash
Dear all, I have one question.
I have input like this (the second column is only an index):
chr1 1 30
chr1 2 40.5
chr1 3 30.5
chr1 4 41
chr2 10 60
chr2 15 40.1
And I want to get this:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
...
40 - 41 1 1 etc..
I need to categorize the data into bins of width 1, from 30 to 60. For each chromosome, I want to count the rows whose $3 value falls into each bin (for example, all chr1 rows in the 30-31 bin).
I have this code, but I do not understand where the problem is (something is wrong with the loop):
samtools view /home/filip/Desktop/AMrtin\ Hynek/54321Odfiltrovany.bam | awk '{ n=length($10); print $3,"\t",NR,"\t", gsub(/[GCCgcs]/,"",$10)/n;}' | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}' > data.txt
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
export $j
awk -v sop=$j '{if($1 == $sop) print $0}' data.txt |
awk '{d=int($3)
a[d]++
if (NR==1) {min=d}
min=(min>=d?d:min)
max=(max>d?max:d)}
END{for (i=min; i<=max; i++) print i, "-", i+1, a[i]+0}' ;
done
I wrote part of the code with help from "fedorqui".
First, you could use:
for j in {1..22}; do
chrj="chr$j"
# now you could use $chrj instead of $j in this loop
done
Instead of:
for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do
# ...
done
Then, you don't need to chain multiple awk calls and pipes. A single awk should be enough.
For example:
... | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}'
Should be:
awk '($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}'
# or
awk '{if ($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}}'
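One detail in the commands above: print $1,"\t",$2 separates its arguments with the output field separator, so each tab ends up with a space on either side. The following is a minimal sketch, assuming plain tab-separated output is what is wanted; the function name filter_gc and the sample rows are made up for illustration.

```shell
# Hypothetical helper: keep rows whose third column lies in [0.3, 0.6]
# and print them tab-separated with the value scaled to percent.
filter_gc() {
    awk 'BEGIN { OFS = "\t" }        # tab between printed fields, no extra spaces
         $3 >= 0.3 && $3 <= 0.6 { print $1, $2, $3 * 100 }'
}

# Made-up sample rows: name, index, fraction.
printf '%s\n' 'chr1 1 0.305' 'chr1 2 0.7' 'chr2 3 0.6' | filter_gc
```

With OFS set once in BEGIN, the print statement only lists the fields, and the separator can be changed in one place.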
Also:
export $j
What is the purpose of this export?
I haven't read everything in your code, but at this point it needs many optimizations!
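As for the loop itself, one likely cause of the empty output is the test $1 == $sop. Inside awk, a variable passed with -v is written without the dollar sign; $sop asks for the field whose number is the value of sop, and since "chr1" converts to 0, the test compares $1 with the whole line. Here is a minimal sketch of a repaired loop, assuming the three-column layout from the question. The helper name count_bins and the inline sample data are only for illustration.

```shell
# Sample rows from the question: chromosome, index, value.
sample_data='chr1 1 30
chr1 2 40.5
chr1 3 30.5
chr1 4 41
chr2 10 60
chr2 15 40.1'

# Hypothetical helper: bin the third column of stdin for one chromosome.
count_bins() {
    awk -v sop="$1" '
        $1 == sop {                  # sop, not $sop: a plain awk variable
            d = int($3)
            a[d]++
            if (!seen++) { min = d; max = d }
            if (d < min) min = d
            if (d > max) max = d
        }
        END {
            if (!seen) exit          # chromosome absent: print nothing
            print sop
            for (i = min; i <= max; i++) print i, "-", i + 1, a[i] + 0
        }'
}

j=1
while [ "$j" -le 22 ]; do
    printf '%s\n' "$sample_data" | count_bins "chr$j"
    j=$((j + 1))
done
```

This still reads the data once per chromosome, so it only repairs the original approach; a single awk pass over the file avoids the loop entirely.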
Using awk:
awk '
!($1 in chrs) { chr[++c] = $1 ; chrs[$1]++ }
{
    val = int($3);
    map[$1,val]++;
    min = (NR==1?val:min>=val?val:min);
    max = (max>val?max:val)
}
END {
    printf "\t\t"
    for (j=1; j<=c; j++) {
        printf "%s%s", sep, chr[j]
        sep = "\t"
    }
    print ""
    for (i=min; i<=max; i++) {
        printf "%d - %d\t", i, i+1
        for (j=1; j<=c; j++) {
            printf "\t%s", map[chr[j],i] + 0
        }
        print ""
    }
}' file
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
...
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
...
59 - 60 0 0
60 - 61 0 1
The chr array keeps the chromosomes in the order in which they are first seen.
Apart from the map array, which is indexed by chromosome and range and holds the counts as its values, most of the rest is your code.
In the END block we first iterate over the chr array and print the chromosomes.
Using the min and max variables, we then loop and print the values from the map array, which is indexed by chromosome and range.
This prints every range starting at min and ending at max.
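The two-part subscript map[$1,val] uses awk's simulated multidimensional arrays: the parts are joined with the built-in SUBSEP string into a single key, and a pair that was never counted reads back as empty, which is why the script adds 0 before printing. The following is a small sketch of that lookup in isolation; the helper name lookup_count and the sample rows are assumptions for illustration.

```shell
# Hypothetical helper: count rows per (chromosome, bin) pair from stdin,
# then report the count for the pair given as arguments.
lookup_count() {
    awk -v chr="$1" -v bin="$2" '
        { map[$1, int($3)]++ }            # key is $1 SUBSEP int($3)
        END { print map[chr, bin] + 0 }   # unseen pair is empty, so + 0 gives 0
    '
}

printf '%s\n' 'chr1 1 30' 'chr1 3 30.5' 'chr2 15 40.1' | lookup_count chr1 30
printf '%s\n' 'chr1 1 30' 'chr1 3 30.5' 'chr2 15 40.1' | lookup_count chr2 30
```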
If you are using gawk, this should work. There's a filter on $1 that should handle everything you were doing with $j (unless you truly need only chr1..chr22, in which case it should still be possible to develop a regex for it).
BEGIN {
    for (i = 30; i <= 60; i++) {
        rstring = i " - " (i + 1);
        rows[rstring] = 0;
    }
}
$1 ~ /^chr[0-9][0-9]?$/ {
    row = int($3) " - " (int($3) + 1);
    columns[$1] = 0;
    rows[row] = 0;
    data[row][$1] += 1;
    rowwidth = length(row) > rowwidth ? length(row) : rowwidth;
    colwidth = length($1) > colwidth ? length($1) : colwidth;
}
END {
    rowheader = "%-" (rowwidth * 2) "s";
    colheader = "%" colwidth "s\t";
    dataformat = "%" int(colwidth / 2) "d\t";
    # asorti returns the element count; index loops keep the sorted order,
    # which "for (c in array)" does not guarantee
    ncols = asorti(columns, sortedcolumns);
    nrows = asorti(rows, sortedrows);
    printf rowheader, "";
    for (c = 1; c <= ncols; c++) printf "%s\t", sortedcolumns[c];
    print "";
    for (r = 1; r <= nrows; r++) {
        printf rowheader, sortedrows[r];
        for (c = 1; c <= ncols; c++)
            printf dataformat, data[sortedrows[r]][sortedcolumns[c]];
        print ""
    }
}
Running it with gawk -f [scriptfile from above] < data.txt should produce something like:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
. . .
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
. . .
59 - 60 0 0
60 - 61 0 1
The following can be used if you prefer Perl:
perl -ane '
    $h{$F[0]}{int $F[2]}++;
    push @range, int $F[2];
}{
    @range = sort { $a <=> $b } @range;    # numeric sort, so 9 precedes 10
    print "\t\t", join "\t", sort { $a cmp $b } keys %h; print "\n";
    for $i ($range[0] .. $range[-1]) {
        print "$i - ", $i + 1, "\t\t";
        print $h{$_}{$i} + 0, "\t" for sort { $a cmp $b } keys %h; print "\n"
    }' file
The output should look like this:
chr1 chr2
30 - 31 2 0
31 - 32 0 0
32 - 33 0 0
33 - 34 0 0
34 - 35 0 0
35 - 36 0 0
36 - 37 0 0
37 - 38 0 0
38 - 39 0 0
39 - 40 0 0
40 - 41 1 1
41 - 42 1 0
42 - 43 0 0
43 - 44 0 0
44 - 45 0 0
45 - 46 0 0
46 - 47 0 0
47 - 48 0 0
48 - 49 0 0
49 - 50 0 0
50 - 51 0 0
51 - 52 0 0
52 - 53 0 0
53 - 54 0 0
54 - 55 0 0
55 - 56 0 0
56 - 57 0 0
57 - 58 0 0
58 - 59 0 0
59 - 60 0 0