按awk和bash对类别进行特定排序

Question

Dear all i have one question. 亲爱的所有人，我有一个问题。

I have the input like : (second column is only index) 我有这样的输入：（第二列只是索引）

chr1 1 30

chr1 2 40.5

chr1 3 30.5

chr1 4 41

chr2 10 60

chr2 15 40.1

And i want to get this: 我想得到这个：

           chr1  chr2

30 - 31     2     0

31 - 32     0     0

...

40 - 41     1     1 etc..

I need categorize data to each group from 30 to 60 per 1. From the input data I count all rows for chr1 which are contain in in the category 30-31 from $3. 我需要将数据分类为每1组30到60。从输入数据中，我计算出chr1的所有行，这些行包含在30-31类中，价格从$ 3起。 I have this code, but I do not understand where is problem: (some problem with loop) 我有这段代码，但是我不明白问题出在哪里：（循环有问题）

samtools view /home/filip/Desktop/AMrtin\ Hynek/54321Odfiltrovany.bam | awk '{ n=length($10); print $3,"\t",NR,"\t", gsub(/[GCCgcs]/,"",$10)/n;}'  | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\t",$2,"\t",($3*100)}' > data.txt

for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
do

export $j

awk -v sop=$j  '{if($1 == $sop) print $0}' data.txt |
awk '{d=int($3)
      a[d]++
      if (NR==1) {min=d}
      min=(min>=d?d:min)
      max=(max>d?max:d)}
      END{for (i=min; i<=max; i++) print i, "-", i+1, a[i]+0}' ;
done

Part of code I made by help "fedorqui" 我通过帮助“ fedorqui”制作的部分代码

Answer 1

First, you could use : 首先，您可以使用：

for j in {1..22}; do
    chrj="char$j"
    # now you could use $chrj instead of $j in this loop
done

Instead of : 代替：

 for j in chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 do # ... done

Then, you don't need to multiply calls to awk and pipes. 然后，您无需awk和管道的调用倍增。 Only one awk should be enough. 仅一个awk就足够了。

For example : 例如：

 ... | awk '($3 <= 0.6 && $3 >= 0.3)' | awk '{print $1,"\\t",$2,"\\t",($3*100)}'

Should be : 应该：

awk '($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}'
# or
awk '{if ($3 <= 0.6 && $3 >= 0.3){print $1,"\t",$2,"\t",($3*100)}}'

Otherwise : 除此以外：

 export $j

What is the purpose of this export ? 这种export的目的是什么？

I haven't read everything on your code but at this point many optimizations must be done ! 我还没有阅读代码中的所有内容，但目前必须进行许多优化！

Answer 2

Using awk : 使用awk ：

awk '
!($1 in chrs) { chr[++c] = $1 ; chrs[$1]++ }
{
    val = int($3); 
    map[$1,val]++;
    min = (NR==1?val:min>=val?val:min);
    max = (max>val?max:val)
}
END {
    printf "\t\t"
    for (j=1; j<=c; j++) {
        printf "%s%s", sep, chr[j]
        sep = "\t"
    }
    print ""
    for (i=min; i<=max; i++) {
        printf "%d - %d\t", i, i+1
        for (j=1; j<=c; j++) {
            printf "\t%s", map[chr[j],i] + 0
        }
        print ""
    }
}' file
                chr1    chr2
30 - 31         2       0
31 - 32         0       0
32 - 33         0       0
...
38 - 39         0       0
39 - 40         0       0
40 - 41         1       1
41 - 42         1       0
42 - 43         0       0
...
59 - 60         0       0
60 - 61         0       1

You increment the chr array by the order of chromosome seen. 您按看到的染色体顺序增加chr数组。
Rest stuff in the main block is pretty much your code except that we also create a map array that is indexed at chromosome and range having counts as its value. 除了我们还会创建一个map数组，该map数组以其值作为计数的染色体和范围处的索引之外，其余大部分都是您的代码。
In the END block we first iterate over our chr array and print the chromosomes 在END块中，我们首先遍历chr数组并打印染色体
Then using our min and max variables we create a loop and print the values from our map array which is indexed at chromosome and the range. 然后使用我们的min和max变量创建一个循环，并从map数组中打印值，该值在染色体和范围处索引。
I have truncated some lines from the output. 我从输出中截断了一些行。 As you can see from the output it will print all numbers starting at min and ending at max . 从输出中可以看到，它将打印从min到max所有数字。

Answer 3

If you are using gawk, this should work. 如果您使用的是gawk，则应该可以使用。 There's a filter for $1 that should handle everything you were doing with $j (unless you truly need only chr1..chr22, in which case it should still be possible to develop a regex for it). 有一个用于$ 1的过滤器，该过滤器应该处理您使用$ j进行的所有操作（除非您确实只需要chr1..chr22，在这种情况下仍然应该可以为其开发正则表达式）。

BEGIN {
  for(i = 30; i <= 60; i++) {
    rstring = i " - " i + 1;
    rows[rstring] = 0;
  }
}
$1 ~ /^chr[0-9][0-9]?$/ {
  row = int($3) " - " int($3) + 1;
  columns[$1] = 0;
  rows[row] = 0;
  data[row][$1] += 1;
  rowwidth = length(row) > rowwidth ? length(row) : rowwidth;
  colwidth = length($1) > colwidth ? length($1) : colwidth;
 }
END {
  rowheader = "%-" (rowwidth * 2) "s";
  colheader = "%" colwidth "s\t";
  dataformat = "%" int(colwidth / 2) "d\t";

  asorti(columns, sortedcolumns);
  asorti(rows, sortedrows);

  printf rowheader, "";
  for(c in sortedcolumns) printf "%s\t", sortedcolumns[c];
  print "";

  for(r in sortedrows) {
    printf rowheader, sortedrows[r];
    for(c in sortedcolumns)
      printf dataformat, data[sortedrows[r]][sortedcolumns[c]];
    print ""
  }
}

Running it with gawk -f [scriptfile from above] < data.txt should produce something like: 使用gawk -f [scriptfile from above] < data.txt运行它会产生类似以下内容：

chr1 chr2 30 - 31 2 0 31 - 32 0 0 . . . 39 - 40 0 0 40 - 41 1 1 41 - 42 1 0 42 - 43 0 0 . . . 59 - 60 0 0 60 - 61 0 1

Answer 4

Following can be used if you want to use Perl 如果要使用Perl，可以使用以下内容

perl -ane '
$h{$F[0]}{int $F[2]}++;
push @range, int $F[2]; 
}{
@range = sort @range; 
print "\t\t", join "\t", sort { $a cmp $b } keys %h; print "\n";
for $i ($range[0] .. $range[-1]) {
print "$i - ", $i + 1, "\t\t"; 
print $h{$_}{$i} + 0, "\t" for sort { $a cmp $b } keys %h; print "\n"
}' file

Output should be like this 输出应该是这样的

                chr1    chr2
30 - 31         2       0
31 - 32         0       0
32 - 33         0       0
33 - 34         0       0
34 - 35         0       0
35 - 36         0       0
36 - 37         0       0
37 - 38         0       0
38 - 39         0       0
39 - 40         0       0
40 - 41         1       1
41 - 42         1       0
42 - 43         0       0
43 - 44         0       0
44 - 45         0       0
45 - 46         0       0
46 - 47         0       0
47 - 48         0       0
48 - 49         0       0
49 - 50         0       0
50 - 51         0       0
51 - 52         0       0
52 - 53         0       0
53 - 54         0       0
54 - 55         0       0
55 - 56         0       0
56 - 57         0       0
57 - 58         0       0
58 - 59         0       0
59 - 60         0       0

按awk和bash对类别进行特定排序

问题描述

4 个解决方案

解决方案1
3 2014-05-29 13:43:30

解决方案2
3 已采纳 2014-05-29 16:04:45

解决方案3
3 2014-05-29 16:24:12

解决方案4
2 2014-05-29 20:26:48

按awk和bash对类别进行特定排序

问题描述

4 个解决方案

解决方案1 3 2014-05-29 13:43:30

解决方案2 3 已采纳 2014-05-29 16:04:45

解决方案3 3 2014-05-29 16:24:12

解决方案4 2 2014-05-29 20:26:48

解决方案1
3 2014-05-29 13:43:30

解决方案2
3 已采纳 2014-05-29 16:04:45

解决方案3
3 2014-05-29 16:24:12

解决方案4
2 2014-05-29 20:26:48