
How to speed-up sed that uses Regex on very large single cell BAM file

I have the following simple script that tries to count the tag encoded with "CB:Z" in a SAM/BAM file:

samtools view -h small.bam |  grep "CB:Z:" |
    sed 's/.*CB:Z:\([ACGT]*\).*/\1/' |
    sort |
    uniq -c |
    awk '{print $2 " " $1}'

Typically it needs to process 40 million lines. That code takes around 1 hour to finish.

This line, sed 's/.*CB:Z:\([ACGT]*\).*/\1/', is very time consuming. How can I speed it up?

The reason I used a regex is that the column position of the "CB" tag is not fixed. Sometimes it's at column 20 and sometimes at column 21.

An example BAM file can be found HERE.


Update

Speed comparison on the complete 40-million-line file:

My initial code:

real    21m47.088s
user    26m51.148s
sys 1m27.912s

James Brown's with AWK:

real    1m28.898s
user    2m41.336s
sys 0m6.864s

James Brown's with MAWK:

real    1m10.642s
user    1m41.196s
sys 0m6.484s

Another awk, pretty much like @tripleee's, I'd assume:

$ samtools view -h small.bam | awk '
match($0,/CB:Z:[ACGT]*/) {               # use match for the regex match
    a[substr($0,RSTART+5,RLENGTH-5)]++   # len("CB:Z:")==5, hence the +/-5
}
END {
    for(i in a)
        print i,a[i]                     # sample output, tweak it to your liking
}'

Sample output:

...
TCTTAATCGTCC 175
GGGAAGGCCTAA 190
TCGGCCGATCGG 32
GACTTCCAAGCC 76
CCGCGGCATCGG 36
TAGCGATCGTGG 125
...

Notice: Your sed 's/.*CB:Z:... matches the last instance, whereas my awk 'match($0,/CB:Z:[ACGT]*/)... matches the first.
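A contrived illustration of that difference, using a hypothetical line carrying two tags:

$ printf 'x\tCB:Z:AAAA\ty\tCB:Z:CCCC\tz\n' | sed 's/.*CB:Z:\([ACGT]*\).*/\1/'
CCCC
$ printf 'x\tCB:Z:AAAA\ty\tCB:Z:CCCC\tz\n' | awk 'match($0,/CB:Z:[ACGT]*/){print substr($0,RSTART+5,RLENGTH-5)}'
AAAA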

Notice 2: Quoting @Sundeep in the comments: using LC_ALL=C mawk '..' will give even better speed.
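For instance, a minimal sketch of that invocation, assuming the awk program above is saved as script.awk (the same file name used in the timings further below):

samtools view -h small.bam |
    LC_ALL=C mawk -f script.awk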

With perl

perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'

CB:Z:\K[ACGT]++ will match any sequence of ACGT characters preceded by CB:Z:. \K is used here to prevent CB:Z: from being part of the matched portion, which is available via the $& variable.
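On an actual BAM, the same one-liner reads the decoded stream from samtools, e.g.:

samtools view -h small.bam |
    perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}'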


Sample times with the small.bam input file. mawk is fastest for this input, but that might change for a larger input file.

# script.awk is the one mentioned in James Brown's answer
# result here shown with GNU awk
$ time LC_ALL=C awk -f script.awk small.bam > f1
real    0m0.092s

# mawk is faster compared to GNU awk for this use case
$ time LC_ALL=C mawk -f script.awk small.bam > f2
real    0m0.054s

$ time perl -ne '$h{$&}++ if /CB:Z:\K[ACGT]++/; END{print "$_ $h{$_}\n" for keys %h}' small.bam > f3
real    0m0.064s

$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical

Better to avoid parsing the output of samtools view in the first place. Here's one way to get what you need using Python and the pysam library:

import pysam

from collections import defaultdict

counts = defaultdict(int)
tag = 'CB'

with pysam.AlignmentFile('small.bam') as sam:
    for aln in sam:
        if aln.has_tag(tag):                   # skip reads without the tag
            counts[ aln.get_tag(tag) ] += 1    # tally this barcode value

for k, v in counts.items():
    print(k, v)

Following your original pipeline approach:

pcre2grep -o 'CB:Z:\K[^\t]*' small.bam |
 awk '{++c[$0]} END {for (i in c) print i,c[i]}'
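If the input is an actual binary BAM rather than SAM text, presumably you would decode it first and keep the same approach, e.g.:

samtools view -h small.bam |
 pcre2grep -o 'CB:Z:\K[^\t]*' |
 awk '{++c[$0]} END {for (i in c) print i,c[i]}'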

In case you're interested in trying to speed up sed (although it's not likely to be the fastest):

sed 't a;s/CB:Z:/\n/;D;:a;s/\t/\n/;P;d' small.bam |
 awk '{++c[$0]} END {for (i in c) print i,c[i]}'

The above syntax is compatible with GNU sed.
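For reference, here is a sketch of how that sed program works, written as a commented GNU sed script (comments sit on their own lines, since branch targets like t a read to the end of the line). Save it as, say, extract_cb.sed (a hypothetical name) and run it with sed -f extract_cb.sed:

# on the first pass no substitution has happened yet, so t does not branch;
# after D restarts the cycle, the successful s/// below branches to :a
t a
# split the line at the first CB:Z:
s/CB:Z:/\n/
# no newline inserted (no tag): D deletes the line like d;
# otherwise D drops the prefix up to the newline and restarts the cycle
D
:a
# split off the barcode at the next tab
s/\t/\n/
# print only up to the newline, i.e. just the barcode
P
# discard the rest and read the next line
d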

Regarding the AWK-based solutions, I've noticed few taking advantage of FS.

I'm not too familiar with the BAM format. If CB only shows up once per line, then:

mawk/mawk2/gawk -b 'BEGIN { FS = "CB:Z:"; 

   } $2 ~ /^[ACGT]/ {     # if FS never matches, $2 would be beyond
                          # the end of the line, so this would just match
                          # against the null string, & eval to false

      seen[substr($2, 1, -1 + match($2, /[^ACGT]|$/))]++ 

   } END { for (x in seen) { print seen[x] " " x } }' 

If it shows up more than once, then change that to a loop over every field beyond the first, as sketched below. This version uses the laziest evaluation model possible to speed it up, and it also takes care of the whole uniq -c step.
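A hedged sketch of that multi-occurrence variant, reusing the same field-splitting idea (untested, so treat it as a starting point rather than a drop-in):

mawk -b 'BEGIN { FS = "CB:Z:" }
NF > 1 {                           # the tag occurs NF-1 times on this line
    for (i = 2; i <= NF; i++)      # every field past $1 starts right after a CB:Z:
        if ($i ~ /^[ACGT]/)
            seen[substr($i, 1, -1 + match($i, /[^ACGT]|$/))]++
}
END { for (x in seen) print seen[x] " " x }'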

While this is rather similar to the best answer above, by having FS pre-split the fields it causes match() and substr() to do a lot less work. I'm simply matching one single character after the genetic sequence, directly using its return value, minus 1, as the substring length, and skipping RSTART and RLENGTH altogether.

Regarding:

$ diff -sq <(sort f1) <(sort f2)
Files /dev/fd/63 and /dev/fd/62 are identical
$ diff -sq <(sort f1) <(sort f3)
Files /dev/fd/63 and /dev/fd/62 are identical

There's absolutely no need to write them out to disk and run a diff. Simply pipe the output of each to a very high-speed hashing algorithm that adds close to no time (and when the output is gigantic enough, you might even save time versus going to disk).

My personal favorite is xxhash in 128-bit mode, available via Python pip. It's NOT a cryptographic hash, but it's much faster than even something like MD5. This method also allows for hassle-free comparison, since benchmarking it also performs the accuracy check.
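A sketch of that idea with the xxhash pip package (assuming its xxh128 interface for 128-bit mode; the streams must be sorted the same way first, because a hash compares exact bytes):

$ sort f1 | python3 -c 'import sys, xxhash; print(xxhash.xxh128(sys.stdin.buffer.read()).hexdigest())'
$ sort f2 | python3 -c 'import sys, xxhash; print(xxhash.xxh128(sys.stdin.buffer.read()).hexdigest())'

Identical digests mean the sorted outputs are identical.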
