How to count consecutive (repeated) characters in a string in bash?

I am wondering if there is a simple bash or AWK one-liner to get the number of repeated characters, per repeat.

For example, consider this string:

AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA

Is it possible to get the number of Ns in the first repeat, the number of Ns in the second repeat, etc.?

Thanks!

Expected result: the length of each repeat, one per line.

You can use awk to split fields on every character that is not N, then print each non-empty field and its length:

s='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -F '[^N]+' '{for (i=1; i<=NF; i++) if ($i != "") print $i, length($i)}' <<< "$s"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7
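
Since the question asks only for the length of each repeat, one per line, the same field-splitting idea can be sketched as a small function that prints lengths only. The function name run_lengths_awk and its optional second argument (the character to look for, defaulting to N) are assumptions added for illustration. The character is pasted into a bracket expression, so this sketch is only meant for plain letters such as N.

```shell
#!/usr/bin/env bash

# run_lengths_awk STRING [CHAR]: hypothetical helper, not part of the answer.
# Prints the length of every run of CHAR (default N) in STRING, one per line.
run_lengths_awk() {
    local s=$1 ch=${2:-N}
    # Every stretch of characters other than CHAR acts as a field separator,
    # so each non-empty field is exactly one run of CHAR.
    awk -F "[^${ch}]+" '{
        for (i = 1; i <= NF; i++)
            if ($i != "") print length($i)
    }' <<< "$s"
}

run_lengths_awk 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

For the sample string this should print 5, 8 and 7 on separate lines.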

Another option is to use grep + awk:

grep -Eo 'N+' <<< "$s" | awk '{print $1, length($1)}'

And here is a pure BASH solution:

shopt -s extglob
while read -r line; do
    [[ -n $line ]] && echo "$line ${#line}"
done <<< "${s//+([!N])/$'\n'}"

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

BASH solution details:

  1. It uses the extended glob pattern +([!N]) to match one or more non-N characters and replaces each match with a line break in ${s//+([!N])/$'\n'}
  2. Using a while loop, we iterate through each substring of N characters
  3. Inside the loop, we print each non-empty string and the length of that string.
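
The three steps above can be wrapped in a function, which makes it easier to see where each step happens. This is only a sketch: the name run_lengths_bash is hypothetical, and the character N is hard-coded exactly as in the answer.

```shell
#!/usr/bin/env bash
shopt -s extglob   # required for the +( ... ) extended glob pattern

# run_lengths_bash STRING: hypothetical helper name, not part of the answer.
run_lengths_bash() {
    local s=$1 line
    # Step 1 happens in the here-string: every run of non-N characters is
    # replaced by a single newline, leaving one run of Ns per line.
    # Steps 2 and 3: read the lines back and print each non-empty one
    # together with its length.
    while read -r line; do
        if [[ -n $line ]]; then
            printf '%s %d\n' "$line" "${#line}"
        fi
    done <<< "${s//+([!N])/$'\n'}"
}

run_lengths_bash 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

The empty lines skipped by the test come from the non-N text at the start and end of the string, which also turns into newlines.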

A simple solution:

echo "$string" | grep -oE "N+" | awk '{ print $0, length}'

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

EDIT:
As per the suggestion of @Ed-Morton: changed -P to -E.
The grep man page says -P is "highly experimental" functionality.
We don't need PCREs to use +; EREs are sufficient.

With GNU awk for multi-char RS:

$ awk -v RS='N+' 'RT{print length(RT)}' file
5
8
7

$ awk -v RS='N+' 'RT{print RT, length(RT)}' file
NNNNN 5
NNNNNNNN 8
NNNNNNN 7

Here's a Perl one-liner:

perl -ne 'while (m/(.)(\1*)/g) { printf "%5i %s\n", length($2)+1, $1 }' <<<AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA
2 A
1 T
1 G
1 A
1 T
2 G
2 A
5 N
1 G
1 A
1 T
1 A
1 G
2 A
1 C
1 G
1 A
1 T
8 N
1 G
1 A
1 T
2 A
1 T
1 G
1 A
7 N
1 T
1 A
1 G
1 A
1 C
1 T
1 G
1 A

The m/(.)(\1*)/ successively matches as many identical characters as possible, and the /g flag makes the next iteration resume where the previous match ended, for as long as the string still contains something we have not yet matched. So we loop over the string in chunks of identical characters and, on each iteration, print the first character as well as the length of the entire matched string.

The first pair of parentheses captures a character at the beginning of the (remaining unmatched) line, and \1 says to repeat this character. The * quantifier matches this as many times as possible.

If you are interested in just the Ns, you could change the first parenthesis to (N), or you could add a conditional like printf("%7i %s\n", length($2)+1, $1) if ($1 == "N"). Similarly, if you want only hits where there are repeats (more than one occurrence), you can say \1+ instead of \1* or add a conditional like ... if length($2) >= 1.
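
For comparison, the same chunk-by-chunk idea can be sketched in pure bash without a regex: walk the string one character at a time and print a count whenever the character changes. The function name rle is hypothetical, and the loop is only a rough equivalent of the Perl one-liner, not a replacement for it.

```shell
#!/usr/bin/env bash

# rle STRING: hypothetical helper that prints "count character" for every
# run of identical characters, in the same layout as the Perl one-liner.
rle() {
    local s=$1 i ch prev='' count=0
    for ((i = 0; i < ${#s}; i++)); do
        ch=${s:i:1}
        if [[ $ch == "$prev" ]]; then
            # Same character as before: the current run grows by one.
            count=$((count + 1))
        else
            # A new run starts: report the previous one, if there was one.
            if ((count > 0)); then
                printf '%5d %s\n' "$count" "$prev"
            fi
            prev=$ch
            count=1
        fi
    done
    # Report the final run, which the loop never gets to close.
    if ((count > 0)); then
        printf '%5d %s\n' "$count" "$prev"
    fi
}

rle 'AANNNNNG'
```

Filtering for a single character, as described above for Perl, would amount to printing only when "$prev" equals N.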

As you asked for a sed solution, you can use this one, provided that no run of repeated characters is longer than 9 characters and your string does not contain any semicolon:

sed 's/$/;NNNNNNNNN0123456789/;:a;s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/;ta;s/[^\n]*\n//'
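
The one-liner is dense, so here is the same script spread over several lines with comments, as a sketch of how the lookup table works. It assumes GNU sed (for \+ and for \n in the replacement), and the wrapper name sed_run_lengths is hypothetical.

```shell
#!/usr/bin/env bash

# sed_run_lengths STRING: hypothetical wrapper around the sed script above.
# Assumes GNU sed, runs of at most 9 Ns, and no semicolon in STRING.
sed_run_lengths() {
    printf '%s\n' "$1" | sed '
        # Append a lookup table: nine Ns followed by the digits 0 to 9.
        s/$/;NNNNNNNNN0123456789/
        :a
        # Cut the first run of Ns out of the text. Matching the same run again
        # inside the table and skipping 9 more characters lands on the digit
        # equal to the run length; that digit is appended after a newline.
        s/\(N\+\)\([^;]*;\1.\{9\}\)\(.\)\(.*\)/\2\3\4\n\3/
        # Repeat for as long as a run was found.
        ta
        # Drop the first line (leftover text plus table), keeping the digits.
        s/[^\n]*\n//
    '
}

sed_run_lengths 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

The 9-character limit follows from the table: a longer run cannot be matched in full against the nine Ns, so it would be split and miscounted. If the input contains no N at all, the final substitution finds no newline and the line is printed together with the table.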

Try these two:

First one:

sed 's/[^N]/ /g' file | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Second one:

cat file | tr -c 'N' ' ' | awk '{for(i=1;i<=NF;i++){print $i":"length($i)}}'

Short GNU awk approach:

str='AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'

awk -v FPAT='N+' '{for(i=1;i<=NF;i++) print $i,length($i)}' <<< "$str"

The output:

NNNNN 5
NNNNNNNN 8
NNNNNNN 7

You could use the regular expression method.

This is solution code I got from the following link:

Count occurrences of a char in a string using Bash

needle=","
var="text,text,text,text"

number_of_occurrences=$(grep -o "$needle" <<< "$var" | wc -l)

As you can see, we get the number of occurrences of "$needle" pretty easily with the help of wc -l (line count), because grep -o prints each match on its own line.

You can loop over it to suit your needs.
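
One way such a loop might look, as a sketch only: an outer grep splits the string into runs, and the grep -o | wc -l count from above is applied to each run in turn. The function name count_per_run is hypothetical, and the needle is assumed to be a single ordinary character, since it is used as a regular expression.

```shell
#!/usr/bin/env bash

# count_per_run NEEDLE STRING: hypothetical helper, not from the linked answer.
count_per_run() {
    local needle=$1 var=$2 run n
    # Outer grep: one output line per run of consecutive needles.
    grep -oE "${needle}+" <<< "$var" | while read -r run; do
        # Inner grep + wc -l: the occurrence count, applied to a single run.
        n=$(grep -o "$needle" <<< "$run" | wc -l)
        # Arithmetic expansion drops any padding some wc versions add.
        echo "$((n))"
    done
}

count_per_run N 'AATGATGGAANNNNNGATAGAACGATNNNNNNNNGATAATGANNNNNNNTAGACTGA'
```

This starts two extra processes per run, so the single-pipeline answers above are the better choice for long inputs.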
