
Count the number of occurrences of a substring in a string

How can I count the number of occurrences of a substring in a string using Bash?

EXAMPLE:

I'd like to know how many times this substring...

Bluetooth
         Soft blocked: no
         Hard blocked: no

...occurs in this string...

0: asus-wlan: Wireless LAN
         Soft blocked: no
         Hard blocked: no
1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
2: phy0: Wireless LAN
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no

NOTE I: I have tried several approaches with sed, grep, awk... Nothing seems to work when we have strings with spaces and multiple lines.

NOTE II: I'm a Linux user and I'm looking for a solution that does not involve installing applications/tools beyond those usually found in Linux distributions.


IMPORTANT:

In addition to my question, would it be possible to have something along the lines of the hypothetical example below? In this case, instead of files, we use two shell (Bash) variables.

EXAMPLE: (based on @Ed Morton contribution)

STRING="0: asus-wlan: Wireless LAN
         Soft blocked: no
         Hard blocked: no
1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
2: phy0: Wireless LAN
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no"

SUB_STRING="Bluetooth
         Soft blocked: no
         Hard blocked: no"

awk -v RS='\0' 'NR==FNR{str=$0; next} {print gsub(str,"")}' "$STRING" "$SUB_STRING"

Using GNU awk:

$ awk '
BEGIN { RS="[0-9]+:" }      # number followed by colon is the record separator
NR==1 {                     # read the substring to b
    b=$0
    next
}
$0~b { c++ }                # if b matches current record, increment counter
END { print c }             # print counter value
' substringfile stringfile
2

This solution requires the white space in the substring to match the string exactly, so your example would not work as-is because the substring is indented with fewer spaces than the string. Also notice that, because of the chosen RS, matching a substring that contains something like phy0: is not possible, since the 0: is consumed as a record separator; in that case something like RS="(^|\\n)[0-9]+:" would probably work.

Another:

$ awk '
BEGIN{ RS="^$" }                           # treat whole files as one record
NR==1 { b=$0; next }                       # buffer substringfile
{
    while(match($0,b)) {                   # count matches of b in stringfile
        $0=substr($0,RSTART+RLENGTH)       # continue right after the match
        c++
    }
}
END { print c }                            # output
' substringfile stringfile

Edit : Sure, remove the BEGIN section and use Bash's process substitution like below:

$ awk '
NR==1 { 
    b=$0
    gsub(/^ +| +$/,"",b)                 # clean surrounding space from substring
    next 
}
{
    while(match($0,b)) {
        $0=substr($0,RSTART+RLENGTH)     # continue right after the match
        c++
    }
}
END { print c }
' <(echo $SUB_STRING) <(echo $STRING)    # feed it with process substitution
2

Echoing the unquoted variables in the process substitution flattens the data onto a single line and collapses repeated spaces too:

$ echo $SUB_STRING
Bluetooth Soft blocked: no Hard blocked: no

so the space problem should ease up a bit.
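
The flattening comes from the shell's word splitting of the unquoted expansion, not from echo itself. The following is only a small sketch to show the difference, reusing the SUB_STRING value from the question; the names flat and kept are made up for illustration. Keep in mind that an unquoted expansion would also expand glob characters such as * if the data contained any.

```shell
SUB_STRING="Bluetooth
         Soft blocked: no
         Hard blocked: no"

flat=$(echo $SUB_STRING)      # unquoted: split into words, rejoined with single spaces
kept=$(echo "$SUB_STRING")    # quoted: newlines and indentation are preserved

printf '%s\n' "$flat"         # one line
printf '%s\n' "$kept"         # three lines, indentation intact
```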

Edit2 : Based on @EdMorton's hawk-eyed observation in the comments:

$ awk '
NR==1 { 
    b=$0
    gsub(/^ +| +$/,"",b)                 # clean surrounding space from substring
    next 
}
{ print gsub(b,"") }
' <(echo $SUB_STRING) <(echo $STRING)    # feed it with process substitution
2

Update given your comments below, if the white space is the same in both strings:

awk 'BEGIN{print gsub(ARGV[2],"",ARGV[1])}' "$STRING" "$SUB_STRING"

or if the white space is different as in your example where the STRING lines start with 9 blanks but SUB_STRING with 8:

$ awk 'BEGIN{gsub(/[[:space:]]+/,"[[:space:]]+",ARGV[2]); print gsub(ARGV[2],"",ARGV[1])}' "$STRING" "$SUB_STRING"

Original answer:

With GNU awk if your white-space matched between files and the search string doesn't contain RE metachars all you'd need is:

awk -v RS='^$' 'NR==FNR{str=$0; next} {print gsub(str,"")}' str file

or with any awk if your input also doesn't contain NUL chars:

awk -v RS='\0' 'NR==FNR{str=$0; next} {print gsub(str,"")}' str file

but for a full solution with explanations, read on:

With any POSIX awk in any shell on any UNIX box:

$ cat str
Bluetooth
        Soft blocked: no
        Hard blocked: no

$ awk '
NR==FNR { str=(str=="" ? "" : str ORS) $0; next }
{ rec=(rec=="" ? "" : rec ORS) $0 }
END {
    gsub(/[^[:space:]]/,"[&]",str) # make sure each non-space char is treated as literal
    gsub(/[[:space:]]+/,"[[:space:]]+",str) # make sure space differences do not matter
    print gsub(str,"",rec)
}
' str file
2

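With a non-POSIX awk that lacks character classes (such as older versions of nawk), just spell out the characters, e.g. [ \t\n] instead of [[:space:]] and [^ \t\n] instead of [^[:space:]]. If your search string can contain backslashes then we'd need to add 1 more gsub() to handle them.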
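
If the regex escaping feels fragile, one possible alternative is to avoid regular expressions for the comparison altogether: collapse the white space in both strings and count literal occurrences with index(). The sketch below assumes the data is in two shell variables, as in the question. The names count_squeezed, squeeze, hay and needle are invented here and are not part of the answer above. It differs slightly from the regex version in that leading and trailing white space of the substring is ignored.

```shell
# count_squeezed is a hypothetical helper name, not from the answer above.
count_squeezed() {
    # $1 = string to search, $2 = substring. Both are passed through the
    # environment, because awk -v assignments would interpret backslash escapes.
    hay=$1 needle=$2 awk '
    function squeeze(s) {
        gsub(/[ \t\n]+/, " ", s)   # collapse each run of white space to one blank
        sub(/^ /, "", s)           # drop a leading blank
        sub(/ $/, "", s)           # drop a trailing blank
        return s
    }
    BEGIN {
        s = squeeze(ENVIRON["hay"])
        t = squeeze(ENVIRON["needle"])
        n = 0
        if (t != "") {
            while ((i = index(s, t)) > 0) {    # index() compares literally, no regex
                n++
                s = substr(s, i + length(t))   # continue after this occurrence
            }
        }
        print n
    }'
}
```

Called as count_squeezed "$STRING" "$SUB_STRING" with the variables from the question, this prints 2, whether the substring is indented with 8 or 9 blanks.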

Alternatively, with GNU awk for multi-char RS:

$ awk -v RS='^$' 'NR==FNR{gsub(/[^[:space:]]/,"[&]"); gsub(/[[:space:]]+/,"[[:space:]]+"); str=$0; next} {print gsub(str,"")}' str file
2

or with any awk if your input cannot contain NUL chars:

$ awk -v RS='\0' 'NR==FNR{gsub(/[^[:space:]]/,"[&]"); gsub(/[[:space:]]+/,"[[:space:]]+"); str=$0; next} {print gsub(str,"")}' str file
2

and on and on...

You could try this with GNU grep:

grep -zo -P ".*Bluetooth\n\s*Soft blocked: no\n\s*Hard blocked: no" <your_file> | grep -c "Bluetooth"

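The first grep matches across multiple lines (-z treats the whole input as a single record, -P enables \n and \s in the pattern) and prints only the matched text (-o). Counting the lines that contain Bluetooth in that output then gives the number of matches of the 'substring'.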

Output of first grep:

1: asus-bluetooth: Bluetooth
         Soft blocked: no
         Hard blocked: no
113: hci0: Bluetooth
         Soft blocked: no
         Hard blocked: no

Output of entire command:

2

Use python:

#! /usr/bin/env python

import sys
import re

with open(sys.argv[1], 'r') as i:
    print(len(re.findall(sys.argv[2], i.read(), re.MULTILINE)))

invoke as

$ ./search.py file.txt 'Bluetooth
 +Soft blocked: no
 +Hard blocked: no'

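The " +" at the start of the second and third lines of the pattern matches one or more spaces, so the amount of indentation does not matter.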

EDIT

If the content is already in bash variables it's even simpler

#! /usr/bin/env python

import sys
import re

print(len(re.findall(sys.argv[2], sys.argv[1], re.MULTILINE)))

invoke as

$ ./search.py "$STRING" "$SUB_STRING"

This might work for you (GNU sed & wc):

sed -nr 'N;/^(\s*)Soft( blocked: no\s*)\n\1Hard\2$/P;D' file | wc -l

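This prints one line for each occurrence of the multi-line match, and wc -l counts those lines. Note that the pattern as written appears to test only the Soft blocked/Hard blocked pair of lines, not the Bluetooth line before them, so on the sample input it would count all four devices rather than two.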
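
If the Bluetooth line should be checked as well, a line-by-line state machine is one way to do it. This is only a minimal sketch in awk rather than sed, with the three lines from the question hardcoded as patterns; count_bt_blocks and state are hypothetical names. It tolerates any amount of leading or trailing blanks and tabs.

```shell
# count_bt_blocks is a hypothetical name; the three patterns are hardcoded
# from the substring in the question.
count_bt_blocks() {
    awk '
    /Bluetooth[ \t]*$/                             { state = 1; next }
    state == 1 && /^[ \t]*Soft blocked: no[ \t]*$/ { state = 2; next }
    state == 2 && /^[ \t]*Hard blocked: no[ \t]*$/ { n++ }
    { state = 0 }            # any other line, or a completed match, resets
    END { print n + 0 }      # n + 0 prints 0 when nothing matched
    ' "$@"
}
```

It reads the named file, or standard input when no file is given, for example count_bt_blocks file or printf '%s\n' "$STRING" | count_bt_blocks.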

Another awk

awk '
  NR==FNR{
    b[i++]=$0          # store each line of the substring in array b
    next}
  $0 ~ b[0]{           # current line matches the first line of the substring
    for(j=1;j<i;j++){
      getline
      if($0!~b[j])     # next line does not match: push j past i to end the loop
        j+=i}
     if(j==i)          # every line of the substring matched
       k++}
  END{
    print k+0}         # k+0 prints 0 rather than an empty line when nothing matched
' stringfile infile

EDIT :

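And for the XY problem of the OP, a simple script:

cat scriptbash.sh

#!/bin/bash
list="${1//$'\n'/@}"        # replace newlines with @ so each argument becomes one line
var="${2//$'\n'/@}"
result="${list//"$var"}"    # delete every occurrence; the quotes make the match literal
echo $(( (${#list} - ${#result}) / ${#var} ))   # characters removed / substring length

And you call it like this:

./scriptbash.sh "$STRING" "$SUB_STRING"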
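
The script above relies on Bash's ${var//pattern} expansion. As a rough sketch, the same count can be obtained with POSIX parameter expansion only, by repeatedly cutting the string just after the first occurrence. The name count_occurrences is hypothetical, the white space must match exactly, and the variables rest, sub and n are global because POSIX sh has no local.

```shell
# count_occurrences is a hypothetical helper name.
count_occurrences() {
    rest=$1      # part of the string not yet scanned
    sub=$2       # substring to look for, matched literally
    n=0
    [ -n "$sub" ] || { echo 0; return; }
    while :; do
        case $rest in
            *"$sub"*)
                rest=${rest#*"$sub"}   # cut everything up to and including the first hit
                n=$((n + 1))
                ;;
            *)
                break
                ;;
        esac
    done
    echo "$n"
}
```

Usage would be count_occurrences "$STRING" "$SUB_STRING". Because the substring is quoted inside the patterns, characters such as * or ? in it are not treated as wildcards.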
