
How do I use grep to count the number of occurrences of a string

input:

.
├── a.txt
└── b.txt

// a.txt
aaa

// b.txt
aaa
bbb
ccc

Now I want to know how many times aaa and bbb appear.

output:

aaa: 2
bbb: 1

You can try awk. This uses split to count the occurrences of each search pattern and accumulates the totals in the associative array n.

$ awk 'BEGIN{ pat1="aaa"; pat2="bbb" } 
    { n[pat1]+=(split($0,arr,pat1)-1) } 
    { n[pat2]+=(split($0,arr,pat2)-1) } 
    END{ for(i in n){ print i":",n[i] } }' a.txt b.txt
aaa: 10
bbb: 14

Example data

$ cat a.txt
aaa
aaa efwepom dq
bbb qwpdo bbb
qwdo qwdpomaaa
qwo bbb
pefaaaomaaaewe bb aa
aaa bbb

$ cat b.txt
aaa
aaa efwepom dq
bbb qwpdo bbb
qwdo qwdpomaaa
qwo bbb
pebbb bbb fobbbmebbbwe bb aa
aaa bbb
bbbbbbsad
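
The counting trick in the awk answer relies on the fact that splitting a string on a pattern produces one more piece than there are matches. A minimal sketch on made-up input (the helper name count_matches and the string xaaayaaaz are only illustrations, not part of the original answer):

```shell
# count_matches is a hypothetical helper: prints how often pattern $2 occurs in string $1.
count_matches() {
    printf '%s\n' "$1" | awk -v pat="$2" '{ print split($0, arr, pat) - 1 }'
}

first=$(count_matches 'xaaayaaaz' 'aaa')    # pieces: x, y, z
second=$(count_matches 'bbbbbbsad' 'bbb')   # pieces: "", "", sad (matches do not overlap)
echo "$first $second"   # prints "2 2"
```

The second call is the last line of b.txt above: six b's in a row count as two matches, because matches are taken left to right without overlapping.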

Just an idea:

grep -E "aaa|bbb|ccc" *.txt | awk -F: '{print $2}' | sort | uniq -c

This means:

grep -E "...|..." : extended regular expressions, matches every line containing any of the alternatives

The result is given as:
a.txt:aaa
b.txt:aaa
b.txt:bbb
b.txt:ccc

awk -F: '{print $2}' : split the result into columns, 
                       based on the colon, 
                       and only show the second column

sort | uniq -c : sort and count unique entries
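
As a possible variation on the pipeline above: uniq -c prints the count before the text, so one more awk stage is needed to get the "aaa: 2" layout from the question. The sketch below assumes a grep that supports -h (GNU and BSD grep do), which drops the file name prefix so the awk -F: step is unnecessary. The scratch directory is hypothetical and only recreates the question's two files.

```shell
# Hypothetical scratch copies of the question's a.txt and b.txt.
tmp=$(mktemp -d)
printf 'aaa\n' > "$tmp/a.txt"
printf 'aaa\nbbb\nccc\n' > "$tmp/b.txt"

# -h: no "file:" prefix; uniq -c prints "<count> <line>", which awk swaps around.
result=$(grep -hE 'aaa|bbb' "$tmp"/*.txt | sort | uniq -c | awk '{ print $2 ": " $1 }')
echo "$result"

rm -rf "$tmp"
```

Like the original pipeline, this groups by the whole matching line, so it only gives per-pattern counts when each line consists of nothing but the pattern.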

The problem with grep arises when a pattern occurs more than once on a single line.
grep -c counts matching lines, not matches, so you need -o plus a second command (another grep, wc, or some such) to count the individual matches.
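
The difference is easy to see on a single line with two hits. A small sketch, assuming a grep with the -o option (a GNU/BSD extension, not required by POSIX):

```shell
line='bbb bbb'

# -c reports the number of matching lines.
lines=$(printf '%s\n' "$line" | grep -c 'bbb')

# -o prints each match on its own line, so wc -l counts matches.
# tr strips the padding that some wc implementations add.
matches=$(printf '%s\n' "$line" | grep -o 'bbb' | wc -l | tr -d ' ')

echo "lines=$lines matches=$matches"   # prints "lines=1 matches=2"
```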

$: cat lst
aaa
bbb

$: cat a.txt
aaa

$: cat b.txt # I added a second hit on the bbb line
aaa
bbb bbb
ccc

$: files=( [ab].txt )
$: time while IFS= read -r pattern; do 
     printf "%s: " "$pattern";
     grep -o "$pattern" "${files[@]}" | wc -l;
   done < lst
aaa: 2
bbb: 2

Note that this is slow, even with such a small dataset.

real    0m1.119s
user    0m0.060s
sys     0m0.308s

This lets you keep the patterns in a list file, but it reads every file in your target set once per pattern and runs both grep AND wc each time. Andre's awk solution is cleaner, faster, and generally better all around, especially if you put the patterns in a file and read them from there rather than hard-coding them as literals.

$: time awk 'NR==FNR{ pats[$0]; next; } 
   { for (p in pats) { n[p]+=(split($0,arr,p)-1) } } 
   END{ for(p in n){ print p": ",n[p] } }' lst "${files[@]}"
aaa:  2
bbb:  2

Considerably faster - likely MUCH more so on more data and files.

real    0m0.344s
user    0m0.015s
sys     0m0.092s
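
One caveat with the split-based count: on an empty line split returns 0, so `split(...) - 1` contributes -1 for every pattern. The sample files here have no empty lines, so the results above are unaffected. A possible alternative, sketched under that concern, is gsub, which returns the number of replacements it made. The scratch files below are hypothetical and contain blank lines on purpose.

```shell
# Hypothetical scratch files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'aaa\nbbb\n' > "$tmp/lst"
printf 'aaa\n\nbbb bbb\n\nccc aaa\n' > "$tmp/data.txt"

# gsub() returns the number of replacements; "&" puts each match back unchanged.
gsub_counts=$(awk 'NR==FNR { pats[$0]; next }
    { for (p in pats) n[p] += gsub(p, "&") }
    END { for (p in n) print p ": " n[p] }' "$tmp/lst" "$tmp/data.txt" | sort)

# The split-based expression, for comparison: each empty line subtracts 1.
split_counts=$(awk 'NR==FNR { pats[$0]; next }
    { for (p in pats) n[p] += split($0, arr, p) - 1 }
    END { for (p in n) print p ": " n[p] }' "$tmp/lst" "$tmp/data.txt" | sort)

printf 'gsub:\n%s\nsplit:\n%s\n' "$gsub_counts" "$split_counts"
rm -rf "$tmp"
```

With two blank lines in the data, the gsub version reports aaa: 2 and bbb: 2, while the split version reports 0 for both. The output is piped through sort because the order of `for (p in n)` is unspecified in awk.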
