How do I find the count of multiple words in a text file?

I am able to find the number of times a word occurs in a text file. For example, on Linux we can use:

cat filename|grep -c tom

My question is: how do I find the count of multiple words, such as "tom" and "joe", in a text file?

OK, so first split the file into words, then sort them and count the duplicates with uniq -c:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c

You can use uniq (this counts identical lines, so it assumes one word per line):

sort filename | uniq -c

Since you have a couple of names, a regular expression is the way to go here. At first I thought it was as simple as a grep count on a regular expression matching joe or tom, but I found that this does not account for the case where tom and joe are on the same line (or tom and tom, for that matter).

test.txt:

tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

As you can see from test.txt, 2 is the wrong answer: grep -c counts matching lines, not matches, so we need to account for names appearing on the same line.

I then used grep -o, which prints only the matching part of each line, one match per output line, so every occurrence of tom or joe in the file is listed. Piping that output into wc -l counts those lines.

$ grep -o '\(joe\|tom\)' test.txt|wc -l
       3

3...the correct answer! Hope this helps
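
If a separate count per name is wanted rather than one combined total, the same grep -o idea can be extended with sort and uniq -c. This is a minimal sketch, not part of the answer above: count_each is a hypothetical helper name, and -o and -w are assumed to be available (GNU and BSD grep have them, POSIX does not require them).

```shell
# Recreate the test.txt sample from above in a temporary file.
sample=$(mktemp)
printf '%s\n' 'tom is really really cool!  joe for the win!' \
              'tom is actually lame.' > "$sample"

# count_each PATTERN FILE (hypothetical helper)
# PATTERN is an extended-regex alternation such as 'tom|joe'.
# -o prints every match on its own line, -w keeps only whole-word matches,
# and sort | uniq -c collapses identical matches into "count word" rows.
count_each() {
    grep -owE -e "$1" "$2" | sort | uniq -c
}

count_each 'tom|joe' "$sample"
rm -f "$sample"
```

For the sample file this lists joe once and tom twice.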

Use awk:

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}

This will produce a complete word frequency count for the input. Pipe the output to grep to get the desired words:

awk -f w.awk input | grep -E 'tom|joe'

BTW, you do not need cat in your example. Most programs that act as filters can take the filename as a parameter, so it's better to use

grep -c tom filename

If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
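
The awk frequency count above splits on whitespace, so punctuation stays attached to words (cool! and lame. are counted as written), and filtering with grep -E 'tom|joe' also keeps words such as tomorrow. A possible variation, sketched under the assumption that a word consists only of letters and digits, splits with tr first and counts just the requested words. count_words is a hypothetical name.

```shell
# count_words FILE WORD... (hypothetical helper)
# Prints "count word" for each listed word, counting whole words only.
count_words() {
    file=$1
    shift
    # tr puts every run of letters/digits on its own line.
    tr -cs '[:alnum:]' '\n' < "$file" |
    awk -v want="$*" '
        BEGIN {
            # Register the requested words so that absent ones report 0.
            n = split(want, w, " ")
            for (i = 1; i <= n; i++) count[w[i]] = 0
        }
        ($0 in count) { count[$0]++ }
        END { for (i = 1; i <= n; i++) print count[w[i]], w[i] }
    '
}

# Example: counts for tom and joe in a small temporary file.
sample=$(mktemp)
printf '%s\n' 'tom is really really cool!  joe for the win!' \
              'tom is actually lame.' > "$sample"
count_words "$sample" tom joe
rm -f "$sample"
```

The output keeps the order in which the words were requested, and a word that never occurs is reported with a count of 0.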

To find all hits in all lines

echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3

This will count "tomtom" as 2 hits.

Here is one:

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

UPDATE

A shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list holds pairs: index k is a line number, index k+1 is the
# number of times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo
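
For comparison, the per-line report of the script can be approximated with a short pipeline. This is a sketch under assumptions rather than a drop-in replacement: per_line_counts is a hypothetical name, the search string is treated literally (-F), and grep -o is assumed to be available (GNU and BSD grep provide it).

```shell
# per_line_counts STRING FILE (hypothetical helper)
# Prints "line_number occurrences" for every line that contains STRING.
per_line_counts() {
    # grep -n -o emits one "line_number:match" row per occurrence;
    # cut keeps the line number and uniq -c counts the rows per line.
    grep -n -o -F -e "$1" "$2" | cut -d: -f1 | uniq -c | awk '{ print $2, $1 }'
}

# Example with a small temporary file.
sample=$(mktemp)
printf '%s\n' 'tom and tom' 'joe' 'tom' > "$sample"
per_line_counts tom "$sample"
rm -f "$sample"
```

Piping the result through wc -l gives the number of lines that contain the string.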

  1. The sample you gave does not search for the word "tom". It will also count "atom", "bottom" and many more.
  2. grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is

     \<\(tom\|joe\)\>

You can use a regular expression:

 cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"

I completely forgot about grep -f:

cat filename | grep -c -f names

AWK solution:

Assuming the names are in a file called names :

cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
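
The awk command above adds at most one to the count of a name per line. If every occurrence should count, the gsub idea from the earlier answer can be combined with the names file. A minimal sketch, assuming names is non-empty with one name per line; count_from_names is a hypothetical name, and each name is used as a regular expression, so substrings such as the tom in tomorrow are counted too.

```shell
# count_from_names NAMES_FILE TEXT_FILE (hypothetical helper)
count_from_names() {
    awk '
        # First file: remember each name in order, starting its count at 0.
        NR == FNR {
            if ($1 != "") { name[++n] = $1; ct[n] = 0 }
            next
        }
        # Second file: gsub on a copy of the line returns the number of matches.
        {
            for (i = 1; i <= n; i++) {
                line = $0
                ct[i] += gsub(name[i], "", line)
            }
        }
        END { for (i = 1; i <= n; i++) print name[i], ct[i] }
    ' "$1" "$2"
}

# Example with temporary files standing in for names and filename.
names=$(mktemp)
text=$(mktemp)
printf '%s\n' tom joe > "$names"
printf '%s\n' 'tom is really really cool!  joe for the win!' \
              'tom is actually lame.' > "$text"
count_from_names "$names" "$text"
rm -f "$names" "$text"
```

Unlike for (i in h), the END loop walks the names by index, so the output follows the order of the names file.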

Note that your original grep doesn't search for whole words, e.g.

$ echo tomorrow | grep -c tom
1

You need grep -w
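
The difference is easy to check side by side. The two wrappers below are hypothetical helpers that read standard input; -w is assumed to be supported (it is in GNU and BSD grep).

```shell
# Count lines containing the pattern anywhere, even inside longer words.
substring_lines() { grep -c -e "$1"; }

# Count lines containing the pattern as a whole word.
word_lines() { grep -c -w -e "$1"; }

printf '%s\n' tomorrow tom | substring_lines tom   # both lines match
printf '%s\n' tomorrow tom | word_lines tom        # only the second line matches
```

Both still count lines rather than occurrences, so a line with two names adds one to the result.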

gawk -vRS='[^[:alpha:]]+' '{print}' | grep -cE '^(tom|joe|bob|sue)$'

The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.

We use gawk because POSIX awk doesn't allow a regex as the record separator.

For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print} .")
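
Where gawk is not available, the same word-per-line split can be approximated with tr, which POSIX specifies. This is a sketch under that assumption; count_total is a hypothetical name, and grep -x is used in place of the ^ and $ anchors to require whole-line matches.

```shell
# count_total PATTERN FILE (hypothetical helper)
# PATTERN is an extended-regex alternation such as 'tom|joe|bob|sue'.
count_total() {
    # tr turns every run of non-letters into a newline, one word per line;
    # grep -c counts the lines that match one of the words exactly (-x).
    tr -cs '[:alpha:]' '\n' < "$2" | grep -c -x -E -e "$1"
}

# Example with a small temporary file.
sample=$(mktemp)
printf '%s\n' 'tom is really really cool!  joe for the win!' \
              'tom is actually lame.' > "$sample"
count_total 'tom|joe|bob|sue' "$sample"
rm -f "$sample"
```

For this sample the combined count is 3, matching the grep -o result from the earlier answer.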


 