How do I find the count of multiple words in a text file?
I am able to find the number of times a word occurs in a text file. On Linux, for example, we can use:
cat filename|grep -c tom
My question is: how do I find the count of multiple words like "tom" and "joe" in a text file?
OK, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
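The pipeline above prints a count for every word in the file. To keep only the words from the question, one option is to filter the word list before counting. This is a minimal sketch, not part of the original answer: the function name count_words is hypothetical, and it assumes the text arrives on standard input.

```shell
# count_words WORD...   (hypothetical helper name)
# Reads text on standard input and prints "<count> <word>" for each
# listed word that occurs at least once.
count_words() {
    [ "$#" -gt 0 ] || return 1
    # One word per line, keep only lines that equal one of the requested
    # words exactly (-F fixed strings, -x whole line), then count them.
    tr -cs '[:alnum:]' '\n' |
        grep -Fx -e "$(printf '%s\n' "$@")" |
        sort | uniq -c |
        awk '{print $1, $2}'   # normalise the padding uniq -c adds
}

# Usage: count_words tom joe < testdata
```

Because each word is matched as a whole line, "tomorrow" is not counted as "tom". Words that never occur are simply absent from the output.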
You use uniq (note that this counts identical lines, so it assumes one word per line):
sort filename | uniq -c
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression for joe or tom, but I found that this does not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer: grep -c counts matching lines, not matches, so we need to account for names being on the same line.
I then used grep -o, which prints only the part of each matching line that matches the pattern, one match per output line. This gave every match of tom or joe in the file. I then piped the results into wc -l to count the lines.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3... the correct answer! Hope this helps.
Use awk:
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input. Pipe the output to grep to get the desired fields:
awk -f w.awk input | grep -E 'tom|joe'
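Note that the unanchored grep above would also let through lines for words such as "tomorrow", and that awk's default field splitting leaves punctuation attached ("joe!" is a different word from "joe"). A possible variant that counts only the requested names in one pass is sketched below. The function name count_names and the comma-separated argument are assumptions made for illustration, not part of the original answer.

```shell
# count_names NAME[,NAME...]   (hypothetical helper name)
# Reads text on standard input; prints "<count> <name>" for every
# requested name, including names that occur zero times.
count_names() {
    awk -v names="$1" '
        BEGIN {
            n = split(names, want, ",")
            for (i = 1; i <= n; i++) count[want[i]] = 0
        }
        {
            for (i = 1; i <= NF; i++) {
                w = $i
                gsub(/[^A-Za-z0-9]/, "", w)   # drop attached punctuation
                if (w in count) count[w]++
            }
        }
        END {
            for (i = 1; i <= n; i++) print count[want[i]], want[i]
        }
    '
}

# Usage: count_names tom,joe < input
```

Stripping every non-alphanumeric character is a simplification: "tom's" becomes "toms" and is not counted as "tom".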
BTW, you do not need cat in your example. Most programs that act as filters can take the filename as a parameter, hence it's better to use
grep -c tom filename
If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
To find all hits in all lines (gsub returns the number of substitutions it made, which is accumulated in i):
echo "tom is really really cool! joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list stores pairs: position k holds a line number and
# position k+1 holds the number of times the string occurs on that line
while IFS= read -r line
do
flag=0
while [[ "$line" == *"$string"* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" from the line and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
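The per-line counting at the heart of the script (remove one occurrence, count, repeat) can also be written as a small standalone function. The sketch below is an illustration rather than the author's code: count_in_line is a hypothetical name, and it uses only POSIX parameter expansion, so it does not need bash. It counts non-overlapping occurrences from left to right, which can differ from the script's remove-and-recheck loop when removing one match brings the text on either side together into a new match.

```shell
# count_in_line LINE STRING   (hypothetical helper name)
# Prints how many non-overlapping times STRING occurs in LINE.
count_in_line() {
    [ -n "$2" ] || return 1          # an empty search string would loop forever
    rest=$1
    n=0
    while :; do
        case $rest in
            *"$2"*)
                rest=${rest#*"$2"}   # cut everything up to and including the first match
                n=$((n + 1))
                ;;
            *)
                break
                ;;
        esac
    done
    echo "$n"
}

# Usage: count_in_line "tomtom and tom" tom    prints 3
```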
Grep searches for regular expressions. A regular expression that matches the word "tom" or "joe" is
\<\(tom\|joe\)\>
You can use a regular expression:
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
I completely forgot about grep -f:
cat filename | grep -c -f names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
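Keep in mind that grep -c reports matching lines, so a line that mentions both tom and joe is counted once. For a per-name count of occurrences, something along these lines may work. It assumes the file names holds one name per line and that your grep supports -o and -w (GNU and BSD grep do, but neither flag is required by POSIX). The function name count_listed is made up for this example.

```shell
# count_listed NAMES_FILE   (hypothetical helper name)
# Reads text on standard input; prints "<count> <name>" for every name
# from NAMES_FILE that occurs as a whole word.
count_listed() {
    # -F fixed strings, -f read patterns from file,
    # -w whole words only, -o print each match on its own line
    grep -o -w -F -f "$1" |
        sort | uniq -c |
        awk '{print $1, $2}'
}

# Usage: count_listed names < filename
```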
Note that your original grep doesn't search for words, e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
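The difference is easy to check side by side. The two wrappers below are only a sketch for comparison; their names are invented here, and -w is assumed to be available (it is in GNU and BSD grep, though POSIX does not require it).

```shell
# Both helpers read text on standard input and print a count of lines.
# (hypothetical names, for illustration only)
substring_lines() { grep -c -- "$1"; }    # lines containing the string anywhere
word_lines()      { grep -cw -- "$1"; }   # lines containing it as a whole word

# printf 'tomorrow\ntom\n' | substring_lines tom    prints 2
# printf 'tomorrow\ntom\n' | word_lines tom         prints 1
```

Both still count lines rather than occurrences, so a line with two matches contributes 1.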
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -cE '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word ends up on a separate line. Then grep counts the lines that exactly match one of the words you want (the -E flag is needed so that the parentheses and | act as grouping and alternation).
We use gawk because POSIX awk doesn't allow a regex record separator.
For brevity, you can replace '{print}' with 1; either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")