
How do I find the count of multiple words in a text file?

I am able to find the number of times a word occurs in a text file. On Linux, for example, we can use:

cat filename|grep -c tom

My question is: how do I find the count of multiple words, like "tom" and "joe", in a text file?

OK, so first split the file into words, then sort and uniq:

tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c

You can use uniq. Note that this counts identical lines, so it gives word counts only if the file has one word per line:

sort filename | uniq -c

Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression for joe or tom, but I found that this did not account for the case where tom and joe are on the same line (or tom and tom, for that matter).

test.txt:

tom is really really cool!  joe for the win!
tom is actually lame.


$ grep -c '\<\(tom\|joe\)\>' test.txt
2

As you can see from the test.txt file, 2 is the wrong answer: grep -c counts matching lines, not matches, so we need to account for several names being on the same line.

I then used grep -o, which prints only the matching part of each line, one match per output line. That gave every match of tom or joe in the file. I then piped the results into wc -l to count those lines.

$ grep -o '\(joe\|tom\)' test.txt|wc -l
       3

3... the correct answer! Hope this helps.
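The grep -o approach gives one combined total. If a separate count per word is wanted, a minimal sketch is shown below. It assumes a grep that supports -o and -w (GNU and BSD grep do; neither flag is required by POSIX), and count_words is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: count_words FILE WORD...
# Prints "WORD COUNT" for each word, counting whole-word occurrences.
count_words() {
    file=$1
    shift
    for word in "$@"; do
        # -o prints each match on its own line, -w restricts matches to
        # whole words, -F treats the word as a fixed string.
        # wc -l counts the matches; tr strips the padding some wc versions add.
        n=$(grep -o -w -F -e "$word" "$file" | wc -l | tr -d ' ')
        printf '%s %s\n' "$word" "$n"
    done
}
```

With the test.txt shown above, count_words test.txt tom joe prints "tom 2" and "joe 1". Because of -w, words such as "atom" are not counted.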

Use awk:

{for (i=1;i<=NF;i++)
    count[$i]++
}
END {
    for (i in count)
        print count[i], i
}

This will produce a complete word frequency count for the input. Pipe the output to grep to get the desired fields:

awk -f w.awk input | grep -E 'tom|joe'
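Filtering the frequency table with grep -E 'tom|joe' also keeps words that merely contain those strings, such as "atom". A rough sketch that selects the words inside awk by exact match is given below; word_freq is a hypothetical helper name, and the sketch assumes the wanted words contain no spaces or backslashes.

```shell
# Hypothetical helper: word_freq FILE WORD...
# Prints "COUNT WORD" for each requested word, matched exactly.
word_freq() {
    file=$1
    shift
    # Put one alphanumeric word per line, then count only the words
    # named in the space-separated list passed to awk as "wanted".
    tr -cs '[:alnum:]' '\n' < "$file" |
        awk -v wanted="$*" '
            BEGIN {
                n = split(wanted, w, " ")
                for (i = 1; i <= n; i++) count[w[i]] = 0
            }
            ($0 in count) { count[$0]++ }
            END { for (i = 1; i <= n; i++) print count[w[i]], w[i] }
        '
}
```

Usage would look like word_freq input tom joe. Words that never occur are still listed, with a count of 0.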

BTW, you do not need cat in your example. Most programs that act as filters can take the filename as a parameter, so it's better to use

grep -c tom filename

If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)

To find all hits in all lines:

echo "tom is really really cool!  joe for the win!
tom is actually lame." | awk '{i+=gsub(/tom|joe/,"")} END {print i}'
3

This will count "tomtom" as 2 hits.

Here is one:

cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c

UPDATE

A shell script solution:

#!/bin/bash

file_name="$2"
string="$1"

if [ $# -ne 2 ]
  then
   echo "Usage: $0 <pattern to search> <file_name>"
   exit 1
fi

if [ ! -f "$file_name" ]
 then
  echo "file \"$file_name\" does not exist, or is not a regular file"
  exit 2
fi

line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0

# line_no_list stores pairs: index k holds a line number and index k+1
# holds the number of times the string occurs on that line
while IFS= read -r line
 do
  flag=0
  while [[ "$line" == *"$string"* ]]
   do
    flag=1
    line_no_list[line_no_indx]=$curr_line_indx
    line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
    total_occurance=$((total_occurance+1))
# remove the first occurrence of "$string" and recheck
    line=${line/"$string"/}
  done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
  if (( flag == 1 ))
   then
    line_no_indx=$((line_no_indx+2))
  fi
  curr_line_indx=$((curr_line_indx+1))
done < "$file_name"


echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences in this line]: "

for ((i=0; i<line_no_indx; i=i+2))
 do
  echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done

echo

  1. The sample you gave does not search for the word "tom". It will also count "atom", "bottom" and many more.
  2. Grep searches for regular expressions. The regular expression that matches the word "tom" or "joe" is

     \<\(tom\|joe\)\>

You can use a regular expression:

 cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"

I completely forgot about grep -f:

cat filename | grep -c -f names
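Counting with grep -c and a pattern file reports the number of matching lines, not a count per name. A possible sketch for per-name occurrence counts from the same names file follows. It assumes names holds one word per line with no blank lines, and a grep that supports -o and -w; count_from_list is a hypothetical helper name.

```shell
# Hypothetical helper: count_from_list NAMES_FILE TEXT_FILE
# Prints "COUNT NAME" for every name that occurs at least once.
count_from_list() {
    names=$1
    file=$2
    # -f reads the patterns from the names file, -F treats them as fixed
    # strings, -w matches whole words only, -o prints one match per line.
    # sort | uniq -c tallies the matches; awk normalizes uniq's padding.
    grep -o -w -F -f "$names" "$file" | sort | uniq -c | awk '{ print $1, $2 }'
}
```

Called as count_from_list names filename. Names with no occurrences are simply absent from the output.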

AWK solution:

Assuming the names are in a file called names:

cat filename | awk 'NR==FNR {h[NR] = $1; ct[NR] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -

Note that your original grep doesn't search for words. For example:

$ echo tomorrow | grep -c tom
1

You need grep -w.

gawk -vRS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'

The gawk program sets the record separator to anything non-alphabetic, so every word ends up on a separate line. Then grep counts the lines that exactly match one of the words you want. The -E option is needed so that the parentheses and | are treated as grouping and alternation.

We use gawk because POSIX awk doesn't allow a regex record separator.

For brevity, you can replace '{print}' with 1. Either way, it's an Awk program that simply prints out all input records ("Is 1 true? It is? Then do the default action, which is {print}.")
