In a bash script, for a string of characters (an English word), how do I count the number of occurrences of that word in an external text file?
Bash Script to process irregular text, count occurrences, cut at a threshold
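I have a large sample of text that is very irregular. I want to tokenize it into individual words, count the occurrences of each word, and produce an output that keeps only the words whose occurrence count is > threshold_value.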
if [ $# -ne 3 ]; then
echo 'Usage <file> <output_file> <threshold>'
exit 1
fi
clean_and_rank () {
tr -dc [:graph:][:cntrl:][:space:] < $1 \
| tr -d [:punct:] \
| tr -s ' ' \
| tr ' ' '\n' \
| tr '[A-Z]' '[a-z]' \
| grep -v '^$' \
| sort \
| uniq -c \
| sort -nr
}
cut_below_threshold () {
$THRESHOLD=$1
awk '$1 > '$THRESHOLD' { print $1, $2 }'
}
clean_and_rank $1 \
| cut_below_threshold $3
| sort -nr > $2
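But for some reason I am having trouble with the cut_below_threshold() function.
Also, once this is done, I want to be able to compare the result with another sample (my data is two samples, each made of several lines of labeled text snippets, and I want to score words independently for their prevalence in sample A / sample B).
Is there a better way to approach this? Ultimately, I am looking for insights along the lines of "$WORD appears in sample 1 1000 times out of 100000 total words, and in sample 2 100 times out of 10000 words".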
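Two things in the script as posted would likely explain the trouble. A shell assignment takes no leading `$` on the variable name, so `$THRESHOLD=$1` is run as a command instead of setting `THRESHOLD`. The final pipeline also has no line-continuation backslash after `cut_below_threshold $3`, so the next line starts with a bare `|`. Below is a minimal sketch of a corrected function, assuming the input is the `count word` output of `uniq -c`. Passing the threshold with `awk -v` is a substitution for the original string splicing, not the asker's method.

```shell
#!/bin/bash
# Sketch of a corrected cut_below_threshold; reads "count word" lines on stdin.
cut_below_threshold () {
    local threshold=$1    # assignment: no leading "$" on the variable name
    # -v hands the shell value to awk as the variable t;
    # "t + 0" forces a numeric comparison against the count in $1
    awk -v t="$threshold" '$1 > t + 0 { print $1, $2 }'
}

# Usage, mirroring the original pipeline (every continued line ends in a backslash):
# clean_and_rank "$1" \
#     | cut_below_threshold "$3" \
#     | sort -nr > "$2"
```

Quoting `"$1"`, `"$2"` and `"$3"` in the pipeline also keeps file names with spaces intact.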
I assume you are able to get statistics for the two text files in the following format:
$ cat a.txt
5 word1
3 word2
1 word3
$ cat b.txt
4 word1
3 word2
1 word4
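For reference, this count-then-word layout is what `sort | uniq -c` yields. A minimal sketch of producing it from raw text, assuming words are separated by whitespace and need no further cleaning; `word_stats` is a hypothetical helper name, not part of the original answer:

```shell
#!/bin/sh
# Hypothetical helper: print "count word" lines for the text file given as $1.
word_stats () {
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$1" |   # one word per line
        sort |                     # group identical words together
        uniq -c |                  # prefix each distinct word with its count
        sort -nr |                 # most frequent first
        awk '{ print $1, $2 }'     # drop the padding that uniq -c adds
}
```

Something like `word_stats sampleA.txt > a.txt` would then create one of the statistics files (the file names here are only illustrative).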
Then this script does the comparison:
#!/bin/sh
# the 1st argument passed to the script, the 1st file to compare (statistics for sample A)
STATA=$1
# the 2nd argument -- the 2nd file (statistics for sample B)
STATB=$2
# concatenate both files and pipe it to the next command
cat "${STATA}" "${STATB}" |
# call awk; -v is awk option to set a variable
# n1=$() variable n1 gets its value from the output of the command in ()
# wc -l <file counts number of lines in the file
# ' starts awk script
awk -v n1="$(wc -l <"${STATA}")" '
# (){} means when condition in () is true, execute statement in {}
# NR is number of records processed thus far (usually this is number of lines)
# (NR <= n1) essentially means "reading statistics file for sample A"
# {1; 2} two statements
# wa += $1 add value of the first field to the wa variable
# each line is split by a field separator (space or tab by default) into several fields:
# $1 is the 1st field, $2 is the 2nd, $NF is the last one, $0 is a whole line
# $1 in this case is number of occurrences of a word
# awk variables have zero default value; no need to specify them explicitly
# cnta[] is an associative array -- index is a string (the word in this case)
# $2 in this case is the word
(NR <= n1){wa += $1; cnta[$2] = $1}
# the same for statistics for sample B
(NR > n1){wb += $1; cntb[$2] = $1}
# END{} to execute statements after there is no input left
END {
print "nof words in sample A = " wa;
print "nof words in sample B = " wb;
# standard printf to output a table header
printf "%-15s %5s %8s %5s %8s\n", "word", "cntA", "freqA", "cntB", "freqB";
# iterate over each element (the word) in the count array A
for (w in cnta){
# check that the word is present in the count array B
if (cntb[w] > 0) {
# output statistics in a table form
printf "%-15s %5d %8.6f %5d %8.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w]/wb
}
}
}
'
Test run:
$ ./compare.sh a.txt b.txt
nof words in sample A = 9
nof words in sample B = 8
word             cntA    freqA  cntB    freqB
word1               5 0.555556     4 0.500000
word2               3 0.333333     3 0.375000
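Note that the comparison script prints only words that occur in both samples (the `cntb[w] > 0` check), so word3 and word4 are absent from the table. If words unique to one sample matter for the prevalence comparison, a variant along these lines might help. It is only a sketch: `compare_all` and the `seen` array are made-up names, it uses the common `FNR == NR` two-file idiom instead of `cat` plus `wc -l`, and it assumes both files are non-empty.

```shell
#!/bin/sh
# Sketch: list every word from either statistics file with its count and frequency.
# $1 = statistics for sample A, $2 = statistics for sample B ("count word" lines).
compare_all () {
    awk '
        # FNR == NR holds only while the first file is being read
        FNR == NR { wa += $1; cnta[$2] = $1; seen[$2] = 1; next }
                  { wb += $1; cntb[$2] = $1; seen[$2] = 1 }
        END {
            # a word missing from one sample has an empty count, printed as 0
            for (w in seen)
                printf "%s %d %.6f %d %.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w] / wb
        }
    ' "$1" "$2" | sort
}
```

Called as `compare_all a.txt b.txt`, each output line holds the word, its count and frequency in sample A, then its count and frequency in sample B.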
Let bash do most of the work with associative arrays. This is not a rigorous example; making it rigorous is left as an exercise for you:
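# associative arrays need bash 4 or later
declare -A ct
# open the input file on file descriptor 3
exec 3< file
while IFS= read -r -u3 line ; do
# split the line into whitespace-separated tokens ($1, $2, ...);
# left unquoted on purpose, so glob characters in the text would also expand
set -- $line
for tkn ; do
cct=${ct[$tkn]}
ct[$tkn]=$(( ${cct:-0} + 1 ))
done
done
# print each token followed by its count
for tkn in "${!ct[@]}"
do echo "$tkn ${ct[$tkn]}" ; done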
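To tie the associative-array approach back to the threshold in the question, here is one possible sketch. It assumes bash 4 or later (for `local -A` and `${var,,}`); `count_above` is a hypothetical name, and the lowercasing and punctuation stripping only approximate the `tr` pipeline from the question.

```shell
#!/bin/bash
# Sketch (bash 4+): count normalized tokens in file $1 and print "count word"
# for every word that occurs more than $2 times, most frequent first.
count_above () {
    local -A ct=()
    local -a toks
    local tkn
    # read -a splits each line on whitespace without glob expansion
    while read -r -a toks ; do
        for tkn in "${toks[@]}" ; do
            tkn=${tkn,,}                  # lowercase the token
            tkn=${tkn//[[:punct:]]/}      # strip punctuation characters
            [ -n "$tkn" ] || continue     # skip tokens that were only punctuation
            ct[$tkn]=$(( ${ct[$tkn]:-0} + 1 ))
        done
    done < "$1"
    for tkn in "${!ct[@]}" ; do
        if [ "${ct[$tkn]}" -gt "$2" ] ; then
            echo "${ct[$tkn]} $tkn"
        fi
    done | sort -nr
}
```

For example, `count_above sample.txt 5 > ranked.txt` would keep only the words seen more than five times (the file names are illustrative). On large inputs, the `sort | uniq -c` pipeline is likely to be faster than a pure bash loop.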