Bash script to process irregular text, count occurrences, cut at a threshold

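I have a large sample of text that is very irregular. I want to tokenize it into single words, count the occurrences of each word, and produce output containing only the words whose count is > threshold_value.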

if [ $# -ne 3 ]; then
        echo 'Usage <file> <output_file> <threshold>'
        exit 1
fi

clean_and_rank () {
    tr -dc [:graph:][:cntrl:][:space:] < $1 \
    | tr -d [:punct:] \
    | tr -s ' ' \
    | tr ' ' '\n' \
    | tr '[A-Z]' '[a-z]' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -nr
}

cut_below_threshold () {
        $THRESHOLD=$1
        awk '$1 > '$THRESHOLD' { print $1, $2 }'
}

clean_and_rank $1 \
| cut_below_threshold $3
| sort -nr > $2

But for some reason I'm having trouble with the cut_below_threshold() function.

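Also, once I have this working, I want to be able to compare the result with another sample (my data is 2 samples of several lines of labeled text snippets, and I want to score words independently for prevalence in sample A / sample B).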

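Is there a better way to go about this? Ultimately, I'm looking for insights along the lines of "$WORD occurs 1000 times in sample 1, out of 100000 total words, and 100 times in sample 2, out of 10000 words".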
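On the `cut_below_threshold` trouble: two things in the script as posted look like they would stop it from running. `$THRESHOLD=$1` is not an assignment in Bash (the name has to be written without the leading `$`), and the final pipeline has a line that begins with `|` while the line before it has no trailing backslash. Below is a minimal sketch of the filter with the first issue fixed, assuming the input has the `count word` shape that `uniq -c` emits. Passing the threshold through `awk -v` (the awk variable name `t` is just an illustrative choice) avoids splicing shell text into the awk program.

```shell
#!/bin/bash
# Keep only "count word" lines whose count is greater than the threshold.
cut_below_threshold () {
    # Plain assignment: no leading "$" on the variable name.
    local threshold=$1
    # Hand the value to awk with -v instead of breaking out of the quotes.
    awk -v t="$threshold" '$1 > t { print $1, $2 }'
}

# Example input in the shape produced by "sort | uniq -c".
printf '%s\n' '5 the' '3 cat' '1 sat' | cut_below_threshold 2
```

In the calling pipeline, each continued line then needs either a trailing `|` or a trailing `\`, for example `clean_and_rank "$1" | cut_below_threshold "$3" > "$2"` on one line.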

I assume you are able to get statistics for the two text files in the following format:

$ cat a.txt
5 word1
3 word2
1 word3
$ cat b.txt
4 word1
3 word2
1 word4
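If the statistics files do not exist yet, something along these lines might produce them. It is only a sketch: the function name `word_counts` is hypothetical, and it assumes that a "word" is any run of letters and digits and that case should be folded, which may not match what the question's own cleaning pipeline is meant to do.

```shell
#!/bin/bash
# Hypothetical helper: read text on stdin, print "count word" lines.
word_counts () {
    tr -cs '[:alnum:]' '\n' |        # each run of non-alphanumerics becomes one newline
    tr '[:upper:]' '[:lower:]' |     # fold case so "The" and "the" count together
    grep -v '^$' |                   # drop the empty line left by leading punctuation
    sort | uniq -c |                 # count identical words
    awk '{ print $1, $2 }' |         # strip the padding uniq puts before the count
    sort -k1,1nr -k2,2               # highest count first, ties ordered by word
}

printf 'The cat. The CAT sat!\n' | word_counts
```

Running `word_counts < sampleA.txt > a.txt` (file names are placeholders) would then give a file in the format shown above.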

Then this script does the comparison:

#!/bin/sh
# the 1st argument passed to the script, the 1st file to compare (statistics for sample A)
STATA=$1
# the 2nd argument -- the 2nd file (statistics for sample B)
STATB=$2
# concatenate both files and pipe it to the next command
cat ${STATA} ${STATB} |
# call awk; -v is awk option to set a variable
# n1=$() variable n1 gets its value from the output of the command in ()
# wc -l <file counts number of lines in the file
# ' starts awk script
awk -v n1=$(wc -l <${STATA}) '
# (){} means when condition in () is true, execute statement in {}
# NR is number of records processed thus far (usually this is number of lines)
# (NR <= n1) essentially means 'reading statistics file for sample A'
# {1; 2} two statements
# wa += $1 add value of the first field to the wa variable
# each line is split by a field separator (space or tab by default) into several fields:
# $1 is the 1st field, $2 is the 2nd, $NF is the last one, $0 is a whole line
# $1 in this case is number of occurrences of a word 
# awk variables have zero default value; no need to specify them explicitly
# cnta[] is an associative array -- index is a string (the word in this case)
# $2 in this case is the word
(NR <= n1){wa += $1; cnta[$2] = $1}
# the same for statistics for sample B
(NR  > n1){wb += $1; cntb[$2] = $1}
# END{} to execute statements after there's no input left
END {
  print "nof words in sample A = " wa;
  print "nof words in sample B = " wb;
  # standard printf to output a table header
  printf "%-15s %5s %8s %5s %8s\n", "word", "cntA", "freqA", "cntB", "freqB";
  # iterate over each element (the word) in the count array A
  for (w in cnta){
    # check that the word is present in the count array B
    if (cntb[w] > 0) {
      # output statistics in a table form
      printf "%-15s %5d %8.6f %5d %8.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w]/wb
    }
  }
}
'

Test run:

$ ./compare.sh a.txt b.txt
nof words in sample A = 9
nof words in sample B = 8
word             cntA    freqA  cntB    freqB
word1               5 0.555556     4 0.500000
word2               3 0.333333     3 0.375000
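The script above only reports words that occur in both samples. If words unique to one sample matter too, a variant like the following could list them with a zero count on the other side. This is a sketch rather than the answer's method: the name `compare_all` and the `seen` array are illustrative, it swaps the `wc -l` line count for the common `FNR == NR` idiom to tell the two files apart, and it assumes both statistics files are non-empty.

```shell
#!/bin/bash
# Hypothetical variant: also list words found in only one of the two samples.
compare_all () {
    awk '
        # FNR == NR holds only while the first file (sample A) is being read
        FNR == NR { wa += $1; cnta[$2] = $1; seen[$2] = 1; next }
        # everything else comes from the second file (sample B)
                  { wb += $1; cntb[$2] = $1; seen[$2] = 1 }
        END {
            # a missing count is an empty value, which %d prints as 0
            for (w in seen)
                printf "%s %d %.6f %d %.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w] / wb
        }
    ' "$1" "$2" | sort    # awk iterates arrays in no fixed order, so sort by word
}

# Demo with the same sample statistics as above, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '5 word1' '3 word2' '1 word3' > "$tmp/a.txt"
printf '%s\n' '4 word1' '3 word2' '1 word4' > "$tmp/b.txt"
compare_all "$tmp/a.txt" "$tmp/b.txt"
rm -r "$tmp"
```

Each output line has the form `word cntA freqA cntB freqB`, which maps directly onto the "N times out of M total words" statement the question asks for.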

Let bash do most of the work with an associative array. This is not a rigorous example; it is left to you as an exercise:

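# associative arrays need bash 4 or later
declare -A ct

# read -a splits each line on whitespace without glob-expanding the tokens
while read -r -a tokens ; do
   for tkn in "${tokens[@]}" ; do
      ct[$tkn]=$(( ${ct[$tkn]:-0} + 1 ))
   done
done < file

# print each token followed by its count (order is unspecified)
for tkn in "${!ct[@]}"
do echo "$tkn ${ct[$tkn]}" ; done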
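One way the exercise might be finished, including the threshold from the question, is sketched below. The function name `count_above` is hypothetical, tokens are assumed to be plain whitespace-separated words, and bash 4 or later is assumed for `local -A`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count whitespace-separated tokens on stdin and print
# "token count" for every token seen more than <threshold> times.
count_above () {
    local threshold=$1 tkn
    local -a tokens
    local -A ct=()
    while read -r -a tokens ; do
        for tkn in "${tokens[@]}" ; do
            ct[$tkn]=$(( ${ct[$tkn]:-0} + 1 ))
        done
    done
    for tkn in "${!ct[@]}" ; do
        if [ "${ct[$tkn]}" -gt "$threshold" ] ; then
            printf '%s %s\n' "$tkn" "${ct[$tkn]}"
        fi
    done | sort -k2,2nr -k1,1    # highest count first, ties ordered by token
}

printf 'a b a\nc a b\n' | count_above 1
```

For large inputs the `sort | uniq -c` pipeline is likely to be faster than a shell loop, so this form is mostly useful when everything has to stay inside one bash process.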


Disclaimer: technical posts on this site are published under CC BY-SA 4.0. If you republish one, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.