Bash script to process irregular text, count occurrences, and cut at a threshold

Question

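I have a large, very irregular text sample. I want to tokenize it into individual words, count the occurrences of each word, and produce output that lists only the words whose count is greater than threshold_value.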

if [ $# -ne 3 ]; then
        echo 'Usage <file> <output_file> <threshold>'
        exit 1
fi

clean_and_rank () {
    tr -dc [:graph:][:cntrl:][:space:] < $1 \
    | tr -d [:punct:] \
    | tr -s ' ' \
    | tr ' ' '\n' \
    | tr '[A-Z]' '[a-z]' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | sort -nr
}

cut_below_threshold () {
        $THRESHOLD=$1
        awk '$1 > '$THRESHOLD' { print $1, $2 }'
}

clean_and_rank $1 \
| cut_below_threshold $3
| sort -nr > $2

But for some reason I'm having trouble with the cut_below_threshold() function.
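Two details in the script as posted probably explain the trouble. `$THRESHOLD=$1` is not an assignment in bash (the name must be written without the leading `$`), so no value is stored and the awk program ends up invalid. The final pipeline also has a line that starts with `|` while the line before it has no trailing backslash, which is a syntax error. A minimal corrected sketch of the function follows; the variable name `threshold` and the use of `awk -v` are choices made here, not part of the original script.

```shell
# Sketch of a corrected threshold filter.
# Reads "count word" lines on stdin and keeps those with count > threshold.
cut_below_threshold () {
    # Plain assignment: no "$" in front of the name being assigned.
    threshold=$1
    # Pass the value with -v instead of splicing it into the awk program.
    awk -v t="$threshold" '$1 > t { print $1, $2 }'
}

# Demo with made-up counts; only the line whose count exceeds 2 survives.
printf '%s\n' '5 the' '2 cat' '1 sat' | cut_below_threshold 2
# prints: 5 the
```

With this function in place, the last three lines of the script could be written on one line as `clean_and_rank "$1" | cut_below_threshold "$3" | sort -nr > "$2"`, or kept on separate lines with a trailing `\` on each continued line.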

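Also, once this is done, I want to be able to compare it against another sample (my data consists of two samples, each made up of several lines of labeled text snippets, and I want to score words independently by their prevalence in sample A / sample B).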

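Is there a better way to approach this? Ultimately, I'm looking for insight along the lines of "$WORD is in sample 1 1000 times out of 100000 total words, and it is in sample 2 100 times out of 10000 words".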

Answer 1

I assume you are able to get statistics for the two text files in the following format:

$ cat a.txt
5 word1
3 word2
1 word3
$ cat b.txt
4 word1
3 word2
1 word4
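The answer does not show how to get from raw text to this format. One possible sketch is below. `word_counts` is a hypothetical helper name, and the tokenizer is a simplification that treats every non-letter (digits and apostrophes included) as a separator, which is not the same cleaning that the question's `clean_and_rank` performs.

```shell
# word_counts FILE: print "count word" for each distinct word in FILE.
# Hypothetical helper; the tokenizing rule is an assumption, not the asker's.
word_counts () {
    # Turn every run of non-letters into a single newline: one word per line.
    tr -cs '[:alpha:]' '\n' < "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^$' \
    | sort \
    | uniq -c \
    | awk '{ print $1, $2 }'    # drop the leading padding that uniq -c adds
}
```

For example, `word_counts sampleA.txt > a.txt` and `word_counts sampleB.txt > b.txt` would produce the two statistics files (the sample file names are placeholders).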

Then this script does the comparison:

#!/bin/sh
# the 1st argument passed to the script, the 1st file to compare (statistics for sample A)
STATA=$1
# the 2nd argument -- the 2nd file (statistics for sample B)
STATB=$2
# concatenate both files and pipe it to the next command
cat "${STATA}" "${STATB}" |
# call awk; -v is awk option to set a variable
# n1=$() variable n1 gets its value from the output of the command in ()
# wc -l <file counts number of lines in the file
# ' starts awk script
awk -v n1="$(wc -l < "${STATA}")" '
# (){} means when condition in () is true, execute statement in {}
# NR is number of records processed thus far (usually this is number of lines)
# (NR <= n1) essentially means 'reading statistics file for sample A'
# {1; 2} two statements
# wa += $1 add value of the first field to the wa variable
# each line is split by a field separator (space or tab by default) into several fields:
# $1 is the 1st field, $2 is the 2nd, $NF is the last one, $0 is a whole line
# $1 in this case is number of occurrences of a word 
# awk variables have zero default value; no need to specify them explicitly
# cnta[] is an associative array -- index is a string (the word in this case)
# $2 in this case is the word
(NR <= n1){wa += $1; cnta[$2] = $1}
# the same for statistics for sample B
(NR  > n1){wb += $1; cntb[$2] = $1}
# END{} to execute statements after there's no input left
END {
  print "nof words in sample A = " wa;
  print "nof words in sample B = " wb;
  # standard printf to output a table header
  printf "%-15s %5s %8s %5s %8s\n", "word", "cntA", "freqA", "cntB", "freqB";
  # iterate over each element (the word) in the count array A
  for (w in cnta){
    # check that the word is present in the count array B
    if (cntb[w] > 0) {
      # output statistics in a table form
      printf "%-15s %5d %8.6f %5d %8.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w]/wb
    }
  }
}
'

Test run:

$ ./compare.sh a.txt b.txt
nof words in sample A = 9
nof words in sample B = 8
word             cntA    freqA  cntB    freqB
word1               5 0.555556     4 0.500000
word2               3 0.333333     3 0.375000
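Note that the script prints only words present in both samples, so `word3` and `word4` from the test data do not appear. If words found in only one sample should be listed too, a variant along these lines may work. It is a sketch: `compare_all` is a hypothetical name, it uses the common `NR == FNR` awk idiom instead of `wc -l` to tell the two files apart, and it assumes both files are non-empty.

```shell
# compare_all STATA STATB: list every word seen in either statistics file.
# Hypothetical variant of the answer's script; assumes non-empty inputs.
compare_all () {
    awk '
    # NR == FNR holds only while the first file is being read
    NR == FNR { wa += $1; cnta[$2] = $1; seen[$2] = 1; next }
              { wb += $1; cntb[$2] = $1; seen[$2] = 1 }
    END {
        # a word missing from one sample has an unset count, printed as 0
        for (w in seen)
            printf "%-15s %5d %8.6f %5d %8.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w] / wb
    }' "$1" "$2" | sort    # for-in order is unspecified, so sort by word
}
```

Called as `compare_all a.txt b.txt` on the test data, this should list four words, with a count and frequency of zero on the side where a word is absent.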

Answer 2

Let bash do most of the work with associative arrays. This is not a rigorous example; the rest is left to you as an exercise:

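declare -A ct                       # word -> count; needs bash 4 or later

exec 3< file                        # open the input on file descriptor 3
while IFS= read -r -u3 line ; do
   set -- $line                     # unquoted on purpose: split the line into words
   for tkn ; do                     # loop over the positional parameters
      cct=${ct[$tkn]}
      ct[$tkn]=$(( ${cct:-0} + 1 ))
   done
done
exec 3<&-                           # close the descriptor

# print each word with its count
for tkn in "${!ct[@]}"
do echo "$tkn ${ct[$tkn]}" ; done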
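To connect this approach back to the question's threshold, the tally can be wrapped in a function that normalizes each line first and prints only the words above a given count. This is a sketch under assumptions: `count_above` and its arguments are hypothetical, it needs bash 4 or later (associative arrays and `${line,,}`), and it treats anything other than letters and digits as a separator.

```shell
#!/usr/bin/env bash
# count_above FILE THRESHOLD: print "count word" for every word that
# occurs more than THRESHOLD times in FILE, highest count first.
count_above () {
    local file=$1 threshold=$2 line tkn
    local -A ct=()
    # "|| [ -n ... ]" also picks up a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line,,}                    # lowercase the line
        line=${line//[^[:alnum:]]/ }      # non-alphanumerics become spaces
        for tkn in $line; do              # unquoted on purpose: split into words
            ct[$tkn]=$(( ${ct[$tkn]:-0} + 1 ))
        done
    done < "$file"
    for tkn in "${!ct[@]}"; do
        if [ "${ct[$tkn]}" -gt "$threshold" ]; then
            printf '%s %s\n' "${ct[$tkn]}" "$tkn"
        fi
    done | sort -nr
}
```

A call such as `count_above sample.txt 10` (placeholder file name) would print the words seen more than ten times. For a very large input, a pure-bash loop like this is likely to be much slower than a `sort | uniq -c` pipeline, so it is better suited to illustrating the idea than to production use.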


Solution 1
1 Accepted 2014-03-30 22:14:57

Solution 2
0 2014-03-29 22:15:42
