Trying to output the x most common words in a text file

Question

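I'm trying to write a program that reads in a text file and outputs a list of the most common words (30, as the code is written now) along with their counts. So something like: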

word1 count1
word2 count2
word3 count3
...   ...
...   ...
wordn countn

ordered so that count1 > count2 > count3 > ... > countn. This is what I have so far, but I cannot get the sorted function to do what I want. The error I get now is:

TypeError: list indices must be integers, not tuple

I'm new to Python. Any help would be appreciated. Thank you.

def count_func(dictionary_list):
  return dictionary_list[1]

def print_top(filename):
  word_list = {}
  with open(filename, 'r') as input_file:

    count = 0

    #best
    for line in input_file:
      for word in line.split():
        word = word.lower()
        if word not in word_list:
          word_list[word] = 1
        else:
          word_list[word] += 1

#sorted_x = sorted(word_list.items(), key=operator.itemgetter(1))
#  items = sorted(word_count.items(), key=get_count, reverse=True)

  word_list = sorted(word_list.items(), key=lambda x: x[1])

  for word in word_list:
    if (count > 30):#19
      break
    print "%s: %s" % (word, word_list[word])
    count += 1


# This basic command line argument parsing code is provided and
# calls the print_words() and print_top() functions which you must define.
def main():
  if len(sys.argv) != 3:
    print 'usage: ./wordcount.py {--count | --topcount} file'
    sys.exit(1)

  option = sys.argv[1]
  filename = sys.argv[2]
  if option == '--count':
    print_words(filename)
  elif option == '--topcount':
    print_top(filename)
  else:
    print 'unknown option: ' + option
    sys.exit(1)

if __name__ == '__main__':
  main()
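
For reference, the TypeError comes from the `print` line: after `sorted(word_list.items(), ...)`, `word_list` is a list of (word, count) tuples, so each `word` in the loop is a tuple and cannot be used to index that list. The sort is also ascending, and `count > 30` lets 31 rows through. A minimal sketch of one possible fix, written for Python 3 and assuming a hypothetical helper `top_words` that is not in the original post:

```python
def top_words(filename, limit=30):
    """Return up to `limit` (word, count) pairs, most frequent first."""
    counts = {}
    with open(filename, 'r') as input_file:
        for line in input_file:
            for word in line.split():
                word = word.lower()
                counts[word] = counts.get(word, 0) + 1
    # sorted() returns a list of (word, count) tuples; reverse=True puts the
    # largest counts first, and the slice keeps only the first `limit` pairs.
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


def print_top(filename, limit=30):
    # Unpack each tuple instead of indexing the sorted list with it.
    for word, count in top_words(filename, limit):
        print("%s: %s" % (word, count))
```

Returning the pairs from a separate function keeps the counting logic easy to check without capturing printed output.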

Answer 1

Use the collections.Counter class.

from collections import Counter

for word, count in Counter(words).most_common(30):
    print(word, count)

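Some unsolicited advice: don't write so many functions until everything works as one big block of code. Refactor into functions after it works. For a script this small, you don't even need a main section.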
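
The Counter snippet above assumes `words` is already an iterable of words. A minimal sketch of one way to build it, assuming whitespace splitting and lowercasing as in the question (the helper name `most_common_words` is hypothetical):

```python
from collections import Counter


def most_common_words(filename, n=30):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    with open(filename) as input_file:
        # Feed Counter one word at a time, line by line.
        counter = Counter(
            word.lower() for line in input_file for word in line.split()
        )
    return counter.most_common(n)
```

`most_common(n)` already returns the pairs sorted from highest to lowest count, so no separate `sorted` call is needed.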

Answer 2

Use groupby from itertools:

from itertools import groupby

words = sorted([w.lower() for w in open("/path/to/file").read().split()])
count = [[item[0], len(list(item[1]))] for item in groupby(words)]
count.sort(key=lambda x: x[1], reverse = True)
for item in count[:5]:
    print(*item)

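This reads the words from the file, sorts them, and builds a list of unique words with their occurrence counts. The resulting list is then sorted by occurrence count with: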
```
 count.sort(key=lambda x: x[1], reverse = True) 
```
reverse = True lists the most common words first.
In the line:
```
 for item in count[:5]: 
```
[:5] defines how many of the most frequent words to show.
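
The initial sort matters because groupby only merges adjacent equal items. A small sketch on an in-memory list (the sample words are made up for illustration):

```python
from itertools import groupby

words = sorted(["b", "a", "b", "c", "a", "b"])
# groupby yields (key, group_iterator) pairs for each run of equal items.
counts = [(key, len(list(group))) for key, group in groupby(words)]
counts.sort(key=lambda pair: pair[1], reverse=True)
print(counts[:2])  # [('b', 3), ('a', 2)]

# Without sorting first, the same word can show up in several runs.
unsorted_runs = [(key, len(list(group))) for key, group in groupby(["b", "a", "b"])]
print(unsorted_runs)  # [('b', 1), ('a', 1), ('b', 1)]
```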

Answer 3

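The first method others have suggested, i.e. using most_common(...), doesn't work for your needs, because it returns the n most common words and not the words whose count is less than or equal to n.

Here it is with most_common(...). Note that it prints only the first n most common words: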

>>> import re
... from collections import Counter
... def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     for word, count in Counter(words).most_common(max_count):
...         print word, count
... print_top('n.sh', 1)
force 1

The correct way would be as follows. Note that it prints every word whose count is less than or equal to max_count:

>>> import re
... from collections import Counter
... def print_top(filename, max_count):
...     words = re.findall(r'\w+', open(filename).read().lower())
...     for word, count in filter(lambda x: x[1]<=max_count, sorted(Counter(words).items(), key=lambda x: x[1], reverse=True)):
...         print word, count
... print_top('n.sh', 1)
force 1
in 1
done 1
mysql 1
yes 1
egrep 1
for 1
1 1
print 1
bin 1
do 1
awk 1
reinstall 1
bash 1
mythtv 1
selections 1
install 1
v 1
y 1
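
The two behaviours can be compared on a small in-memory list. This is a Python 3 sketch with made-up sample words: `most_common(n)` returns the n highest-count entries, while the filter keeps every word whose count is at most `max_count`.

```python
from collections import Counter

words = ["force", "in", "done", "force", "in", "force"]
counts = Counter(words)

# The n entries with the highest counts.
top_one = counts.most_common(1)
print(top_one)  # [('force', 3)]

# Every word whose count is <= max_count, highest counts first.
max_count = 2
at_most = [
    (word, count)
    for word, count in sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    if count <= max_count
]
print(at_most)  # [('in', 2), ('done', 1)]
```

Which of the two is wanted depends on whether "x" means a number of words or a ceiling on the count.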

Answer 4

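Here is my python3 solution. I was asked this question in an interview, and the interviewer was happy with this solution, although with fewer time constraints the other solutions provided above seem a lot nicer to me.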

    dict_count = {}
    lines = []

    file = open("logdata.txt", "r")

    for line in file:# open("logdata.txt", "r"):
        lines.append(line.replace('\n', ''))

    for line in lines:
        if line not in dict_count:
            dict_count[line] = 1
        else:
            num = dict_count[line]
            dict_count[line] = (num + 1)

    def greatest(words):
        greatest = 0
        string = ''
        for key, val in words.items():
            if val > greatest:
                greatest = val
                string = key
        return [greatest, string]

    most_common = []
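    def n_most_common_words(n, words):
        # Repeatedly pull out the current maximum; stop early if words runs out,
        # otherwise greatest() returns [0, ''] and the del raises KeyError.
        while len(most_common) < n and words:
            top = greatest(words)
            most_common.append(top)
            del words[top[1]]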

    n_most_common_words(20, dict_count)

    print(most_common)
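
The repeated-maximum loop above rescans the dictionary once per result and deletes entries from it as it goes. As an alternative sketch (a swapped-in technique, not the approach from this answer), `heapq.nlargest` from the standard library selects the top n without modifying the dictionary. The helper name `n_most_common` is hypothetical:

```python
import heapq


def n_most_common(n, counts):
    """Return up to n [count, key] pairs with the largest counts."""
    # nlargest compares items via the key function and leaves counts intact.
    top = heapq.nlargest(n, counts.items(), key=lambda item: item[1])
    return [[count, key] for key, count in top]
```

It returns the same `[count, key]` shape as `greatest`, so it could stand in for the `most_common` list built above.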


Answer 1
2 2016-09-02 19:38:42

Answer 2
1 2016-09-02 20:11:03

Answer 3
0 2016-09-02 19:56:31

Answer 4
0 2019-03-11 19:55:35
