Word count program for a txt file

Question

I am using the following code to count the words in a txt file:

#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
print (word,wordcount)
file.close();

This gives me output like this:

>>> 
goat {'goat': 2, 'cow': 1, 'Dog': 1, 'lion': 1, 'snake': 1, 'horse': 1, 'ï»¿': 1, 'tiger': 1, 'cat': 2, 'dog': 1}

But I want the output in the following form:

word  wordcount
goat    2
cow     1
dog     1.....

Also, I am getting an extra symbol (ï»¿) in the output. How can I remove it?

Answer 1

The funny symbol you are seeing is the UTF-8 BOM (Byte Order Mark). To get rid of it, open the file with the correct encoding (I am assuming you are on Python 3):

file = open(r"D:\zzzz\names2.txt", "r", encoding="utf-8-sig")
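To see what the utf-8-sig codec changes, here is a small self-contained sketch. It writes a temporary file that starts with a BOM and reads it back with both codecs. The helper name first_word and the sample contents are made up for illustration, not taken from the question.

```python
import codecs
import os
import tempfile


def first_word(path, encoding):
    """Return the first whitespace-separated word read with the given encoding."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read().split()[0]


# Hypothetical sample file: a UTF-8 BOM followed by a few words.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fh:
    fh.write(codecs.BOM_UTF8 + b"goat cat dog")

with_bom = first_word(sample_path, "utf-8")         # BOM stays attached to the first word
without_bom = first_word(sample_path, "utf-8-sig")  # the codec drops the BOM
os.remove(sample_path)

print(repr(with_bom))
print(repr(without_bom))
```

With plain utf-8 the BOM survives as the character U+FEFF at the start of the text, which is why it ends up in the word counts.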

Also, for counting, you can use collections.Counter:

from collections import Counter
wordcount = Counter(file.read().split())

To display them:

>>> for item in wordcount.items(): print("{}\t{}".format(*item))
...
snake   1
lion    2
goat    2
horse   3
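If you also want the header row and aligned columns from the question, something like the following might work. It is only a sketch: format_counts is a name I made up, it takes the text as a string rather than a file, and it lists the most frequent words first.

```python
from collections import Counter


def format_counts(text):
    """Return a two-column table of word counts, most frequent first."""
    counts = Counter(text.split())
    # Column width: the longest word, but at least as wide as the header.
    width = max((len(word) for word in counts), default=0)
    width = max(width, len("word"))
    lines = ["{:<{w}}  {}".format("word", "wordcount", w=width)]
    for word, count in counts.most_common():
        lines.append("{:<{w}}  {}".format(word, count, w=width))
    return "\n".join(lines)


sample = "goat cat goat dog cat cow"  # made-up sample text
print(format_counts(sample))
```

To use it on the file, pass file.read() from the utf-8-sig open call above instead of the sample string.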

Answer 2

#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
# Print one "word count" pair per line (Python 3 print function).
for k, v in wordcount.items():
    print(k, v)
file.close()

Answer 3

FILE_NAME = 'file.txt'

wordCounter = {}

with open(FILE_NAME,'r') as fh:
  for line in fh:
    # Strip punctuation characters and lowercase the line.
    # split() then breaks the line into a list of words.
    word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
    for word in word_list:
      # Adding  the word into the wordCounter dictionary.
      if word not in wordCounter:
        wordCounter[word] = 1
      else:
        # if the word is already in the dictionary update its count.
        wordCounter[word] = wordCounter[word] + 1

print('{:15}{:3}'.format('Word','Count'))
print('-' * 18)

# Print each word and its number of occurrences.
for word, occurrence in wordCounter.items():
  print('{:15}{:3}'.format(word, occurrence))

Output:

Word           Count
------------------
of               6
examples         2
used             2
development      2
modified         2
open-source      2
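The chained replace() calls above only remove commas, apostrophes and full stops. A possible variation, sketched here under my own assumptions, deletes every ASCII punctuation character with str.translate. Note that this behaves differently from the code above: hyphens are removed too, so "open-source" would become "opensource". The name normalized_words and the sample sentence are made up.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalized_words(line):
    """Remove punctuation, lowercase the line, and split it into words."""
    return line.translate(STRIP_PUNCT).lower().split()


# Made-up sample line, not taken from the original file.
counts = Counter(normalized_words("The dog, the cat. The dog's bone!"))
for word, count in counts.items():
    print('{:15}{:3}'.format(word, count))
```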

Answer 4

import sys
file=open(sys.argv[1],"r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for key in wordcount.keys():
    print("%s %s " % (key, wordcount[key]))
file.close()

Answer 5

If you are using GraphLab, you can use this function. It is really powerful:

products['word_count'] = graphlab.text_analytics.count_words(your_text)

Answer 6

#!/usr/bin/python
file=open("D:\\zzzz\\names2.txt","r+")
wordcount={}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

# Print one "word count" pair per line (Python 3 print function).
for k, v in wordcount.items():
    print(k, v)
file.close()

Answer 7

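You can do it like this:

# In a raw string a single backslash is already literal.
file = open(r'D:\zzzz\names2.txt')
file_split = set(file.read().split())
# Number of distinct words in the file.
print(len(file_split))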
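Keep in mind that taking the length of a set gives the number of distinct words, not how often each one occurs. A tiny sketch of the difference, with a made-up helper name and sample string:

```python
def word_totals(text):
    """Return (total words, distinct words) for whitespace-separated text."""
    words = text.split()
    return len(words), len(set(words))


# Made-up sample text: 6 words in total, 4 of them distinct.
total, distinct = word_totals("goat cat goat dog cat cow")
print(total, distinct)
```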

Answer 8

The following code, from "Python | How to count the frequency of words in a text file?", worked for me.

import re

frequency = {}
# Open the sample text file in read mode.
document_text = open('sample.txt', 'r')
# Convert the document's text to lowercase and assign it to text.
text = document_text.read().lower()
document_text.close()
# Find every run of 2 to 15 lowercase letters between word boundaries.
pattern = re.findall(r'\b[a-z]{2,15}\b', text)
for word in pattern:
    count = frequency.get(word, 0)
    frequency[word] = count + 1
frequency_list = frequency.keys()
for words in frequency_list:
    print(words, frequency[words])
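It is worth knowing what that pattern leaves out: one-letter words and words longer than 15 letters never match. A short sketch to check this, where find_words is a name I made up and the sample strings are invented:

```python
import re

# Same pattern as above: 2 to 15 lowercase letters between word boundaries.
WORD_PATTERN = re.compile(r'\b[a-z]{2,15}\b')


def find_words(text):
    """Return the lowercase words of 2 to 15 letters found in text."""
    return WORD_PATTERN.findall(text.lower())


print(find_words("A dog and I saw a cat"))   # the one-letter words are skipped
print(find_words("internationalization"))    # 20 letters, so no match at all
```

If short or very long words matter for your counts, the quantifier {2,15} is the part to adjust.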


Answer 9

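from collections import Counter

filename = "names2.txt"  # placeholder: the original snippet leaves filename undefined

with open(filename) as fh:
    fsplit = fh.read().split()

user = Counter(fsplit)

print("sorted counting values:-")
# sorted() orders the (word, count) pairs alphabetically by word.
for i, v in sorted(user.items()):
    print((v, i))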

9 solutions

Solution 1  47  2014-01-14 06:55:30
Solution 2  32  2014-01-14 08:08:24
Solution 3  2  2017-02-20 14:40:32
Solution 4  1  2014-01-14 06:56:34
Solution 5  1  2016-03-13 16:36:09
Solution 6  1  2017-11-10 21:50:09
Solution 7  0  2019-10-20 21:31:45
Solution 8  0  2020-07-04 05:30:13
Solution 9  -1  2020-08-13 07:40:47
