How to count unique words in a text document using Python (without special characters or case interfering)

Question

I am new to Python and need some help coming up with a text content analyzer that finds 7 things within a text file:

Total word count
Total count of unique words (without case and special characters interfering)
The number of sentences
Average number of words in a sentence
Commonly used phrases (a phrase of 3 or more words used more than 3 times)
A list of words used, in order of descending frequency (without case and special characters interfering)
The ability to accept input from STDIN, or from a file specified on the command line

So far I have this Python program to print the total word count:

with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    # Split on whitespace; punctuation stays attached to the words
    words = p.split()
    wordCount = len(words)
    print("The total word count is:", wordCount)

So far I have this Python program to print unique words and their frequency. The output is not ordered, and it sees tokens such as dog, dog., "dog, and dog, as different words:

file = open("/Users/name/Desktop/20words.txt", "r+")

wordcount = {}

# Count each whitespace-separated token exactly as it appears
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)

Thank you for any help you can give!

Answer 1

If you know which characters you want to ignore, you can use str.strip to remove them from both ends of each word.

word = word.strip().strip("'").strip('"')...

This removes those characters only where they occur at the ends of the word, not in the middle. It probably isn't as efficient as using an NLP library, but it gets the job done.
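As a minimal sketch of the same idea, the chain of strip calls can be replaced by a single strip over string.punctuation, combined with str.lower so that case does not interfere. The helper name normalize and the sample text are made up for illustration; they are not from the question.

```python
import string
from collections import Counter

def normalize(word):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return word.strip(string.punctuation).lower()

# Hypothetical sample text, standing in for the contents of the file
text = 'The dog, the "dog and the DOG. A dog!'

words = [normalize(w) for w in text.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

counts = Counter(words)
print("Unique words:", len(counts))
# Words in order of descending frequency
for word, n in counts.most_common():
    print(word, n)
```

Because strip only touches the ends, an apostrophe inside a word such as don't is kept.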

Reference: str.strip docs

Answer 2

Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles that have a dot followed by an upper-case letter. For words, too, you can use a simple regex instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.

This should help you get started:

import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

# A sentence: upper-case start, ends at . ! or ? followed by an upper-case letter or the end
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for s, n in sentences.most_common():
    print(s, n)

# A word: any run of letters, digits or underscores, counted in lowercase
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(w, n)
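Building on the same two regular expressions, the sentence count and the average number of words per sentence from the question could be derived roughly as follows. This is only a sketch: the variable names are made up, and the numbers inherit the ambiguity mentioned above, since the pattern treats "Mr." as the end of a sentence and therefore counts one sentence too many for this text.

```python
import re

text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

# Same patterns as in the answer above
sentence_re = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
word_re = re.compile(r"\w+")

sentences = sentence_re.findall(text)
total_words = len(word_re.findall(text))

# Guard against empty input before dividing
average = total_words / len(sentences) if sentences else 0.0

print("Sentences:", len(sentences))
print("Total words:", total_words)
print("Average words per sentence:", average)
```

Note that \w+ splits "upper-case" and "e.g." into two words each, so the totals depend on what qualifies as a "word".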

For "more power", you could use some natural language toolkit, but this might be a bit much for this task.
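Without any toolkit, collections.Counter can also be pointed at the "commonly used phrases" item from the question by counting runs of consecutive words. The sketch below is one possible approach, not part of the original answer: the function name common_phrases, its parameters, and the sample text are all hypothetical, and "used over 3 times" is read as at least 4 occurrences.

```python
import re
from collections import Counter

def common_phrases(text, size=3, min_count=4):
    """Return (phrase, count) pairs for runs of `size` consecutive words
    that occur at least `min_count` times, most frequent first."""
    words = re.findall(r"\w+", text.lower())
    # zip over shifted copies of the list yields every run of `size` words
    runs = zip(*(words[i:] for i in range(size)))
    counts = Counter(" ".join(run) for run in runs)
    return [(p, n) for p, n in counts.most_common() if n >= min_count]

# Hypothetical sample text with one sentence repeated four times
sample = "the quick brown fox. " * 4 + "The lazy dog sleeps."
for phrase, n in common_phrases(sample):
    print(phrase, n)
```

This counts phrases of exactly size words and ignores sentence boundaries, so a phrase may span the end of one sentence and the start of the next. Longer phrases can be found by calling the function again with a larger size.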

Answer 1
1 2015-06-23 12:57:24

Answer 2
1 Accepted 2015-06-23 13:15:52
