How can you use Python to count the unique words (without special characters/case interfering) in a text document?

I am new to Python and need some help coming up with a text content analyzer that will find 7 things within a text file:

  1. Total word count
  2. Total count of unique words (without case and special characters interfering)
  3. The number of sentences
  4. Average number of words in a sentence
  5. Commonly used phrases (a phrase of 3 or more words used more than 3 times)
  6. A list of words used, in order of descending frequency (without case and special characters interfering)
  7. The ability to accept input from STDIN, or from a file specified on the command line

So far I have this Python program to print the total word count:

with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
    words = p.split()
    wordCount = len(words)
    print("The total word count is:", wordCount)

So far I have this Python program to print unique words and their frequency (the output is not in order, and it sees tokens such as dog , dog. , "dog , and dog, as different words):

wordcount = {}

with open("/Users/name/Desktop/20words.txt", "r") as file:
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k, v in wordcount.items():
    print(k, v)

Thank you for any help you can give!

If you know which characters you want to avoid, you can use str.strip to remove them from both ends of each word.

word = word.strip().strip("'").strip('"')...

This removes those characters wherever they occur at the start or end of the word; characters in the middle are left alone. It is probably less robust than using an NLP library, but it can get the job done.
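As a rough sketch of this idea: str.strip accepts a string of characters to remove, so the chained calls can be collapsed into one by passing string.punctuation. The helper name `normalize` and the sample sentence are made up for illustration, and the sketch assumes that ASCII punctuation is all that needs stripping.

```python
import string
from collections import Counter

def normalize(token):
    """Strip ASCII punctuation from both ends of a token and lowercase it."""
    return token.strip(string.punctuation).lower()

# Hypothetical sample text; in the question this would come from the file.
sample = 'The dog saw a dog. "Dog, come here," said the owner of the dog,'

tokens = [normalize(t) for t in sample.split()]
tokens = [t for t in tokens if t]  # drop tokens that were only punctuation

counts = Counter(tokens)
print(len(counts))            # number of unique words
print(counts.most_common(3))  # most frequent words first
```

With this normalization, dog, dog., "Dog, and dog, all end up under the same key. Punctuation inside a word (an apostrophe in "don't", say) is kept, because strip only touches the ends.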

See the str.strip docs.

Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles that have a dot followed by an upper-case letter. For words, too, you can use a simple regex instead of using split. The exact expression to use depends on what qualifies as a "word". Finally, you can use collections.Counter to count all of those instead of doing it manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.

This should help you get started:

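import re, collections

text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""

# A sentence: an upper-case letter, then the shortest run of text ending in . ! or ?
# that is followed by whitespace plus an upper-case letter, or by the end of the text.
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
for s, n in sentences.most_common():  # (sentence, count) pairs
    print(s, n)

# A word: any run of word characters, matched against the lower-cased text.
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():  # (word, count) pairs, most frequent first
    print(w, n)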
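The same two patterns also cover most of the remaining items in the question. The following is only a sketch under a few assumptions: the sample text is invented, a "phrase" is taken to be exactly three consecutive words (longer phrases would need the same window with a larger size), and the sliding window ignores sentence boundaries, so a phrase may span two sentences.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen so that some phrases repeat.
text = ("The quick brown fox jumps. The quick brown fox sleeps. "
        "Does the quick brown fox eat? Yes! The quick brown fox eats.")

# Same sentence and word patterns as in the answer.
sentences = re.findall(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", text, re.S)
words = re.findall(r"\w+", text.lower())

total_words = len(words)
unique_words = len(set(words))
average = total_words / len(sentences)  # average words per sentence

# Slide a three-word window over the word list and count each window.
trigrams = Counter(zip(words, words[1:], words[2:]))
# "Used over 3 times" is read here as a count strictly greater than 3.
common_phrases = [" ".join(t) for t, n in trigrams.items() if n > 3]

print(total_words, unique_words, len(sentences), average)
print(common_phrases)
```

Zipping the list with shifted copies of itself is a compact way to build the windows without extra libraries; for large files an iterator-based window would use less memory.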

For "more power", you could use a natural language toolkit, but that might be a bit much for this task.
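Neither answer addresses item 7 of the question, accepting input from STDIN or from a file named on the command line. The standard library is enough for that; here is a minimal sketch. The function name `read_text` and its parameters are hypothetical, and UTF-8 is assumed as the file encoding.

```python
import io
import sys

def read_text(args, stdin=sys.stdin):
    """Read the file named by the first argument, or fall back to the given stream."""
    if args:
        with open(args[0], encoding="utf-8") as f:
            return f.read()
    return stdin.read()

# In a real script this would be: text = read_text(sys.argv[1:])
# Here a StringIO stands in for STDIN so the sketch runs on its own.
text = read_text([], stdin=io.StringIO("the dog saw the dog"))
print("The total word count is:", len(text.split()))
```

Saved as a script with the `sys.argv[1:]` call in place, it could then be run either with a path argument or with text piped in, and the resulting `text` passed to the counting code above.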
