How can you use Python to count the unique words (without special characters or letter case interfering) in a text document?
I am new to Python and need some help with trying to come up with a text content analyzer that will help me find 7 things within a text file:
So far I have this Python program to print total word count:
with open('/Users/name/Desktop/20words.txt', 'r') as f:
    p = f.read()
# split() with no arguments splits on any run of whitespace
words = p.split()
wordCount = len(words)
print("The total word count is:", wordCount)
So far I have this Python program to print unique words and their frequency (it's not in order, and it sees strings such as dog, dog., "dog, and dog, as different words):
wordcount = {}
with open("/Users/name/Desktop/20words.txt", "r") as file:
    for word in file.read().split():
        # each whitespace-separated token is counted exactly as written
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)
Thank you for any help you can give!
If you know what characters you want to avoid, you can use str.strip to remove these characters from the extremities.
word = word.strip().strip("'").strip('"')  # ...and so on for other characters
This will remove the occurrence of these characters on the extremities of the word. This probably isn't as efficient as using some NLP library, but it can get the job done.
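As a rough sketch of the strip approach applied to a whole text, assuming you only care about ASCII punctuation: str.strip accepts a string of characters to remove, so passing string.punctuation trims all of them from both ends in one call. The function name count_words_stripped and the sample sentence are made up for illustration.

```python
import string
from collections import Counter

def count_words_stripped(text):
    """Count words after trimming punctuation from both ends and lowercasing."""
    counts = Counter()
    for raw in text.split():
        # string.punctuation covers ASCII punctuation only (assumption:
        # the file has no curly quotes or other non-ASCII punctuation)
        word = raw.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted only of punctuation
            counts[word] += 1
    return counts

# Hypothetical sample text, not taken from the question's file
sample = 'The dog saw a dog. "Dog, come here," said the man - twice!'
counts = count_words_stripped(sample)
print(len(counts), "unique words")
for word, n in counts.most_common(3):
    print(word, n)
```

Note that strip only touches the ends of each token, so punctuation inside a word (an apostrophe in "don't", for example) is left alone.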
Certainly the most difficult part is identifying the sentences. You could use a regular expression for this, but there might still be some ambiguity, e.g. with names and titles, which have a dot followed by an upper-case letter.
For words, too, you can use a simple regex instead of using split. The exact expression to use depends on what qualifies as a "word".
Finally, you can use collections.Counter for counting all of those instead of doing this manually. Use str.lower to convert either the text as a whole or the individual words to lowercase.
This should help you get started:
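import re, collections
text = """Sentences start with an upper-case letter. Do they always end
with a dot? No! Also, not each dot is the end of a sentence, e.g. these two,
but this is. Still, some ambiguity remains with names, like Mr. Miller here."""
# A sentence starts with a capital letter and ends with . ! or ? that is
# followed by whitespace and another capital letter, or by the end of the text.
sentence = re.compile(r"[A-Z].*?[.!?](?=\s+[A-Z]|$)", re.S)
sentences = collections.Counter(sentence.findall(text))
# most_common() yields (item, count) pairs, most frequent first
for s, n in sentences.most_common():
    print(s, n)
# A word is any run of letters, digits or underscores.
word = re.compile(r"\w+")
words = collections.Counter(word.findall(text.lower()))
for w, n in words.most_common():
    print(w, n)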
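To illustrate how much the result depends on what qualifies as a "word", here is a small sketch comparing \w+ with a slightly stricter pattern. The pattern, the function name word_frequencies and the sample sentence are assumptions chosen for illustration, not the one right definition: the pattern treats runs of ASCII letters joined by single apostrophes or hyphens as one word.

```python
import re
from collections import Counter

# Hypothetical word pattern: runs of lowercase ASCII letters, optionally
# joined by single apostrophes or hyphens (so "don't" and "well-known"
# each count as one word). Digits and non-ASCII letters are ignored.
WORD = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def word_frequencies(text):
    """Return a Counter of lowercased words found by the WORD pattern."""
    return Counter(WORD.findall(text.lower()))

sample = "Don't panic: the well-known dog isn't Mr. Miller's dog."
freq = word_frequencies(sample)

# For comparison: plain \w+ splits at apostrophes and hyphens
plain = Counter(re.findall(r"\w+", sample.lower()))

print(len(freq), "unique words with the custom pattern")
print(len(plain), "unique words with \\w+")
```

With \w+, "don't" and "isn't" each leave behind a stray "t" token, which the stricter pattern avoids; which behaviour you want depends on your text.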
For "more power", you could use some natural language toolkit, but this might be a bit much for this task.