How to print the frequency of each unique word in a string with a for loop in Python
The paragraph is meant to have spaces and random punctuation; I removed them in my for loop with .replace. Then I turned the paragraph into a list with .split() to get ['the', 'title', 'etc'].
Then I wrote two functions: count_words counts how often a given word occurs, but I didn't want it to count every word over again, so I wrote another function that builds a list of unique words.
However, I need to create a for loop that prints each word and how many times it appears, with output something like this:
The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.
I also have a hard time understanding what a for loop essentially does. I read that we should use for loops only for counting and while loops for everything else, but a while loop can also be used for counting.
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... """
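# Strip punctuation. Spaces are left alone: removing them would glue the
# words together, and split() below already handles runs of whitespace.
for r in ((",", ""), ("!", ""), (".", "")):
    paragraph = paragraph.replace(*r)
paragraph_list = paragraph.split()

def count_words(word, word_list):
    # Count how many times word occurs in word_list.
    word_count = 0
    for w in word_list:
        if w == word:
            word_count += 1
    return word_count

def unique(word):
    # Keep the first occurrence of each item, preserving order.
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result

unique_list = unique(paragraph_list)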
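One way to finish this, as a minimal sketch: loop over the unique words and call count_words for each one. The short sample_list is a stand-in for paragraph_list, and report_lines is a hypothetical helper name introduced here only so the lines can be built before they are printed.

```python
def count_words(word, word_list):
    """Count how many times `word` occurs in `word_list`."""
    word_count = 0
    for w in word_list:
        if w == word:
            word_count += 1
    return word_count


def unique(words):
    """Return the words in first-seen order, without duplicates."""
    result = []
    for w in words:
        if w not in result:
            result.append(w)
    return result


def report_lines(word_list):
    """Hypothetical helper: build one output line per unique word."""
    lines = []
    for word in unique(word_list):
        n = count_words(word, word_list)
        lines.append("The word {} appears {} times in the paragraph.".format(word, n))
    return lines


# Stand-in for paragraph_list (assumption: already cleaned and split).
sample_list = ["The", "titled", "track", "The"]
for line in report_lines(sample_list):
    print(line)
```

A for loop simply visits each item of a sequence in turn, so no index or counter has to be managed by hand. A while loop could do the same job, but it would need an explicit index that is incremented on every pass.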
It is better to use re together with get with a default value:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
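import re

word_count = {}
# Split on spaces, commas, curly quotes, !, ?, periods and newlines.
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1
# Adjacent separators produce empty strings; drop that entry if present.
word_count.pop('', None)
for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))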
Output:
The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...
What to do with Chuu’s is debatable. I decided not to split on the apostrophe, but you can add that later if you want.
Update:
The following line splits paragraph.lower() using a regular expression. The advantage is that you can describe multiple separators in one pattern:
re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())
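To see what such a split returns, here is a small sketch on a made-up sentence (not taken from the paragraph). Note that two separators in a row leave an empty string in the result, which is why the empty key has to be dropped afterwards.

```python
import re

# A short made-up sentence, used only for illustration.
pieces = re.split(r' |,|!|\.', "Hi, there!")
print(pieces)

# Filter out the empty strings left by adjacent or trailing separators.
words = [p for p in pieces if p]
print(words)
```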
With respect to this line:
word_count[w] = word_count.get(w, 0) + 1
word_count is a dictionary. The advantage of using get is that you can define a default value in case w is not in the dictionary yet. The line basically updates the count for the word w.
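A tiny sketch of that behaviour, using a made-up three-item list: the first time a word is seen, get returns the default 0, so the count starts at 1 without raising a KeyError.

```python
word_count = {}
for w in ["a", "b", "a"]:
    # get returns 0 when w has not been seen yet.
    word_count[w] = word_count.get(w, 0) + 1
print(word_count)
```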
Beware: your example text is simple, but punctuation rules can be complex or not correctly observed. What if the text contains two adjacent spaces (incorrect, but frequent)? What if the writer is more used to French and puts spaces before and after a colon or semicolon?
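Both cases can be reproduced with a made-up string. In this sketch, splitting on a single space leaves an empty string where the double space was, and the isolated semicolon survives as a "word" of its own; the final filter (one possible remedy, not the only one) keeps only tokens that contain a letter or digit.

```python
text = "Two  spaces here ; and a French-style semicolon"

# Splitting on exactly one space yields an empty string for the double space.
naive = text.split(" ")

# split() with no argument collapses runs of whitespace,
# but the stand-alone ";" is still returned as a token.
collapsed = text.split()

# Keep only tokens that contain at least one alphanumeric character.
tokens = [t for t in collapsed if any(c.isalnum() for c in t)]
print(naive)
print(collapsed)
print(tokens)
```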
I think the 's construct needs special processing. What about:
"""John has a bicycle. Mary says that her one is nicer than John's."""
IMHO the word John occurs twice here, while your algorithm will see 1 John and 1 Johns.
Additionally, as Unicode text is now common on web pages, you should be prepared to find non-ASCII equivalents of spaces and punctuation:
“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
U+00A0 NO-BREAK SPACE
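These characters can be inspected with the standard unicodedata module. The sketch below also illustrates two consequences: string.punctuation only covers ASCII, so it contains none of the curly quotes, while the no-break space does count as whitespace for split().

```python
import string
import unicodedata

# Print the code point and official name of each character.
for ch in "“”’‘\xa0":
    print("U+{:04X} {}".format(ord(ch), unicodedata.name(ch)))

# string.punctuation is ASCII only, so the curly quotes are not in it.
missing = [ch for ch in "“”’‘" if ch not in string.punctuation]

# The no-break space is whitespace, so split() breaks on it.
parts = "a\xa0b".split()
print(missing, parts)
```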
In addition, according to this older question, the best way to remove punctuation is translate. The linked question used Python 2 syntax, but in Python 3 you can do:
import re
import string

paragraph = paragraph.strip()  # remove leading and trailing whitespace
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # map curly quotes and no-break space to ASCII
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)  # remove 's but keep the word itself
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
words = paragraph.split()
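As a check, here is a sketch that wraps the same cleaning steps in a function and applies them to the John example. The name normalize_words is hypothetical, and the sketch assumes re.sub for the 's removal and str.maketrans('', '', ...) for deleting punctuation.

```python
import re
import string


def normalize_words(text):
    """Hypothetical helper: clean the text and return its words."""
    text = text.strip()
    # Map curly quotes and the no-break space to ASCII equivalents.
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
    # Drop a trailing 's while keeping the word it was attached to.
    text = re.sub(r"(\w)'s\b", r"\1", text)
    # Delete all ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()


words = normalize_words(
    "John has a bicycle. Mary says that her one is nicer than John’s."
)
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print(counts["John"])
```

With the 's removed before the punctuation is stripped, both occurrences are counted under John rather than being split between John and Johns.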
Please try this one:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
characterToRemove = (",", "!", ".", "?", '“', '”')
# Remove every punctuation character listed above.
for i in characterToRemove:
    paragraph = paragraph.replace(i, "")
paragraph = paragraph.split()
uniqueWords = set(paragraph)
dictionartWords = {}
# Start every unique word at zero, then count the occurrences.
for i in uniqueWords:
    dictionartWords[i] = 0
for i in paragraph:
    dictionartWords[i] += 1
As a result you get a dictionary that contains the unique words as keys and, as values, the number of times each word occurs in the paragraph:
print(dictionartWords)
{'The': 2, 'like': 1, 'serious': 1, 'titled': 1, 'Rene': 1, 'a': 1, 'artist': 1, 'video': 1, 'c': 7, 'with': 1, 'track': 1, 'to': 1, 'fictional': 1, 'feelings': 1, 'ccc': 1, 'but': 1, 'not': 1, 'has': 1, 'interpret': 1, 'way': 1, 'as': 1, 'of': 1, 'emoticon': 1, 'Heart': 1, 'in': 2, 'adorable': 1, 'love': 1, 'references': 1, 'being': 1, 'Magritte': 1, 'Chuu’s': 1, 'historical': 1, 'such': 1, 'and': 1, 'does': 1, 'music': 1, 'the': 2, 'figures': 1, 'Attack': 1, 'own': 1, 'ways': 1}
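To get from such a dictionary to the sentence format asked for in the question, a for loop over its items is enough. The sketch below uses a three-entry stand-in for the full dictionary; the sorting and the singular/plural choice are additions made here, not part of the answer.

```python
# A small stand-in for the dictionary printed above.
dictionartWords = {'The': 2, 'titled': 1, 'c': 7}

lines = []
# sorted() orders the (word, count) pairs by word; uppercase sorts first.
for word, count in sorted(dictionartWords.items()):
    unit = "time" if count == 1 else "times"
    lines.append("The word {} appears {} {} in the paragraph.".format(word, count, unit))
print("\n".join(lines))
```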