How to print the frequency of each unique word from a string with a for loop in Python
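The paragraph is meant to contain extra spaces and random punctuation, which I remove in my for loop with .replace. Then I turn the paragraph into a list with .split() to get ['the', 'title', 'etc']. Then I wrote a function, count_words, to count a given word, but I did not want it to run for every repeated word, so I wrote another function that builds a unique list. However, I still need a for loop that prints each word and how many times it occurs, with output like this: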
The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.
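I am also having a hard time understanding what a for loop essentially does. I read that we should use for loops only for counting and while loops for everything else, but a while loop can also be used for counting.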
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... """
# Strip punctuation. Spaces are left alone: replacing " " with "" would glue
# the words together, and split() already handles runs of whitespace.
for r in ((",", ""), ("!", ""), (".", "")):
    paragraph = paragraph.replace(*r)
paragraph_list = paragraph.split()

def count_words(word, word_list):
    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result

unique_list = unique(paragraph_list)
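The loop the question asks for could be sketched as follows. This is only a minimal illustration: it reuses the count_words and unique helpers from the question (lightly tidied), and the short paragraph_list is stand-in data, not the real paragraph.

```python
def count_words(word, word_list):
    """Count how many times word occurs in word_list."""
    word_count = 0
    for candidate in word_list:
        if candidate == word:
            word_count += 1
    return word_count


def unique(words):
    """Return the distinct words in first-seen order."""
    result = []
    for w in words:
        if w not in result:
            result.append(w)
    return result


# Stand-in data for illustration only.
paragraph_list = ["The", "titled", "track", "The"]

lines = []
# A for loop visits each item of the iterable exactly once.
for word in unique(paragraph_list):
    times = count_words(word, paragraph_list)
    lines.append("The word {} appears {} times in the paragraph.".format(word, times))

print("\n".join(lines))
```

A for loop fits "do this once for every item in a collection", which is the case here. A while loop repeats until a condition becomes false, so it suits cases where the number of iterations is not known in advance.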
It is better to use re and get with a default value:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
import re

word_count = {}
# raw string so that \? and \. reach the regex engine unchanged
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1
del word_count['']  # adjacent separators produce empty strings
for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))
Output:
The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...
How to treat Chuu's is debatable. I decided not to split on the apostrophe, but you can add that later if needed.
Update:
The line below splits paragraph.lower() with a regular expression. The benefit is that you can describe several separators at once:
re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())
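To see why the empty string has to be dropped after such a split, here is a small sketch on a made-up sample string with a shorter pattern. Whenever two separators sit next to each other, re.split returns an empty string between them.

```python
import re

sample = "a, b."  # made-up text for illustration

# Split on a space, a comma, or a full stop.
parts = re.split(r' |,|\.', sample)
print(parts)  # ['a', '', 'b', '']

# Filtering out the empty strings leaves only the words.
words = [p for p in parts if p]
print(words)  # ['a', 'b']
```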
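About this line:
word_count[w] = word_count.get(w, 0) + 1
word_count is a dictionary. The advantage of get is that you can supply a default value for the case where w is not yet in the dictionary. The line simply updates the count for the word w.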
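A small sketch of what that line does, using a made-up three-word list. The first time a word is seen, get returns the default 0; after that it returns the stored count.

```python
words = ["the", "track", "the"]  # made-up input for illustration

word_count = {}
for w in words:
    # get returns the stored count, or 0 if w is not a key yet
    word_count[w] = word_count.get(w, 0) + 1

print(word_count)  # {'the': 2, 'track': 1}
```

Without the default, word_count[w] would raise KeyError on the first occurrence of each word.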
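Beware: your example text is simple, but punctuation rules can be complex, or may not be followed correctly. What if the text contains two adjacent spaces (incorrect, but frequent)? What if the writer is more used to French and puts a space before and after a colon or semicolon?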
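A minimal sketch of both cases, on a made-up sentence. It assumes that calling split() with no argument and stripping string.punctuation from each token is acceptable for the text at hand; other texts may need different rules.

```python
import string

text = "Two  spaces here ; and a colon : too"  # made-up sample

# Splitting on a single space keeps an empty string between the two spaces.
naive = text.split(" ")

# split() with no argument treats any run of whitespace as one separator.
tokens = text.split()

# Strip punctuation from each token, then drop tokens that end up empty
# (the free-standing ';' and ':').
words = [t.strip(string.punctuation) for t in tokens]
words = [w for w in words if w]

print(words)  # ['Two', 'spaces', 'here', 'and', 'a', 'colon', 'too']
```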
I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer that John's."""
IMHO the word John occurs twice here, whereas your algorithm will see one John and one Johns.
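One possible way to count John twice is sketched below. It is an assumption about how the possessive should be treated, not a general rule: the pattern also shortens contractions such as it's to it. collections.Counter is used here only to keep the sketch short.

```python
import re
from collections import Counter

text = "John has a bicycle. Mary says that her one is nicer that John's."

# Drop a trailing 's (straight or curly apostrophe) but keep the word itself.
without_possessive = re.sub(r"(\w)['’]s\b", r"\1", text)

# \w+ picks out runs of word characters, skipping spaces and punctuation.
words = re.findall(r"\w+", without_possessive)
counts = Counter(words)

print(counts["John"])  # 2
```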
In addition, since Unicode text is now common on web pages, you should be prepared to find high code point equivalents of spaces and punctuation:
“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
U+00A0 NO-BREAK SPACE
Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do the following:
import re
import string

paragraph = paragraph.strip()  # remove leading and trailing whitespace
# map curly quotes and the no-break space to their ASCII equivalents
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
paragraph = re.sub(r"(\w)'s\b", r"\1", paragraph)  # drop the possessive 's, keep the word
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
words = paragraph.split()
Try the following approach:
paragraph = """ The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!.... c c c c c c c ccc"""
characterToRemove = (",", "!", ".", "?", '“', '”')
for i in paragraph:
    if i in characterToRemove:
        paragraph = paragraph.replace(i, "")
paragraph = paragraph.split()
uniqueWords = set(paragraph)
dictionartWords = {}
for i in uniqueWords:
    dictionartWords[i] = 0
for i in paragraph:
    if i in dictionartWords.keys():
        dictionartWords[i] += 1
This gives you a dictionary with the unique words as keys and, as values, the number of times each unique word occurs in the paragraph:
print(dictionartWords)
{'The': 2, 'like': 1, 'serious': 1, 'titled': 1, 'Rene': 1, 'a': 1, 'artist': 1, 'video': 1, 'c': 7, 'with': 1, 'track': 1, 'to': 1, 'fictional': 1, 'feelings': 1, 'ccc': 1, 'but': 1, 'not': 1, 'has': 1, 'interpret': 1, 'way': 1, 'as': 1, 'of': 1, 'emoticon': 1, 'Heart': 1, 'in': 2, 'adorable': 1, 'love': 1, 'references': 1, 'being': 1, 'Magritte': 1, 'Chuu’s': 1, 'historical': 1, 'such': 1, 'and': 1, 'does': 1, 'music': 1, 'the': 2, 'figures': 1, 'Attack': 1, 'own': 1, 'ways': 1}
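As an alternative that is not part of the answer above, the standard library's collections.Counter builds the same kind of mapping in one step. The sketch below uses a shortened, made-up paragraph and prints each word in the format requested in the question.

```python
from collections import Counter

paragraph = "The titled track! The music video, the artist."  # shortened sample

# Remove the same punctuation characters as in the answer above.
for ch in (",", "!", ".", "?", "“", "”"):
    paragraph = paragraph.replace(ch, "")

# Counter maps each distinct word to its number of occurrences.
word_counts = Counter(paragraph.split())

for word, count in word_counts.items():
    print("The word {} appears {} time(s) in the paragraph".format(word, count))
```

Note that The and the are counted separately here, because the text is not lowercased first.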
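Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.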