
How to print frequency of each unique word from a string with for loop in python

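The paragraph is meant to contain spaces and random punctuation, which I remove in my for loop with .replace. Then I turn the paragraph into a list with .split() to get ['the','title','etc']. Next I wrote a function that counts how often a given word occurs, but I did not want to count every occurrence separately, so I wrote another function that builds a list of unique words. What I still need is a for loop that prints each word and how many times it occurs, with output like this: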

The word The appears 2 times in the paragraph.
The word titled appears 1 times in the paragraph.
The word track appears 1 times in the paragraph.

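I am also having a hard time understanding what a for loop essentially does. I read that we should use for loops only for counting and while loops for everything else, yet a while loop can also be used for counting.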

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  """


for r in ((",", ""), ("!", ""), (".", ""), ("  ", "")):
    paragraph = paragraph.replace(*r)

paragraph_list = paragraph.split()


def count_words(word, word_list):

    word_count = 0
    for i in range(len(word_list)):
        if word_list[i] == word:
            word_count += 1
    return word_count

def unique(word):
    result = []
    for f in word:
        if f not in result:
            result.append(f)
    return result
unique_list = unique(paragraph_list)
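
A minimal sketch of the loop the question asks for. It repeats the two helper functions so that it runs on its own, and it uses a shortened sample list (an assumption for illustration, not the cleaned paragraph). A for loop takes each item of a sequence in turn, so looping over the unique list visits every distinct word exactly once.

```python
def count_words(word, word_list):
    """Count how many times `word` occurs in `word_list`."""
    word_count = 0
    for item in word_list:          # the for loop hands over one item at a time
        if item == word:
            word_count += 1
    return word_count


def unique(words):
    """Return the distinct words in order of first appearance."""
    result = []
    for w in words:
        if w not in result:
            result.append(w)
    return result


# Hypothetical shortened input, standing in for paragraph_list.
sample_list = ["The", "titled", "track", "The", "music"]

lines = []
for word in unique(sample_list):
    lines.append("The word {} appears {} times in the paragraph.".format(
        word, count_words(word, sample_list)))

print("\n".join(lines))
```

A while loop could produce the same result with a manually managed index, but the for loop removes that bookkeeping.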

It is better to use re together with get and a default value:

paragraph = """  The titled track “Heart Attack” does not interpret the
feelings of being in love in a serious way,
but with Chuu’s own adorable emoticon like ways. The music video has
references to historical and fictional
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

import re

word_count = {}
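# a raw string keeps \? and \. as regex escapes rather than string escapes
for w in re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower()):
    word_count[w] = word_count.get(w, 0) + 1
word_count.pop('', None)  # drop the empty strings produced between adjacent separators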

for k, v in word_count.items():
    print("The word {} appears {} time(s) in the paragraph".format(k, v))

Output:

The word the appears 4 time(s) in the paragraph
The word titled appears 1 time(s) in the paragraph
The word track appears 1 time(s) in the paragraph
...

How to treat Chuu’s is debatable. I decided not to split on ’, but you can add that later if needed.

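Update:

The line below splits paragraph.lower() with a regular expression. The advantage is that you can list several separators at once:

re.split(r' |,|“|”|!|\?|\.|\n', paragraph.lower())

As for this line:

word_count[w] = word_count.get(w, 0) + 1

word_count is a dictionary. The advantage of get is that you can supply a default value for the case where w is not in the dictionary yet. In effect, the line updates the count for the word w.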
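
To make the role of the default value concrete, here is a small sketch with a made-up word list (the list and the variable missing_raises are illustrative only). Indexing a missing key raises KeyError, whereas get returns the default instead.

```python
words = ["the", "track", "the"]   # illustrative input, not the real paragraph

word_count = {}
for w in words:
    # get(w, 0) returns 0 the first time w is seen, so the + 1 always works
    word_count[w] = word_count.get(w, 0) + 1

# Plain indexing on a missing key fails, which is what get avoids.
try:
    word_count["missing"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(word_count)        # {'the': 2, 'track': 1}
print(missing_raises)    # True
```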

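Beware: the example text is simple, but punctuation rules can be complex, or may not be followed correctly. What if the text contains two adjacent spaces (yes, that is incorrect, but it is common)? What if the writer is more used to French and puts a space before and after a colon or semicolon?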
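
Both situations are easy to check. The sketch below uses a made-up sentence with a doubled space and French-style spacing around the colon. It compares split(' ') with argument-free split() and with re.findall on a word pattern, which is one possible alternative rather than the method used in the answers here.

```python
import re

text = "Two  spaces here : and a colon"   # illustrative sample

by_single_space = text.split(" ")      # the doubled space yields an empty string
by_whitespace = text.split()           # runs of whitespace count as one separator
by_pattern = re.findall(r"\w+", text)  # keeps only runs of word characters

print(by_single_space)
print(by_whitespace)
print(by_pattern)
```

Note that the \w+ pattern would also cut Chuu's into Chuu and s, so possessives still need separate treatment.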

I think the 's construct needs special handling. What about: """John has a bicycle. Mary says that her one is nicer than John's.""" IMHO the word John occurs twice here, whereas your algorithm will see one John and one Johns.

In addition, since Unicode text is now common on web pages, you should be prepared to meet higher code point equivalents of spaces and punctuation:

“ U+201C LEFT DOUBLE QUOTATION MARK
” U+201D RIGHT DOUBLE QUOTATION MARK
’ U+2019 RIGHT SINGLE QUOTATION MARK
‘ U+2018 LEFT SINGLE QUOTATION MARK
  U+00A0 NO-BREAK SPACE
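
If you are unsure which of these characters a text contains, the standard unicodedata module can name them. The sketch below lists every distinct non-ASCII character in a sample string with its code point and official name. The sample string and the helper name describe_non_ascii are made up for illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, 'U+XXXX', name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
            for ch in seen]


sample = "“Heart Attack” and Chuu’s\xa0own"
for ch, code, name in describe_non_ascii(sample):
    print(repr(ch), code, name)
```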

Also, according to this older question, the best way to remove punctuation is translate. The linked question uses Python 2 syntax, but in Python 3 you can do the following:

import re
import string

paragraph = paragraph.strip()                   # remove leading and trailing white space
paragraph = paragraph.translate(str.maketrans('“”’‘\xa0', '""\'\' '))  # map high code punctuation to ASCII
paragraph = re.sub(r"'s\b", "", paragraph)      # remove possessive 's
paragraph = paragraph.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
words = paragraph.split()
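
Put together, the cleaning steps could look like the sketch below, applied to the John example. The function name normalize_words is made up for illustration, and the possessive is stripped with re.sub and the pattern 's\b, which is one reasonable choice rather than the only one.

```python
import re
import string


def normalize_words(text):
    """Split `text` into words after normalising quotes and punctuation."""
    text = text.strip()
    # Map typographic quotes and the no-break space to ASCII equivalents.
    text = text.translate(str.maketrans('“”’‘\xa0', '""\'\' '))
    # Drop a possessive 's at the end of a word (John's -> John).
    text = re.sub(r"'s\b", "", text)
    # Delete all remaining ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()


words = normalize_words(
    "John has a bicycle. Mary says that her one is nicer than John’s.")
print(words.count("John"))   # 2
```

Be aware that this pattern also shortens contractions such as it's to it, which may or may not be what you want.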

Try the following approach:

paragraph = """  The titled track “Heart Attack” does not interpret the 
feelings of being in love in a serious way, 
but with Chuu’s own adorable emoticon like ways. The music video has 
references to historical and fictional 
figures such as the artist Rene Magritte!!....  c c c c c c c ccc"""

characterToRemove = (",","!",".","?",'“','”')
for ch in characterToRemove:
    paragraph = paragraph.replace(ch, "")

paragraph=paragraph.split()
uniqueWords=set(paragraph)
dictionartWords={}
for i in uniqueWords:
    dictionartWords[i]=0

for i in paragraph:
    if i in dictionartWords.keys():
        dictionartWords[i]+=1

This gives you a dictionary whose keys are the unique words and whose values are the number of times each unique word occurs in the paragraph:

print(dictionartWords)

{'The': 2, 'like': 1, 'serious': 1, 'titled': 1, 'Rene': 1, 'a': 1, 'artist': 1, 'video': 1, 'c': 7, 'with': 1, 'track': 1, 'to': 1, 'fictional': 1, 'feelings': 1, 'ccc': 1, 'but': 1, 'not': 1, 'has': 1, 'interpret': 1, 'way': 1, 'as': 1, 'of': 1, 'emoticon': 1, 'Heart': 1, 'in': 2, 'adorable': 1, 'love': 1, 'references': 1, 'being': 1, 'Magritte': 1, 'Chuu’s': 1, 'historical': 1, 'such': 1, 'and': 1, 'does': 1, 'music': 1, 'the': 2, 'figures': 1, 'Attack': 1, 'own': 1, 'ways': 1}
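
For comparison, collections.Counter from the standard library performs the same tally in one step. The sketch below is an alternative to the loops above, not part of the original answer, and uses a short made-up word list.

```python
from collections import Counter

words = "The titled track The music video the artist".split()  # illustrative

counts = Counter(words)   # maps each distinct word to its number of occurrences

# most_common() yields (word, count) pairs, highest count first.
for word, n in counts.most_common():
    print("The word {} appears {} time(s) in the paragraph".format(word, n))
```

Because the comparison is case-sensitive, The and the are counted separately here, just as in the dictionary above.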
