Removing stop words in Python

Question

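I have a text file for which I count the total number of lines, characters, and words. How can I clean the data by removing stop words (for example, "a") using string.replace()?

I currently have the code below.

For example, if the text file contains the following line: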

"The only words to count are Buttons and Shares for this text"

It should output:

1 Buttons
1 Shares
1 words
1 only
1 text

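My code does not output the stop words I blacklisted, but it also strips those stop words out of any other word that happens to contain them. Here is the output of my code: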

1 Butns (this is a problem)
1 Shs (this is a problem)
1 words
1 only
1 text

Here is the code I have so far.

# Open the input file
fname = open('2013_honda_accord.txt', 'r').read()

# COUNT CHARACTERS
num_chars = len(fname)

# COUNT LINES 
num_lines = fname.count('\n')

#COUNT WORDS
fname = fname.lower() # convert the text to lower first

# Remove Stop words 
blacklist = ["the", "to", "are", "and", "for", "this" ]  # Blacklist of words to be filtered out
for word in blacklist:
   fname = fname.replace(word, "")

# Removing special characters from the word count
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz1234567890-' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))

d = {}
for w in words:
    # if the word is repeated - start count
    if w in d:    
       d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
       d[w] = 1

# Add the sum of all the repeated words 
num_words = sum(d[w] for w in d)

lst = [(d[w], w) for w in d]
# sort the list of words in alpha for the same count 
lst.sort()
# list word count from greatest to lowest (will also show the sort in reverse order Z-A)
lst.reverse()

# output the total number of characters
print('Your input file has characters = ' + str(num_chars))
# output the total number of lines
print('Your input file has num_lines = ' + str(num_lines))
# output the total number of words
print('Your input file has num_words = ' + str(num_words))

print('\n The 30 most frequent words are \n')

# print the number of words as a count from the text file with the sum of each word used within the text
i = 1
for count, word in lst[:10000]:
    print('%2s.  %4s %s' % (i, count, word))
    i += 1

Thanks.

Answer 1

Assuming your analysis does not need punctuation, you can do the following:

punctuation_list = ['?',',','.'] # non exhaustive

for punctuation in punctuation_list:
   fname = fname.replace(punctuation, "")

blacklist = ["the", "to", "are", "and", "for", "this" ]  

for word in blacklist:
   fname = fname.replace(" "+word+" ", " ") #replace StopWord preceded by a space and followed by a space with a space
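One limitation of matching on the surrounding spaces is that a stop word at the very start or end of the text, or directly next to a newline, has no space on one side and is left in place. In the sample line from the question, the leading "the" is such a case. Below is a minimal sketch of an alternative, assuming the standard-library re module is acceptable in place of string.replace(). The remove_stop_words helper name is hypothetical and does not come from the original post.

```python
import re

blacklist = ["the", "to", "are", "and", "for", "this"]

def remove_stop_words(text, stop_words):
    """Remove whole-word occurrences of stop_words from text (hypothetical helper)."""
    # \b matches only at word boundaries, so "to" inside "buttons" is left alone.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"
    cleaned = re.sub(pattern, "", text)
    # Collapse the whitespace left behind by the removed words.
    # Note: this also folds newlines into single spaces.
    return " ".join(cleaned.split())

sample = "the only words to count are buttons and shares for this text"
print(remove_stop_words(sample, blacklist))
```

Because the whitespace cleanup folds newlines into spaces, a line count should be taken before calling the helper, as the question's code already does.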

Answer 2

You want to remove "to" and "are" and replace them.

# Remove Stop words 
blacklist = ["the", "to", "are", "and", "for", "this" ]  
# Blacklist of words to be filtered out
for word in blacklist:
   fname = fname.replace(word, "")

Answer 3

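Move the stop-word filtering to the step where d (the dictionary mapping words to counts) is built. Adding the check "if w not in blacklist:" at that point skips the blacklisted words, which removes the stop words without altering any other words.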

#COUNT WORDS
fname = fname.lower() # convert the text to lower first

# Removing special characters from the word count
get_alphabetical_characters = lambda word: "".join([char if char in 'abcdefghijklmnopqrstuvwxyz1234567890-' else '' for char in word])
words = list(map(get_alphabetical_characters, fname.split()))

# Remove Stop words 
blacklist = ["the", "to", "are", "and", "for", "this" ]  # Blacklist of words to be filtered out

d = {}
for w in words:
  # Do not count words in the blacklist
  if w not in blacklist:
    # if the word is repeated - start count
    if w in d:    
      d[w] += 1
    # if the word is only used once then give it a count of 1
    else:
      d[w] = 1
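The same filter-then-count idea can be sketched more compactly with collections.Counter from the standard library. This is an illustrative variant, not the answerer's code: the count_words helper and the use of a set for the blacklist are assumptions, and words that become empty after the special characters are stripped are dropped here, which the loop above does not do.

```python
from collections import Counter

ALLOWED = set("abcdefghijklmnopqrstuvwxyz1234567890-")
blacklist = {"the", "to", "are", "and", "for", "this"}  # a set gives fast membership tests

def count_words(text, stop_words):
    """Count whole words in text, skipping stop_words (hypothetical helper)."""
    counts = Counter()
    for raw in text.lower().split():
        # Keep only the allowed characters, as the original lambda does.
        word = "".join(ch for ch in raw if ch in ALLOWED)
        # Skip stop words and tokens that were pure punctuation.
        if word and word not in stop_words:
            counts[word] += 1
    return counts

counts = count_words(
    "The only words to count are Buttons and Shares for this text", blacklist
)
for word, count in counts.most_common():
    print(count, word)
```

Since the comparison is made on whole words after splitting, "buttons" and "shares" stay intact even though they contain "to" and "are".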


3 answers

Answer 1
1 accepted 2016-04-24 23:06:22

Answer 2
0 2016-04-24 22:58:24

Answer 3
0 2016-04-24 23:18:57
