Removing words from a Python list?

Question

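I'm a complete newbie to Python and web scraping, and I've run into a problem early on. I've been able to scrape the headlines from a Dutch news website and split them into words. Now my goal is to remove certain words from the results. For example, I don't want words like "het" and "om" in the list. Does anyone know how I can do this? (I'm using Python requests and BeautifulSoup.)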

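import requests
from bs4 import BeautifulSoup

url = "http://www.nu.nl"
r = requests.get(url)
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("span", {"class": "title"})
for item in g_data:
    print(item.text.split())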

Answer 1

In natural language processing, the common words you exclude are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you only want the set of words that appear on the page, using a set is probably your best option. Something like the following might work:

# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = set([
    'het',
    'om'
])

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
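
The comment above suggests keeping the stop words in a file. A minimal sketch of that idea follows, assuming one word per line. The helper name `load_stop_words`, the file contents, and the sample headlines are all hypothetical; an in-memory `io.StringIO` stands in for a real file so the snippet runs on its own.

```python
import io


def load_stop_words(lines):
    """Return a set of stop words, one per line, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


# Stand-in for open("stop_words.txt"); the file name and contents are hypothetical.
stop_words_file = io.StringIO("het\nom\n\nde\n")
STOP_WORDS = load_stop_words(stop_words_file)

# Hypothetical headlines standing in for the scraped item.text values.
headlines = ["het kabinet wil om tafel", "regen in de ochtend"]

all_words = set()
for headline in headlines:
    all_words |= set(headline.split())
all_words -= STOP_WORDS
print(sorted(all_words))
```

With a real file you would pass the open file object to the helper instead of the `StringIO` stand-in.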

On the other hand, if you care about order, you can simply avoid adding stop words to your list in the first place.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
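
As the comment above hints, the filtering step can be pulled out into its own function. This is only a sketch: the name `remove_stop_words` and the sample headline are hypothetical, and the lowercasing is an added assumption (not part of the code above) so that a headline-initial "Het" is treated like "het".

```python
def remove_stop_words(words, stop_words):
    """Return the words not in stop_words, keeping their original order.

    Each word is lowercased before the lookup. That is an assumption made
    for this sketch; drop .lower() for an exact, case-sensitive match.
    """
    return [word for word in words if word.lower() not in stop_words]


STOP_WORDS = {"het", "om"}

# Hypothetical headline standing in for one scraped item.text value.
headline = "Het kabinet gaat om tafel"
print(remove_stop_words(headline.split(), STOP_WORDS))
# prints ['kabinet', 'gaat', 'tafel']
```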

If you don't care about order but do want frequencies, you can create a dict of the words to count (or use a defaultdict for convenience).

from collections import defaultdict
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
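
As an alternative to the defaultdict above, the standard library's `collections.Counter` can do the same counting and also sort by frequency. This is a swapped-in technique rather than what the answer uses, and the sample headlines are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"het", "om"}

# Hypothetical headlines standing in for the scraped item.text values.
headlines = ["het weer om tien uur", "het weer morgen"]

# Count every word across all headlines, skipping the stop words.
word_counts = Counter(
    word
    for headline in headlines
    for word in headline.split()
    if word not in STOP_WORDS
)

# most_common() returns (word, count) pairs, highest count first.
for word, count in word_counts.most_common():
    print('%s: %d' % (word, count))
```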

Answer 1
0 Accepted 2015-04-05 20:25:50
