
Removing words from Python lists?

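I'm a complete noob at Python and web scraping, and I ran into some problems pretty early on. I've been able to scrape the headlines from a Dutch news website and split them into words. Now my goal is to remove certain words from the results. For example, I don't want words like "het" and "om" in the list. Does anyone know how I could do this? (I'm using Python requests and BeautifulSoup.)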

import requests
from bs4 import BeautifulSoup

url = "http://www.nu.nl"
r = requests.get(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("span", {"class": "title"})
for item in g_data:
    print(item.text.split())

In natural language processing, the common words that get filtered out like this are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you just want the set of words that appear on the page, using sets is probably the way to go. Something like the following might work:

# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = set([
    'het',
    'om'
])

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
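The comment above suggests keeping the stop words in a file. A minimal sketch of that idea, written for Python 3 and assuming a plain UTF-8 text file with one word per line (the helper names `parse_stop_words` and `load_stop_words` are made up for illustration):

```python
def parse_stop_words(lines):
    """Return the set of non-empty, whitespace-stripped entries in `lines`."""
    return {line.strip() for line in lines if line.strip()}


def load_stop_words(path):
    """Read stop words from a UTF-8 text file with one word per line."""
    with open(path, encoding="utf-8") as handle:
        return parse_stop_words(handle)


# Demonstration with in-memory lines instead of a real file.
STOP_WORDS = parse_stop_words(["het\n", "om\n", "\n", " de \n"])
```

With a real file you would call `load_stop_words` with its path instead; blank lines and surrounding whitespace are ignored either way.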

On the other hand, if you care about order, you can simply avoid adding stop words to your list.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
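As the comment in the loop hints, the filtering step can be pulled out into its own function. A rough sketch using a list comprehension, where `remove_stop_words` is a hypothetical name and the headline is a made-up sample rather than real data from the site:

```python
def remove_stop_words(words, stop_words):
    """Return `words` in their original order, minus anything in `stop_words`."""
    return [word for word in words if word not in stop_words]


STOP_WORDS = {"het", "om"}

# Made-up sample headline, standing in for item.text from the scrape.
headline = "Kabinet wil het plan om de regels te wijzigen".split()
filtered = remove_stop_words(headline, STOP_WORDS)
```

Note that the membership test is case-sensitive, so a headline starting with "Het" would keep that word unless you lowercase the words before comparing.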

If you don't care about order but do want frequencies, you can build a dict that maps each word to its count (or use a defaultdict for convenience).

from collections import defaultdict
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
# items() works on both Python 2 and 3; iteritems() exists only on Python 2.
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
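A possible alternative to the defaultdict version is `collections.Counter` from the standard library, which also gives you the words sorted by frequency through `most_common()`. This is only a sketch; the two headlines are made-up sample strings standing in for the scraped `item.text` values:

```python
from collections import Counter

STOP_WORDS = {"het", "om"}

# Made-up sample headlines, standing in for the scraped span texts.
headlines = ["het weer om acht uur", "weer file om Utrecht"]

word_counts = Counter(
    word
    for headline in headlines
    for word in headline.split()
    if word not in STOP_WORDS
)

# most_common() yields (word, count) pairs, highest count first.
for word, count in word_counts.most_common():
    print("%s: %d" % (word, count))
```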

