
Removing words from Python lists?

I am a complete noob in Python and web scraping and have run into some issues quite early. I have been able to scrape the titles from a Dutch news website and split them into words. Now my objective is to remove certain words from the results. For instance, I don't want words like "het" and "om" in the list. Does anyone know how I can do this? (I'm using Python requests and BeautifulSoup.)

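import requests
from bs4 import BeautifulSoup

url = "http://www.nu.nl"
r = requests.get(url)
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("span", {"class": "title"})
for item in g_data:
    print(item.text.split())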

In natural language processing, common words that are excluded from analysis are called "stop words".

Do you want to preserve the order and count of each word, or do you just want the set of words that appear on the page?

If you just want the set of words that appear on the page, using sets is probably the way to go. Something like the following might work:

# It's probably more common to define your STOP_WORDS in a file and then read
# them into your data structure to keep things simple for large numbers of those
# words.
STOP_WORDS = set([
    'het',
    'om'
])

all_words = set()
for item in g_data:
    all_words |= set(item.text.split())
all_words -= STOP_WORDS
print(all_words)
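The comment in the snippet above suggests keeping the stop words in a file. A minimal sketch of that idea, assuming a plain-text file with one word per line; the function name `load_stop_words` and the file layout are illustrative assumptions, not part of the original answer:

```python
def load_stop_words(path):
    """Read one stop word per line from a UTF-8 text file into a set.

    Assumptions: blank lines are skipped, surrounding whitespace is
    stripped, and every word is lowercased.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}
```

With a hypothetical file named `stop_words.txt`, you could then write `STOP_WORDS = load_stop_words("stop_words.txt")` in place of the hard-coded set.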

If, on the other hand, you care about the order, you could just refrain from adding stop words to your list.

words_in_order = []
for item in g_data:
    words_from_span = item.text.split()
    # You might want to break this out into its own function for modularity.
    for word in words_from_span:
        if word not in STOP_WORDS:
            words_in_order.append(word)
print(words_in_order)
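One thing to watch for: the membership test above is case-sensitive, so a title that starts with "Het" would slip through. Here is a sketch of the filter broken out into its own function, as the comment suggests, assuming you want case-insensitive matching and want leading and trailing punctuation ignored. The name `remove_stop_words` is hypothetical and the sample title is made up:

```python
import string

STOP_WORDS = {"het", "om"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words in their original order, minus any stop words.

    The comparison ignores case and leading/trailing ASCII punctuation;
    the words that survive are returned unchanged.
    """
    kept = []
    for word in words:
        normalized = word.strip(string.punctuation).lower()
        if normalized not in stop_words:
            kept.append(word)
    return kept


# Made-up title standing in for `item.text` from a scraped span.
sample = "Het kabinet komt bijeen om te praten".split()
print(remove_stop_words(sample))
# prints ['kabinet', 'komt', 'bijeen', 'te', 'praten']
```

Whether to normalize at all depends on your data; if capitalization carries meaning for you, drop the `.lower()` call.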

If you don't care about order but you want frequency, you could create a dict (or defaultdict for convenience) mapping words to counts.

from collections import defaultdict
word_counts = defaultdict(int)
for item in g_data:
    # You might want to break this out into its own function for modularity.
    for word in item.text.split():
        if word not in STOP_WORDS:
            word_counts[word] += 1
for word, count in word_counts.items():
    print('%s: %d' % (word, count))
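As an alternative to `defaultdict`, the standard library's `collections.Counter` does the same bookkeeping and can also sort by frequency. This is a sketch of a swapped-in technique rather than the approach shown above; `count_words` is a hypothetical helper and the titles are made up:

```python
from collections import Counter

STOP_WORDS = {"het", "om"}


def count_words(titles, stop_words=STOP_WORDS):
    """Count every non-stop word across an iterable of title strings."""
    counts = Counter()
    for title in titles:
        counts.update(w for w in title.split() if w not in stop_words)
    return counts


# Made-up titles standing in for `item.text` from the scraped spans.
titles = ["het weer in het land", "weer file om Utrecht"]
for word, count in count_words(titles).most_common():
    print("%s: %d" % (word, count))
```

`most_common()` yields the pairs from most to least frequent, so "weer" (counted twice here) is printed first.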
