Blacklists in Lists Python, while grabbing data from webpages

Basically, I've made a very messy bit of code to grab links from a Bing search query. The problem I'm facing is that I get far too many Bing-related links.

I've tried removing them with the code I have now, but I'd rather use a blacklist.

Here is my code:

import re, urllib

class MyOpener(urllib.FancyURLopener):
    # Spoof a browser user agent so Bing serves normal result pages.
    version = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15'

myopener = MyOpener()
dork = raw_input("Dork:")
pagevar = ['1','11','23','34','45','46','47','58','69']

for page in pagevar:
    bingdork = "http://www.bing.com/search?q=" + str(dork) + "&first=" + str(page)
    bingdork = bingdork.replace(" ", "+")
    # Pull every href value out of the result page.
    links = re.findall('''href=["'](.[^"']+)["']''', myopener.open(bingdork).read(), re.I)
    toremove = []
    for i in links:
        if "bing.com" in i:
            toremove.append(i)
        elif "wlflag.ico" in i:
            toremove.append(i)
        elif "/account/web?sh=" in i:
            toremove.append(i)
        elif "/?FORM" in i:
            toremove.append(i)
        elif "javascript:void(0);" in i:
            toremove.append(i)
        elif "javascript:" in i:
            toremove.append(i)
        elif "go.microsoft.com/fwlink" in i:
            toremove.append(i)
        elif "g.msn.com" in i:
            toremove.append(i)
        elif "onlinehelp.microsoft.com" in i:
            toremove.append(i)
        elif "feedback.discoverbing.com" in i:
            toremove.append(i)
        elif "/?scope=web" in i:
            toremove.append(i)
        elif "/explore?q=" in i:
            toremove.append(i)
        elif "/images/" in i:
            toremove.append(i)
        elif "/videos/" in i:
            toremove.append(i)
        elif "/maps/" in i:
            toremove.append(i)
        elif "/news/" in i:
            toremove.append(i)
    # Drop the blacklisted links, then print what is left for this page.
    for i in toremove:
        links.remove(i)
    for i in links:
        print i
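
For reference, the whole elif chain above is really just a blacklist of substrings, so it could be driven by a single list; here is a rough sketch of what I mean (the BLACKLIST name and the helper are just illustrative, not part of my actual script):

BLACKLIST = [
    "bing.com", "wlflag.ico", "/account/web?sh=", "/?FORM",
    "javascript:", "go.microsoft.com/fwlink", "g.msn.com",
    "onlinehelp.microsoft.com", "feedback.discoverbing.com",
    "/?scope=web", "/explore?q=", "/images/", "/videos/",
    "/maps/", "/news/",
]

def is_valid(link):
    # Keep a link only if none of the blacklisted substrings appear in it.
    return not any(bad in link for bad in BLACKLIST)

# Inside the page loop, instead of the elif chain:
links = [i for i in links if is_valid(i)]
for i in links:
    print i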

Say I typed in: Dork: cfm id

The results I get are:

http://pastebin.com/eGgUKYwV

The results I would like to get are:

http://pastebin.com/Xi28BzXs

I want to remove things like the following:

/search?q=cfm+id&lf=1&qpvt=cfm+id
/account/web?sh=5&ru=%2fsearch%3fq%3dcfm%2520id%26first%3d69&qpvt=cfm+id
/search?q=cfm+id&rf=1&qpvt=cfm+id
/search?q=cfm+id&first=69&format=rss
/search?q=cfm+id&first=69&format=rss
/?FORM=Z9FD1
javascript:void(0);
/account/general?ru=http%3a%2f%2fwww.bing.com%2fsearch%3fq%3dcfm+id%26first%3d69&FORM=SEFD
/?scope=web&FORM=HDRSC1
/images/search?q=cfm+id&FORM=HDRSC2
/videos/search?q=cfm+id&FORM=HDRSC3

Basically, I need a filter that lets me grab only the VALID links from Bing and strips out all the Bing junk.

Many thanks, BK. P.S. Sorry if my explanation is poor.

Have you tried the HTML-parsing route, using css/xpath queries with beautifulsoup, lxml, or html5lib (lxml.etree preferred)? In pseudocode:

html = htmlparse.parse(open(url))
hrefs = []

for a in html.xpath('//a'):
    if a['href'].startswith('http://') or a['href'].startswith('https://'):
        hrefs.append(a['href'])

Of course this is pseudocode; adjust it to whichever of beautifulsoup, lxml, or html5lib you end up using.
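
To make that concrete, a minimal runnable sketch with lxml, reusing the myopener and bingdork variables from the code in the question (adapt as needed if you go with beautifulsoup or html5lib instead):

import lxml.html

# Parse the Bing result page and keep only absolute http(s) links.
page = myopener.open(bingdork).read()
doc = lxml.html.fromstring(page)

hrefs = []
for a in doc.xpath('//a[@href]'):
    href = a.get('href')
    if href.startswith('http://') or href.startswith('https://'):
        hrefs.append(href)

for href in hrefs:
    print href

Note that lxml reads attributes with a.get('href'); with BeautifulSoup the a['href'] form from the pseudocode works directly.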

If what you are looking for is more like whitelist-based cleaning/sanitizing of the page HTML, you might like CleanText; you can customize the filtering further with regular expressions on the attributes, but that is left as an exercise ;)
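
As a small illustration of that whitelist direction (plain re over the extracted hrefs, not the CleanText API itself), a pattern like this keeps only external http(s) links that do not point back at bing.com:

import re

# Whitelist-style filter: keep only hrefs that start with http:// or https://
# and do not contain "bing.com" anywhere in the URL.
EXTERNAL = re.compile(r'^https?://', re.I)
hrefs = [h for h in hrefs if EXTERNAL.match(h) and 'bing.com' not in h]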
