Extracting HTML links with Python

Question

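I'm trying to use Python to extract the iframe src from each site in a given list. For example, my input would be A.com, B.com, C.com, and if each of these sites has an iframe linking to D.com, E.com, F.com respectively (or "None" if the site has no iframe), then I'd like the output to look like this: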

Site    Iframe Src
A.com    D.com
B.com    E.com
C.com    F.com

Currently, I have something like this:

from collections import defaultdict
import urllib2
import re

def PrintLinks(website):
    # Regex intended to capture the src URL of a frame tag.
    regexp_link = r'''<frame src =((http|ftp)s?://.*?)'''
    pattern = re.compile(regexp_link)
    links = [None] * len(website)
    for counter, url in enumerate(website):
        # Download the page and collect every regex match for it.
        html_page = urllib2.urlopen(url)
        html = html_page.read()
        links[counter] = re.findall(pattern, html)
    return links

def main():
    website = ["A.com", "B.com", "C.com"]

Is this the best approach, and how would I get the output into the format I want? Thanks!

Answer 1

You don't need to reinvent the wheel with regular expressions. There are some great Python packages that do this for you, the best known probably being BeautifulSoup.

Install BeautifulSoup and httplib2 with pip, then try:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

sites=['http://www.site1.com', 'http://www.site2.com', 'http://www.site3.com']
http = httplib2.Http()

for site in sites:
    status, response = http.request(site)
    for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
        print site + ' ' + iframe['src']
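
The snippet above targets Python 2 and BeautifulSoup 3 (`from BeautifulSoup import ...`, the `print` statement), and it prints nothing for a site without an iframe, whereas the question asks for "None" in that case. As a rough sketch of the parsing and formatting step on Python 3, here is a version that uses only the standard library's `html.parser` in place of BeautifulSoup, which is a swapped-in technique rather than what the answer uses. The names `IframeSrcParser`, `iframe_sources`, and `format_report` are made up for illustration, and the sample HTML strings are placeholders for pages you would fetch yourself (for example with `urllib.request.urlopen`).

```python
from html.parser import HTMLParser


class IframeSrcParser(HTMLParser):
    """Collects the src attribute of every <iframe> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_sources(html):
    """Return the list of iframe src values found in an HTML string."""
    parser = IframeSrcParser()
    parser.feed(html)
    parser.close()
    return parser.sources


def format_report(pages):
    """Build the 'Site / Iframe Src' table from a {site: html} mapping."""
    lines = ["Site\tIframe Src"]
    for site, html in pages.items():
        # Fall back to the literal "None" when a page has no iframe.
        for src in iframe_sources(html) or ["None"]:
            lines.append(f"{site}\t{src}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample pages standing in for downloaded HTML.
    sample_pages = {
        "A.com": '<html><body><iframe src="D.com"></iframe></body></html>',
        "B.com": "<html><body><p>no iframe here</p></body></html>",
    }
    print(format_report(sample_pages))
```

Keeping the fetching separate from the parsing means the parsing can be tried on saved HTML first. A site with several iframes gets one row per iframe, which is one possible reading of the format asked for in the question.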

Solution 1
0 2014-05-20 00:06:08