retrieve specific links from web page using python and BeautifulSoup

I have been trying to retrieve an href link from a page and use it as a variable for the next href link. But I am stuck at a point where I have multiple href links with different file extensions (like zip, md5, etc.) and I only need the file with the zip extension. Here is the code I am trying to implement.

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://example.com')
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/abc' in link['href']:
            basename = link['href'].split("/")[11]
            print basename

            # use the basename found above to build the next request
            status, response = http.request('http://example.com/%s' % basename)
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if '/abc' in link['href']:
                        basename = link['href'].split("/")[11]
                        print basename

Try it:

import os
# YOUR CODE here

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if '/abc' in link['href']:
            basename = link['href'].split("/")[11]
            # check the file extension (splitext keeps the leading dot, e.g. '.zip')
            filename, file_extension = os.path.splitext(basename)
            print basename, file_extension
            if file_extension.lower() != '.zip':
                # skip everything that is not a zip file
                continue
            # YOUR LAST CODE
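
As a quick sanity check, here is a standalone sketch (the basenames below are made up, not taken from the original pages) showing why the comparison has to be against '.zip' rather than 'zip': os.path.splitext keeps the leading dot in the extension it returns.

import os

# hypothetical basenames, as they might come out of link['href'].split("/")[11]
basenames = ['build-1.2.3.zip', 'build-1.2.3.md5', 'readme.txt']

for basename in basenames:
    filename, file_extension = os.path.splitext(basename)
    # splitext('build-1.2.3.zip') returns ('build-1.2.3', '.zip') -- note the dot
    if file_extension.lower() != '.zip':
        continue
    print(basename)  # only 'build-1.2.3.zip' gets printed

With the dot included in the comparison, only the zip links survive the loop and can then be passed on to the next http.request call.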
