
How to download specific GIF images (condition: phd*.gif) from a website using Python's BeautifulSoup?

I have the following code that downloads all images from a web-link.

from BeautifulSoup import BeautifulSoup as bs
import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
import os
import sys

def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to /test/"""
    soup = bs(urlopen(url))
    parsed = list(urlparse.urlparse(url))

    for image in soup.findAll("img"):
        print "Image: %(src)s" % image
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlparse.urlunparse(parsed), outpath)

def _usage():
    print "usage: python dumpimages.py http://example.com [outpath]"

if __name__ == "__main__":
    url = sys.argv[-1]
    out_folder = "/test/"
    if not url.lower().startswith("http"):
        out_folder = sys.argv[-1]
        url = sys.argv[-2]
        if not url.lower().startswith("http"):
            _usage()
            sys.exit(-1) 
    main(url, out_folder)

I want to modify it so that it downloads only images named as 'phd210223.gif' (for example), that is, images satisfying the condition: 'phd*.gif'

And I want to put it in a loop, so that after fetching such images from one webpage, it increments the page ID by 1 and downloads the same from the next page: ' http://www.example.com/phd.php?id=2 '

How can I do this?

A regular expression can solve this. When the pattern is found in the string or URL, re.search returns a match object; otherwise it returns None.

import re
reg = re.compile(r'phd.*\.gif$')  # raw string keeps the backslash literal
str1 = 'path/phd12342343.gif'
str2 = 'path/dhp12424353153.gif'
print(reg.search(str1))  # a match object
print(reg.search(str2))  # None
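
One way this check might slot into the download loop is to test only the last path component of each src before calling urlretrieve. The following is a minimal sketch under that assumption; the helper name is_phd_gif and the sample paths are hypothetical and not part of the original answer.

```python
import re

# Assumption: wanted files look like "phd" + anything + ".gif" (the question's phd*.gif).
PHD_GIF = re.compile(r"phd.*\.gif$")

def is_phd_gif(src):
    """Return True if the last path component of src matches phd*.gif."""
    filename = src.split("/")[-1]
    # match() anchors at the start of the file name, so "dhp...gif" is rejected
    return PHD_GIF.match(filename) is not None

# Hypothetical sample paths; the third has "phd" only in the directory name.
sources = ["path/phd12342343.gif", "path/dhp12424353153.gif", "phd/logo.gif"]
wanted = [s for s in sources if is_phd_gif(s)]
print(wanted)
```

Matching against the file name rather than the whole path avoids accepting an image just because a directory happens to be called phd.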

Instead of checking the name inside the loop, you can use BeautifulSoup's built-in support for regular expressions. Pass the compiled regular expression as the value of the src argument:

import re

from bs4 import BeautifulSoup as bs # note, you should use beautifulsoup4

for image in soup.find_all("img", src=re.compile(r'phd\d+\.gif$')):
    ...

The regular expression phd\d+\.gif$ matches phd, followed by one or more digits, followed by a dot, followed by gif at the end of the string.

Note that you are using BeautifulSoup3, which is outdated and unmaintained. Switch to beautifulsoup4:

pip install beautifulsoup4
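
The second part of the question, stepping through page IDs, could be handled by generating the page URLs up front and calling the download function on each. This is a rough sketch, assuming the pages follow the http://www.example.com/phd.php?id=N pattern from the question and that the IDs are consecutive; page_urls, first_id, and last_id are hypothetical names, and how to detect the last page is left open.

```python
def page_urls(base, first_id, last_id):
    """Yield base?id=N for each N from first_id to last_id, inclusive."""
    for page_id in range(first_id, last_id + 1):
        yield "%s?id=%d" % (base, page_id)

# Assumed base URL and ID range, for illustration only.
urls = list(page_urls("http://www.example.com/phd.php", 1, 3))
print(urls)
# Each URL could then be passed to main(url, out_folder) from the question.
```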

I personally prefer Python's standard-library tools, so I use html.parser. What you need is something like this:

import re, urllib.request, html.parser

class LinksHTMLParser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.gifs = list()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    gifName = re.split("/", value)[-1]
                    # the phd*.gif condition from the question
                    if re.match(r"phd.*\.gif$", gifName):
                        self.gifs.append(value)

parser = LinksHTMLParser()
parser.feed(urllib.request.urlopen("YOUR URL HERE").read().decode("utf-8"))
for gif in parser.gifs:
    # urlretrieve takes the URL first, then the local destination path
    urllib.request.urlretrieve(gif, "LOCAL PATH HERE")
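
Since the question is about img tags rather than links, a variant of the same idea might collect src attributes instead. Below is a self-contained sketch that runs on an inline HTML string, so no network access is needed; the sample markup, the class name PhdGifParser, and the base URL are made up for illustration. urljoin resolves relative paths against the page URL.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

class PhdGifParser(HTMLParser):
    """Collect absolute URLs of <img> tags whose file name is phd<digits>.gif."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.gifs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Check only the file name, then resolve the src against the page URL
        if src and re.match(r"phd\d+\.gif$", src.split("/")[-1]):
            self.gifs.append(urljoin(self.base_url, src))

# Hypothetical page content standing in for a real response body.
sample_html = (
    '<img src="/comics/phd210223.gif">'
    '<img src="logo.png">'
    '<img src="http://cdn.example.com/phd1.gif">'
)
parser = PhdGifParser("http://www.example.com/phd.php?id=1")
parser.feed(sample_html)
print(parser.gifs)
```

Each collected URL could then be handed to urllib.request.urlretrieve together with a local file name.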

