Extract image from website using Selenium Webdriver (Python)

Question

I need to crawl several thousand subsites and extract information.

Unfortunately, the information in question is not regular HTML text but an image with the text rendered onto it dynamically.

How can I extract these images to further process them? I'm using Selenium Webdriver on Python.

Answer 1

There are very few things that you cannot do with mechanize plus BeautifulSoup. Further processing of the images can be done with pytesser; however, I have no experience with it. It would be interesting to get advice from someone knowledgeable about OCR in Python.

import mechanize
from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

browser = mechanize.Browser()
html = browser.open("http://www.dreamstime.com/free-photos").read()
soup = BeautifulSoup(html, "html.parser")
for ii, image in enumerate(soup.find_all('img')):
    _src = image.get('src', '')  # some <img> tags have no src attribute
    # Only absolute links to JPEG files are handled here.
    if _src.startswith('http://') and _src.endswith('.jpg'):
        print('Storing this image:', _src)
        data = browser.open(_src).read()
        fl = 'image' + str(ii) + '.jpg'
        # The with block closes the file automatically.
        with open(fl, 'wb') as f:
            f.write(data)
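
Since the question asks about Selenium and the images are rendered dynamically, here is a rough sketch of two variations on the loop above. mechanize does not execute JavaScript, so it only sees image URLs that are already in the HTML. The function names pick_image_urls and screenshot_images, the images output directory, and the choice of Firefox are my own assumptions, not something from the question. The first helper resolves relative src values instead of skipping them; the second asks Selenium (assuming the Selenium 4 API) to screenshot each visible img element as the browser rendered it.

```python
import os
from urllib.parse import urljoin, urlparse


def pick_image_urls(page_url, srcs, extensions=(".jpg", ".jpeg", ".png")):
    """Resolve src values against page_url; keep http(s) URLs with a wanted extension."""
    urls = []
    for src in srcs:
        if not src:
            continue  # <img> without a usable src
        absolute = urljoin(page_url, src)  # turns relative links into absolute ones
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.path.lower().endswith(extensions):
            urls.append(absolute)
    return urls


def screenshot_images(page_url, out_dir="images"):
    """Save every visible <img> on page_url as a PNG and return the file paths."""
    # Imported here so pick_image_urls stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    saved = []
    try:
        driver.get(page_url)
        for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")):
            if not img.is_displayed():
                continue  # hidden elements cannot be screenshotted
            path = os.path.join(out_dir, "image%d.png" % i)
            # screenshot_as_png returns the element's rendered pixels as PNG bytes.
            with open(path, "wb") as f:
                f.write(img.screenshot_as_png)
            saved.append(path)
    finally:
        driver.quit()  # always close the browser, even on errors
    return saved
```

Something like screenshot_images("http://www.dreamstime.com/free-photos") should then leave PNG files in the images directory, which can be handed to an OCR tool afterwards. The screenshot route is slower than downloading the files, but it does not depend on the image having a fetchable URL.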
