How to get a favicon using Beautiful Soup and Python

Question

I wrote some rough code just for learning, but it doesn't work for most sites. Here is the code:

import urllib2, re
from BeautifulSoup import BeautifulSoup as Soup

class Founder:
    def Find_all_links(self, url):
        page_source = urllib2.urlopen(url)
        a = page_source.read()
        soup = Soup(a)

        a = soup.findAll(href=re.compile(r'/.a\w+'))
        return a
    def Find_shortcut_icon (self, url):
        a = self.Find_all_links(url)
        b = ''
        for i in a:
            strre=re.compile('shortcut icon', re.IGNORECASE)
            m=strre.search(str(i))
            if m:
                b = i["href"]
        return b
    def Save_icon(self, url):
        url = self.Find_shortcut_icon(url)
        print url
        host = re.search(r'[0-9a-zA-Z]{1,20}\.[a-zA-Z]{2,4}', url).group()
        opener = urllib2.build_opener()
        icon = opener.open(url).read()
        file = open(host+'.ico', "wb")
        file.write(icon)
        file.close()
        print '%s icon successfully saved' % host
c = Founder()
print c.Save_icon('http://lala.ru')

The strangest thing is that it works for these sites: http://habrahabr.ru and http://5pd.ru

But it doesn't work for most of the others I've checked.

Answer 1

You're making it far more complicated than it needs to be. Here's a simple way to do it:

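import urllib
from BeautifulSoup import BeautifulSoup

page = urllib.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page)
# Find the <link rel="shortcut icon"> tag and download the file it points to
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())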

Answer 2

Thank you, kurd. Here is the code with some changes:

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

url = "http://www.facebook.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
icon_link = soup.find("link", rel="shortcut icon")
try:
    icon = urllib2.urlopen(icon_link['href'])
except (ValueError, urllib2.URLError):
    # The href was probably relative: resolve it against the page URL and retry
    icon = urllib2.urlopen(urlparse.urljoin(url, icon_link['href']))
# Build the file name from the host, e.g. "facebook.com.ico"
iconname = url.split(r'/')
iconname = iconname[2].split('.')
iconname = iconname[1] + '.' + iconname[2] + '.ico'
with open(iconname, "wb") as f:
    f.write(icon.read())

Answer 3

Thomas K's answer got me started in the right direction, but I found some websites that don't use rel="shortcut icon"; 1800contacts.com, for example, uses just rel="icon". This works in Python 3 and returns the link, which you can then write to a file if you want.

from bs4 import BeautifulSoup
import requests

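def getFavicon(domain):
    # Add a scheme only when one is missing; a plain substring test
    # would skip hosts such as "httpbin.org"
    if not domain.startswith(('http://', 'https://')):
        domain = 'http://' + domain
    page = requests.get(domain)
    soup = BeautifulSoup(page.text, features="lxml")
    icon_link = soup.find("link", rel="shortcut icon")
    if icon_link is None:
        icon_link = soup.find("link", rel="icon")
    if icon_link is None:
        # No <link> tag found: fall back to the conventional location
        return domain.rstrip('/') + '/favicon.ico'
    return icon_link["href"]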
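
The href returned above can be relative (for example /favicon.ico), so it may need resolving against the page URL before it can be downloaded or written to a file. Here is a minimal Python 3 sketch using only the standard library. The helper names resolve_icon_url and icon_filename are made up for illustration, and stripping a leading "www." from the file name is just one possible convention.

```python
from urllib.parse import urljoin, urlparse


def resolve_icon_url(page_url, href):
    """Turn the href from a <link> tag into an absolute URL."""
    # urljoin leaves absolute hrefs untouched and resolves relative
    # or protocol-relative ones against the page URL.
    return urljoin(page_url, href)


def icon_filename(page_url):
    """Build a file name such as 'example.com.ico' from the page's host."""
    host = urlparse(page_url).hostname or "favicon"
    # Drop a leading "www." so the name matches the bare domain.
    if host.startswith("www."):
        host = host[len("www."):]
    return host + ".ico"


print(resolve_icon_url("http://www.example.com/about/", "/favicon.ico"))
print(resolve_icon_url("http://www.example.com/about/", "//cdn.example.com/i.ico"))
print(icon_filename("http://www.example.com/about/"))
```

The absolute URL from resolve_icon_url can then be fetched with requests.get(...) and its .content written to the file named by icon_filename.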

Answer 4

Thank you, Thomas. Here is the code with some changes:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://5pd.ru/")
soup = BeautifulSoup(page.read())
icon_link = soup.find("link", rel="shortcut icon")
icon = urllib2.urlopen(icon_link['href'])
with open("test.ico", "wb") as f:
    f.write(icon.read())

Answer 5

In case anyone wants to use a single check with regex, the following works for me:

import re

from bs4 import BeautifulSoup

html_code = "<Some HTML code you get from somewhere>"

soup = BeautifulSoup(html_code, features="lxml")

for item in soup.find_all('link', attrs={'rel': re.compile("^(shortcut icon|icon)$", re.I)}):
    print(item.get('href'))

This also handles differences in letter case, because the pattern is compiled with re.I.
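
A regex is one way to do this; another option is to compare the rel tokens directly. The sketch below is an alternative under a few assumptions, not part of the answer above: rel_is_icon is a hypothetical helper that accepts either the raw attribute string or the token list that bs4 typically returns for rel. It is slightly more permissive than the regex, because it accepts any rel value that contains the token icon.

```python
def rel_is_icon(rel):
    """Return True if a rel value names a favicon.

    Accepts either the raw attribute string ("shortcut icon")
    or a list of tokens (["shortcut", "icon"]).
    """
    if rel is None:
        return False
    tokens = rel.split() if isinstance(rel, str) else list(rel)
    # Compare whole tokens, ignoring case, so "apple-touch-icon"
    # or "mask-icon" are not mistaken for a plain favicon.
    return "icon" in (token.lower() for token in tokens)


for value in ("shortcut icon", "ICON", ["apple-touch-icon"], "stylesheet", None):
    print(repr(value), rel_is_icon(value))
```

With Beautiful Soup this could be applied as [link for link in soup.find_all('link') if rel_is_icon(link.get('rel'))].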
