Python BeautifulSoup - Getting internal links from a page

Question

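I have a basic loop that finds links on a page I retrieve with urllib2.urlopen, but I only want to follow the page's internal links.

Any ideas on how to make the loop below pick up only links that are on the same domain?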

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    webpage = urllib2.urlopen(tag['href']).read()
    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        code-to-look-for-some-data...

    except Exception, e:
        print e

Answer 1

>>> import urllib
>>> print urllib.splithost.__doc__
splithost('//host[:port]/path') --> 'host[:port]', '/path'.

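A URL belongs to the same host if its host matches the one being crawled, or if its host is empty (as is the case for relative paths).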

for tag in soupan.findAll('a', attrs={'href': re.compile("^http://")}):
    href = tag['href']
    protocol, url = urllib.splittype(href)  # 'http://www.xxx.de/3/4/5' => ('http', '//www.xxx.de/3/4/5')
    host, path = urllib.splithost(url)      # '//www.xxx.de/3/4/5' => ('www.xxx.de', '/3/4/5')
    # Skip links to other hosts. An empty or missing host means a relative path.
    # theHostToCrawl is expected to be lower-case, e.g. 'www.xxx.de'.
    if host and host.lower() != theHostToCrawl:
        continue

    webpage = urllib2.urlopen(href).read()

    print 'Deep crawl ----> ' + str(tag['href'])
    try:
        code-to-look-for-some-data...

    except:
        import traceback
        traceback.print_exc()
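
The snippets in this answer are Python 2. As a rough sketch of the same-host rule on Python 3, the check could be written with urllib.parse.urlsplit from the standard library. The function name same_host and its parameter names are made up for illustration. Note that .hostname drops any port and is already lower-cased, which differs slightly from the 'host[:port]' string that splithost returns.

```python
from urllib.parse import urlsplit


def same_host(href, host_to_crawl):
    """Return True if href has no host part or its host matches host_to_crawl.

    Hypothetical helper, not part of any library.
    """
    # .hostname is lower-cased and excludes the port; it is None when the
    # URL has no network location (relative paths, but also mailto: links,
    # which a real crawler would need to filter out separately).
    host = urlsplit(href).hostname
    return host is None or host == host_to_crawl.lower()
```

For example, same_host('/folder/file.htm', 'www.xxx.de') and same_host('http://www.xxx.de/3/4/5', 'www.xxx.de') are both true, while a link to any other host is rejected.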

Because you filter with

'href': re.compile("^http://")

relative paths are never matched, for example

<a href="/folder/file.htm"></a>

Maybe don't use re at all?
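
One possible way to drop the regex, sketched here for Python 3 under a few assumptions: collect every href first (with BeautifulSoup 4 that could be something like [a['href'] for a in soup.find_all('a', href=True)]), resolve each one against the URL of the page it came from with urllib.parse.urljoin, and only then compare hosts. The function internal_links and its parameters are hypothetical names, and the sketch takes a plain list of strings so that it runs without any third-party package.

```python
from urllib.parse import urljoin, urlsplit


def internal_links(hrefs, page_url):
    """Resolve hrefs against page_url and keep those on page_url's host.

    Hypothetical helper: hrefs is any iterable of href strings,
    page_url is the absolute URL of the page they were found on.
    """
    base_host = urlsplit(page_url).hostname
    result = []
    for href in hrefs:
        # urljoin turns '/folder/file.htm' into an absolute URL on the
        # page's own host and leaves already-absolute URLs unchanged.
        absolute = urljoin(page_url, href)
        parts = urlsplit(absolute)
        # Keep only http(s) links whose host matches; this also drops
        # mailto: and javascript: links, which have no host.
        if parts.scheme in ('http', 'https') and parts.hostname == base_host:
            result.append(absolute)
    return result
```

Because every link is made absolute before the comparison, the returned URLs can be passed straight to the code that fetches pages, which a bare relative href could not.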

Answer 2

A suggestion for your crawler: use mechanize together with BeautifulSoup; that will simplify your task.


Answer 1
2 Accepted 2012-05-03 16:27:48

Answer 2
0 2012-05-04 08:41:35
