Information missing when parsing an HTML page with BeautifulSoup

Question

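I am writing a web spider to collect some information from a website. When I parse the page http://www.tripadvisor.com/Hotels-g294265-oa120-Singapore-Hotels.html#ACCOM_OVERVIEW, some of the information is missing. I printed the HTML document with soup.prettify(), and it is not the same as the document I get from urllib2.urlopen(): part of the content is gone. The code is as follows: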

    htmlDoc = urllib2.urlopen(sourceUrl).read()
    soup = BeautifulSoup(htmlDoc)

    subHotelUrlTags = soup.findAll(name='a', attrs={'class' : 'property_title'})
    print len(subHotelUrlTags)
    #if len(subHotelUrlTags) != 30:
    #   print soup.prettify()
    for hotelUrlTag in subHotelUrlTags:
        hotelUrls.append(website + hotelUrlTag['href'])
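
One way to check whether content is lost during parsing rather than during the download is to compare a crude count taken from the raw text with the count a parser reports. The following is only a minimal sketch, assuming Python 3 and using the standard-library html.parser as a stand-in for whichever parser is under test; the names AnchorCounter, count_in_raw and count_parsed are hypothetical and not part of the original code.

```python
import re
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Counts <a> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # A valueless attribute is reported as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_in_raw(html, class_name):
    """Crude textual count of <a ... class="...class_name..."> in raw HTML."""
    pattern = (r'<a\b[^>]*\bclass=["\'][^"\']*\b'
               + re.escape(class_name) + r'\b')
    return len(re.findall(pattern, html))


def count_parsed(html, class_name):
    """Count of matching <a> tags as seen by the parser."""
    counter = AnchorCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count
```

If count_in_raw(htmlDoc, 'property_title') is larger than the number of tags the parser finds, the markup is probably tripping up the parser rather than being absent from the response.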

I tried to do the same thing with HTMLParser, and it raised the following error:

    Traceback (most recent call last):
      File "./spider_new.py", line 47, in <module>
        hotelUrls = getHotelUrls()
      File "./spider_new.py", line 40, in getHotelUrls
        hotelParser.close()
      File "/usr/lib/python2.6/HTMLParser.py", line 112, in close
        self.goahead(1)
      File "/usr/lib/python2.6/HTMLParser.py", line 164, in goahead
        self.error("EOF in middle of construct")
      File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: EOF in middle of construct, at line 3286, column 1

Answer 1

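Download and install lxml.

It can parse "broken" pages like this one. (The HTML is probably malformed in some odd way, and Python's built-in HTML parser is not good at making sense of that, even with help from bs4.)

Also, once lxml is installed you do not need to change your code: as long as no parser is named explicitly, BeautifulSoup will automatically pick lxml and use it to parse the HTML.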
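
To make the parser choice visible rather than implicit, the parser name can also be passed to the constructor. The sketch below is not the asker's code: it assumes Python 3 and the bs4 package, and pick_parser and hotel_links are hypothetical helper names. It prefers lxml, then html5lib, and falls back to the built-in html.parser, which mirrors the order bs4 uses when no parser is named.

```python
import importlib.util
from urllib.parse import urljoin


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else Python's built-in one."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


def hotel_links(html, base_url, parser=None):
    """Collect absolute URLs from <a class="property_title"> tags."""
    # Imported lazily so pick_parser() works even without beautifulsoup4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, parser or pick_parser())
    tags = soup.find_all("a", class_="property_title")
    return [urljoin(base_url, tag["href"]) for tag in tags]
```

A call such as hotel_links(htmlDoc, "http://www.tripadvisor.com") would then report links using whichever parser pick_parser() found. Naming the parser explicitly also keeps results consistent between machines that have different parsers installed.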

Solution 1
1 accepted 2013-05-07 03:49:46
