Data extraction with BeautifulSoup and re

Question

I want to extract specific information from JB Hi-Fi. This is what I have done:

from BeautifulSoup import BeautifulSoup
import urllib2
import re



url="http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Item0=soup.findAll('td',{'class':'check_title'})[0]    
print (Item0.renderContents())

The output is:

Apple iPod Classic 160GB (Black)Â 
<span class="SKU">MC297ZP/A</span>

What I want is:

Apple iPod Classic 160GB (Black)

I tried using re to remove the extra information:

 print(Item0.renderContents()).replace{^<span:,""}

But it does not work.

So my question is: how do I remove the useless information and get "Apple iPod Classic 160GB (Black)"?

Answer 1

Don't use .renderContents(); it is at best a debugging tool.

Just get the first child:

>>> Item0.contents[0]
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t'
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)\xc2'

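BeautifulSoup did not guess the encoding correctly, so the two UTF-8 bytes of the non-breaking space (U+00A0) show up as two separate characters instead of one. The detected encoding confirms the wrong guess: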

>>> soup.originalEncoding
'iso-8859-1'
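
To see why the stray \xc2 survives .strip(), here is a minimal sketch using only the standard library. It is written for Python 3 (the session above is Python 2 with BeautifulSoup 3), and the sample string is simply copied from the output above rather than fetched from the site.

```python
# Python 3 sketch; the sample text is copied from the REPL output above.
title = "Apple iPod Classic 160GB (Black)\u00a0\r\n\t"

# The server sends UTF-8, so U+00A0 travels as the two bytes C2 A0.
raw = title.encode("utf-8")

# Latin-1 maps every byte to one character: C2 becomes U+00C2, A0 becomes U+00A0.
wrong = raw.decode("iso-8859-1")

# UTF-8 reads C2 A0 back as the single non-breaking space.
right = raw.decode("utf-8")

# strip() treats U+00A0 as whitespace but not U+00C2, so one stray
# character is left behind when the bytes were decoded as Latin-1.
print(repr(wrong.strip()))
print(repr(right.strip()))
```

The first result still ends in the leftover U+00C2 character, the second is the clean title.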

You can force the encoding using the response headers; this server does set the character set:

>>> page.info().getparam('charset')
'utf-8'
>>> page=urllib2.urlopen(url)
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))
>>> Item0=soup.findAll('td',{'class':'check_title'})[0]
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)'

The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin-1, and now the non-breaking space is stripped correctly.
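
For readers on Python 3, the same idea (take the charset from the Content-Type header and decode with it) can be sketched with the standard library alone. The helper names charset_from_content_type and decode_body are made up for this illustration, and the Latin-1 fallback is an assumption chosen to mirror the wrong guess above, not something the server or BeautifulSoup prescribes.

```python
from email.message import Message


def charset_from_content_type(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper; falls back to `default` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body with the charset announced in the header."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Bytes as the server would send them: the title followed by a UTF-8 NBSP.
body = b"Apple iPod Classic 160GB (Black)\xc2\xa0\r\n"
text = decode_body(body, "text/html; charset=utf-8")
print(repr(text.strip()))
```

On a live Python 3 response, urllib.request.urlopen(url).headers.get_content_charset() returns the same value directly, and in bs4 the keyword is spelled from_encoding rather than fromEncoding.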

Answer 1: 2, accepted, 2013-06-01 10:38:18
