I'm trying to extract specific information from JB Hi-Fi. Here is what I did:
from BeautifulSoup import BeautifulSoup
import urllib2
import re
url="http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Item0=soup.findAll('td',{'class':'check_title'})[0]
print (Item0.renderContents())
The output is:
Apple iPod Classic 160GB (Black)Â
<span class="SKU">MC297ZP/A</span>
What I want is:
Apple iPod Classic 160GB (Black)
I tried using re to remove the other information:
print(Item0.renderContents()).replace{^<span:,""}
but it didn't work.
So my question is: how can I remove the unwanted markup and get just "Apple iPod Classic 160GB (Black)"?
Don't use .renderContents(); it's a debugging tool at best.
Just get the first child:
>>> Item0.contents[0]
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t'
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)\xc2'
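The stray \xc2 shows that the page was decoded with the wrong encoding: the non-breaking space (U+00A0) is encoded in UTF-8 as the two bytes 0xC2 0xA0, and those bytes have been decoded as two separate characters instead of one. .strip() then removes the \xa0 but leaves the \xc2 behind. Checking the encoding BeautifulSoup picked confirms that it guessed wrong: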
>>> soup.originalEncoding
'iso-8859-1'
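The effect of the wrong guess can be reproduced without BeautifulSoup or a network connection. The following is a minimal sketch, not part of the original session; the sample string and the names raw, wrong and right are made up for illustration. It encodes a title followed by a non-breaking space as UTF-8 and then decodes the same bytes both ways:

```python
# Hypothetical sample: the title followed by a non-breaking space and
# trailing whitespace, encoded the way the server sends it (UTF-8).
raw = u'Apple iPod Classic 160GB (Black)\u00a0\r\n\t'.encode('utf-8')

# Decoding as Latin-1 turns the two bytes 0xC2 0xA0 into two characters.
wrong = raw.decode('iso-8859-1')
# Decoding as UTF-8 yields a single U+00A0, which strip() treats as whitespace.
right = raw.decode('utf-8')

print(repr(wrong.strip()))  # still ends with the stray \xc2 character
print(repr(right.strip()))  # clean title
```

With the Latin-1 decoding, only the \xa0 half of the pair counts as whitespace, which is why the \xc2 survives the strip.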
You can force the encoding by using the response headers; this server did set the character set:
>>> page.info().getparam('charset')
'utf-8'
>>> page=urllib2.urlopen(url)
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))
>>> Item0=soup.findAll('td',{'class':'check_title'})[0]
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)'
The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin 1, and now the non-breaking space is correctly stripped.
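The same idea (take the charset from the Content-Type header rather than guessing) can be sketched with the standard library alone, which may help on Python 3, where urllib2 and getparam are not available. This is an illustrative sketch under assumptions: charset_from_content_type is a hypothetical helper, and the header value and body are hard-coded stand-ins for what the server would send.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the lower-cased charset, or the
    # fallback when the header carries no charset parameter.
    return msg.get_content_charset(default)


# Stand-ins for the real response: UTF-8 bytes and the header the server set.
body = u'Apple iPod Classic 160GB (Black)\u00a0\r\n'.encode('utf-8')
charset = charset_from_content_type('text/html; charset=utf-8',
                                    default='iso-8859-1')

# Decode with the declared charset, then strip the trailing whitespace.
title = body.decode(charset).strip()
print(repr(title))
```

Decoding the bytes yourself and handing the parser a Unicode string is an alternative to fromEncoding; either way the declared charset, not a guess, decides how the bytes are read.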