
Data extraction by BeautifulSoup and re

I'm trying to extract specific information from JB Hi-Fi. Here is what I did:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url="http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"

page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Item0=soup.findAll('td',{'class':'check_title'})[0]
print (Item0.renderContents())

The output is:

Apple iPod Classic 160GB (Black) 
<span class="SKU">MC297ZP/A</span>

What I want is:

Apple iPod Classic 160GB (Black)

I tried to use re to remove the other information:

 print(Item0.renderContents()).replace{^<span:,""} 

but it didn't work.

So my question is: how can I remove the extra markup and get just "Apple iPod Classic 160GB (Black)"?

Don't use .renderContents(); it's a debugging tool at best.

Just get the first child:

>>> Item0.contents[0]
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t'
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)\xc2'
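
If you don't want to re-fetch the page, one quick workaround (just a sketch, assuming the string really is UTF-8 that was decoded as Latin-1, as it appears above) is to drop the stray characters yourself before stripping; the proper fix, giving BeautifulSoup the right encoding, follows below:

>>> Item0.contents[0].replace(u'\xc2\xa0', u'').strip()  # remove the mis-decoded non-breaking space, then strip whitespace
u'Apple iPod Classic 160GB (Black)'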

It appears that BeautifulSoup hasn't quite guessed the encoding correctly, so the non-breaking space (U+00A0) shows up as two separate characters (the two UTF-8 bytes decoded individually) instead of one. It looks like BeautifulSoup guessed wrong:

>>> soup.originalEncoding
'iso-8859-1'
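
To see why the mis-detection matters: the non-breaking space is encoded in UTF-8 as the two bytes C2 A0, and decoding them as Latin-1 keeps them as two separate characters instead of combining them into one. A small illustration, independent of the page:

>>> '\xc2\xa0'.decode('utf-8')    # the two bytes collapse into one non-breaking space
u'\xa0'
>>> '\xc2\xa0'.decode('latin-1')  # each byte becomes its own character
u'\xc2\xa0'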

You can force the encoding by using the response headers; this server did set the character set:

>>> page.info().getparam('charset')
'utf-8'
>>> page=urllib2.urlopen(url)
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))
>>> Item0=soup.findAll('td',{'class':'check_title'})[0]
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)'

The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin-1, and now the non-breaking space is correctly stripped.
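
Putting it together, here is a minimal end-to-end sketch (Python 2 with BeautifulSoup 3, same URL as above; it assumes every matching cell starts with the title text, as in the question) that prints the product name from every check_title cell, not just the first:

from BeautifulSoup import BeautifulSoup
import urllib2

url = "http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"

page = urllib2.urlopen(url)
# Pass the charset declared in the response headers so BeautifulSoup
# does not have to guess the encoding.
soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))

for cell in soup.findAll('td', {'class': 'check_title'}):
    # The product name is the first (text) child of the cell;
    # the SKU sits in a separate <span> after it.
    print cell.contents[0].strip()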
