How to extract HTML using BeautifulSoup?

Question

The HTML source is:

html = """
<td>
 <a href="/urlM5CLw" target="_blank">
  <img alt="I" height="132" src="VZhAy" width="132"/>
 </a>
 <br/>
 <cite title="mac-os-x-lion-icon-pack.en.softonic.com">
  mac-os-x-lion-icon-pac...
 </cite>
 <br/>
 <b>
  Mac
 </b>
 OS X Lion Icon Pack's
 <br/>
 535 × 535 - 135k - png
</td>"""

My Python code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
text = soup.find('td').renderContents()

With this code I get a string like:

<a href="/urlM5CLw" target="_blank"><img alt="I" height="132" src="VZhAy" width="132"/></a><br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png

But I don't want the <a>...</a> element; I just need:

<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 × 535 - 135k - png

Answer 1

Remove the <a> tag from the tree first, then fetch the contents of the <td> as before.

>>> removed = soup.find('a').extract()  # detach the <a> tag; assigning the result keeps the REPL from echoing it
>>> text = soup.find('td').renderContents()
>>> text
'<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">mac-os-x-lion-icon-pac...</cite><br/><b>Mac</b> OS X Lion Icon Pack's<br/>535 \xd7 535 - 135k - png'
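The approach above can be sketched as a self-contained script. This is a minimal illustration, not the answerer's exact code: it assumes the bs4 package, uses the built-in "html.parser" backend, shortens the question's markup, and swaps in decode_contents() (which returns a str) in place of renderContents().

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup (assumption: whitespace removed for brevity).
html = (
    '<td><a href="/urlM5CLw" target="_blank"><img alt="I" src="VZhAy"/></a>'
    '<br/><cite title="t">name</cite><br/><b>Mac</b> OS X</td>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# extract() detaches the tag from the tree and returns it,
# so the removed <a> element stays available if it is needed later.
removed = td.find("a").extract()

# decode_contents() renders the children of <td> as a str, without the <td> itself.
remaining = td.decode_contents()
print(remaining)
```

Because extract() returns the detached tag, values such as removed["href"] can still be read after the removal.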

Answer 2

You can use the Tag.decompose() method to remove the <a> tag and destroy its contents. You may also need to decode() the byte string and replace every '\n' occurrence with ''.

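from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
soup.a.decompose()  # remove the <a> tag and destroy its contents
# renderContents() returns bytes: decode to str, then drop the newlines
print(soup.td.renderContents().decode().replace('\n', ''))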

yields:

<br/><cite title="mac-os-x-lion-icon-pack.en.softonic.com">  mac-os-x-lion-icon-pac... </cite><br/><b>  Mac </b> OS X Lion Icon Pack's <br/> 535 × 535 - 135k - png
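Dropping only the newlines leaves runs of spaces from the original indentation, as the output above shows. A possible variation, offered as a sketch rather than as part of the answer, collapses every whitespace run into a single space with a regular expression. It assumes the bs4 package, the built-in "html.parser" backend, and a trimmed copy of the question's markup. Note that this also changes whitespace inside text nodes, which may not be wanted for preformatted content.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, keeping its indentation and newlines.
html = """
<td>
 <a href="/urlM5CLw" target="_blank">
  <img alt="I" height="132" src="VZhAy" width="132"/>
 </a>
 <br/>
 <b>
  Mac
 </b>
 OS X Lion Icon Pack's
</td>"""

soup = BeautifulSoup(html, "html.parser")

# decompose() removes the first <a> tag and destroys it along with its children.
soup.a.decompose()

# decode_contents() returns the inner markup of <td> as a str.
raw = soup.td.decode_contents()

# Collapse each run of whitespace (newlines, indentation) into one space.
compact = re.sub(r"\s+", " ", raw).strip()
print(compact)
```

Unlike extract(), decompose() returns nothing, so it fits when the removed element is not needed afterwards.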

2 answers

solution1
2 ACCEPTED 2015-10-23 06:49:35

solution2
0 2015-10-23 09:15:06
