
How to extract the content using BeautifulSoup

I want to extract the product name and price from a website using BeautifulSoup, but I don't know how to get only the text I need without the surrounding markup.

Python code:

from bs4 import BeautifulSoup
import re

# Triple quotes let the HTML literal span several lines
div = '''<div pagetype="simple_table_nonFashion" class="itemBox"
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num"
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9"
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p
class="proName clearfix"><a id="pdlink2_679026" pmid="0"
href="//item.yhd.com/679026.html"><style
type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''

soup = BeautifulSoup(div, "lxml")
itemBox = soup.find("div", {"class": "itemBox"})
proPrice = itemBox.find("p", {"class": "proPrice"}).find("em").text
pdlink2 = itemBox.find('a', {"id": re.compile(r'^pdlink2_')}).text
print(proPrice)
print(pdlink2)

The printed result:

¥49.90
.preSellOrAppoint {border: 1px solid #FFFFFF;}印尼进口


My expected result is only the text content:

49.90
印尼进口

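Use the soup.select_one() method with CSS selectors, then take the last child node of each matched tag with .contents[-1]. In both tags the wanted text is the last child, after the nested <b> or <style> tag: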

from bs4 import BeautifulSoup

div = '''<div pagetype="simple_table_nonFashion" class="itemBox" 
id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" 
id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" 
productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p 
class="proName clearfix"><a id="pdlink2_679026" pmid="0" 
href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint 
{border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'''

soup = BeautifulSoup(div, "lxml")
proPrice = soup.select_one("p.proPrice em").contents[-1]
pdlink2 = soup.select_one('p.proName > a').contents[-1]

print(proPrice)
print(pdlink2)

The output:

49.90
印尼进口

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
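
If the wanted text is not guaranteed to be the last child, .contents[-1] can pick the wrong node. A minimal sketch of a slightly more defensive variant is to keep only the text nodes that are direct children of the tag. The helper name direct_text is hypothetical, the HTML is trimmed down from the question, and "html.parser" is used here instead of "lxml" only so the sketch runs without an extra dependency:

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down version of the HTML from the question
html = (
    '<div class="itemBox"><p class="proPrice"><em class="num"><b>¥</b>49.90</em></p>'
    '<p class="proName clearfix"><a id="pdlink2_679026" href="//item.yhd.com/679026.html">'
    '<style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>'
    '印尼进口</a></p></div>'
)


def direct_text(tag):
    """Join the text nodes that are direct children of tag (hypothetical helper).

    Text inside nested tags such as <b> or <style> is skipped, because
    those strings are children of the nested tag, not of tag itself.
    """
    parts = [child for child in tag.contents if isinstance(child, NavigableString)]
    return "".join(parts).strip()


soup = BeautifulSoup(html, "html.parser")
em = soup.select_one("p.proPrice em")
a = soup.select_one("p.proName > a")

print(direct_text(em))
print(direct_text(a))
```

For this HTML the helper returns the same two strings as .contents[-1]; the difference only shows up when the nested tag comes after the text or when there are several text nodes.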

Here's an alternative based on the HTML you provided:

from bs4 import BeautifulSoup
import re

div = '<div pagetype="simple_table_nonFashion" class="itemBox" id="itemSearchResultCon_679026"><p class="proPrice"><em class="num" id="price0_679026" productid="679026" adproductflag="0" yhdprice="49.9" productunit="" diapernum="0" diapernumunit=""><b>¥</b>49.90</em></p><p class="proName clearfix"><a id="pdlink2_679026" pmid="0" href="//item.yhd.com/679026.html"><style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>印尼进口</a></p></div>'

soup = BeautifulSoup(div, "lxml")
proPrice = soup.b.next_sibling
pdlink2 = soup.style.next_sibling
print(proPrice)
print(pdlink2)

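.next_sibling returns the node that immediately follows a tag at the same level. Here that is the text node right after the <b> tag and right after the <style> tag, which is exactly the price and the product name. Note that soup.b and soup.style return only the first <b> and <style> in the whole document, so this works as written for a single item.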
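
For a page with several result boxes, the same idea can be scoped to each box. The following is only a sketch under assumptions: the second item in the HTML is made up for illustration, text_after is a hypothetical helper, and "html.parser" stands in for "lxml" so that nothing beyond bs4 is required:

```python
from bs4 import BeautifulSoup, NavigableString

# Two result boxes; the second one is invented for illustration
html = (
    '<div class="itemBox"><p class="proPrice"><em class="num"><b>¥</b>49.90</em></p>'
    '<p class="proName clearfix"><a id="pdlink2_679026">'
    '<style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>'
    '印尼进口</a></p></div>'
    '<div class="itemBox"><p class="proPrice"><em class="num"><b>¥</b>12.50</em></p>'
    '<p class="proName clearfix"><a id="pdlink2_000001">'
    '<style type="text/css">.preSellOrAppoint {border: 1px solid #FFFFFF;}</style>'
    'Sample item</a></p></div>'
)


def text_after(tag):
    """Return the stripped text node right after tag, or None (hypothetical helper)."""
    sibling = tag.next_sibling if tag is not None else None
    if isinstance(sibling, NavigableString):
        return str(sibling).strip()
    return None


soup = BeautifulSoup(html, "html.parser")
items = []
for box in soup.select("div.itemBox"):
    # Search inside the current box only, not the whole document
    price = text_after(box.select_one("p.proPrice b"))
    name = text_after(box.select_one("p.proName style"))
    items.append((price, name))

print(items)
```

Returning None instead of raising keeps the loop going when a box lacks the <b> or <style> tag; whether that is the right behaviour depends on how the results are used.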
