I am trying to scrape the name, price and description of an item from a webpage.
Here is the HTML:
...
<div id="ProductDesc">
<a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...
Here is the code I have so far:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc
It outputs:
Split Sport Longsleeve T-shirt
$42.00
Then the error:
Traceback (most recent call last):
...
File "/home/myfile.py", line 35, in siftInfo
print line.get_text()
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
strip, types=types)])
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
for descendant in self.descendants:
File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'
I would like the output:
Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.
Note:
If I print line instead of print line.get_text(), it returns:
Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
<a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '', '', '')"></a>
<br style="clear:both;"/><br/>
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
Edit 1:
If I omit the two lines concerning price and add some parsing out of white space then I get this:
New Code:
line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print (' ').join(desc.split())
Output:
Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.
So the second line.h5.extract() is somehow changing the type of line, but the first one is not.
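To check the guess about the type of line, here is a minimal sketch that runs the same two extract() calls on a trimmed copy of the markup above. It assumes a current bs4 release and the built-in "html.parser"; the function name extract_demo is made up for illustration. It does not try to reproduce the traceback. It only shows that extract() removes a tag from the tree and returns it, so each line.h5 lookup finds the next remaining h5, while line itself stays a Tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the markup from the question (the onClick attribute is omitted).
HTML = """<div id="ProductDesc">
<a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>"""


def extract_demo(html):
    """Hypothetical helper: repeat the two extract() calls and report what is left."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    line = soup.find(id="ProductDesc")
    name = line.h5.extract()   # first h5 in the div (the title, nested in the <a>)
    price = line.h5.extract()  # the title is gone, so this is now the price h5
    remaining_h5 = line.h5     # no h5 tags are left in the div
    return line, name, price, remaining_h5


line, name, price, remaining_h5 = extract_demo(HTML)
print(name.get_text())
print(price.get_text())
print(isinstance(line, Tag))  # line is still a Tag after both extractions
print(remaining_h5)
```

In this sketch the div is mutated in place, but its type does not change; only its contents shrink.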
Since it doesn't format well in a comment, I am putting it here. This is the code I ran and the output I got:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def mainTest():
    url = "http://10deep.com/store/split-sport-longsleeve-t-shirt"
    print("url is: " + url)
    # urlopen is imported by name above, so call it without the urllib.request prefix
    page = urlopen(url)
    soup = BeautifulSoup(page.read())
    line = soup.find(id="ProductDesc")
    # Each extract() removes the first remaining h5 from the div and returns it
    name = line.h5.extract()
    print(name.get_text())
    price = line.h5.extract()
    print(price.get_text())
    # Whatever text is left in the div is the description
    desc = line.get_text()
    print(desc)

mainTest()
Output
C:\Python34\python.exe C:/{path}/testPython.py
url is: http://10deep.com/store/split-sport-longsleeve-t-shirt
Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.
Process finished with exit code 0
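
As an alternative that avoids modifying the tree at all, the three fields can be read without extract(). This is only a sketch under assumptions: it uses the ids productTitle and productPrice from the HTML in the question, treats the description as the text sitting directly inside the div, and uses the built-in "html.parser". The function name sift_info is hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Trimmed copy of the markup from the question (the onClick attribute is omitted).
HTML = """<div id="ProductDesc">
<a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
<h5 id="productPrice">$42.00</h5>
<br style="clear:both;" /><br />
Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>"""


def sift_info(html):
    """Hypothetical helper: return (name, price, description) without mutating the tree."""
    soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
    block = soup.find(id="ProductDesc")
    name = block.find(id="productTitle").get_text(strip=True)
    price = block.find(id="productPrice").get_text(strip=True)
    # Keep only strings that are direct children of the div; this skips the
    # text nested inside the <a> and <h5> tags.
    direct_text = [child for child in block.children
                   if isinstance(child, NavigableString)]
    # Collapse the newlines and runs of whitespace into single spaces.
    desc = " ".join("".join(direct_text).split())
    return name, price, desc


name, price, desc = sift_info(HTML)
print(name)
print(price)
print(desc)
```

Looking the tags up by id means the result does not depend on the order of the h5 elements, and the div can still be printed or searched afterwards.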