Beautiful Soup 4's extract() changes the tag to a NoneType

I am trying to scrape the name, price and description of an item from a webpage.

Here is the HTML:

...
<div id="ProductDesc">
                            <a href="javascript:void(0);" onClick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
                            <h5 id="productPrice">$42.00</h5>
                            <br style="clear:both;" /><br />
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
...

Here is the code I have so far:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
price = line.h5.extract()
print price.get_text()
desc = line.get_text()
print desc

It outputs:

Split Sport Longsleeve T-shirt
$42.00

Then it raises this error:

Traceback (most recent call last):
  ...
  File "/home/myfile.py", line 35, in siftInfo
    print line.get_text()
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 901, in get_text
    strip, types=types)])
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 876, in _all_strings
    for descendant in self.descendants:
  File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1273, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

I would like the output:

Split Sport Longsleeve T-shirt
$42.00
Style # 53TD4141 Screenprinted longsleeve cotton tee.

Note:

If I use print line instead of print line.get_text(), it prints:

Split Sport Longsleeve T-shirt
$42.00
<div id="ProductDesc">
                            <a href="javascript:void(0);" onclick="loadStyle('imageView','http://tendeep.vaesite.net/__data/03cc09aa3700a50b17caf5963821f603.jpg', '',  '',  '')"></a>

                            <br style="clear:both;"/><br/>
                        Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>

Edit 1:

If I omit the two lines that handle the price and collapse the whitespace, I get this:

New Code:

line = soup.find(id="ProductDesc")
name = line.h5.extract()
print name.get_text()
desc = line.get_text()
print ' '.join(desc.split())

Output:

Split Sport Longsleeve T-shirt
$42.00 Style # 53TD4141 Screenprinted longsleeve cotton tee.

So, the second line.h5.extract() is somehow changing the type of line, but the first one is not.

This does not format well in a comment, so I am posting it here. This is the code I ran and the output I got:

from bs4 import BeautifulSoup
import urllib.request  # imported as a module so that urllib.request.urlopen() below resolves

def mainTest():
    url = "http://10deep.com/store/split-sport-longsleeve-t-shirt"
    print("url is: " + url)
    page=urllib.request.urlopen(url)

    soup = BeautifulSoup(page.read())
    line = soup.find(id="ProductDesc")
    name = line.h5.extract()
    print(name.get_text())
    price = line.h5.extract()
    print(price.get_text())
    desc = line.get_text()
    print(desc)

mainTest()

Output:

C:\Python34\python.exe C:/{path}/testPython.py
url is: http://10deep.com/store/split-sport-longsleeve-t-shirt
Split Sport Longsleeve T-shirt
$42.00




                        Style # 53TD4141 Screenprinted longsleeve cotton tee.

Process finished with exit code 0
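
Both versions above call extract(), which modifies the tree while it is being read. As a rough sketch of an alternative that leaves the tree untouched, the three values can be read directly: the two h5 tags by their id, and the description from the text nodes that are direct children of the div. This assumes the built-in "html.parser" and a trimmed copy of the HTML from the question (the onClick attribute is dropped). The HTML constant is illustrative, and sift_info is a hypothetical helper that only borrows its name from the siftInfo frame in the traceback.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the HTML from the question (assumption: onClick attribute omitted).
HTML = """
<div id="ProductDesc">
    <a href="javascript:void(0);"><h5 id="productTitle">Split Sport Longsleeve T-shirt</h5></a>
    <h5 id="productPrice">$42.00</h5>
    <br style="clear:both;" /><br />
    Style # 53TD4141 Screenprinted longsleeve cotton tee.</div>
"""


def sift_info(html):
    """Return (name, price, description) without modifying the parsed tree."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find(id="ProductDesc")

    # Look the two headings up by id instead of extracting them one by one.
    name = block.find(id="productTitle").get_text(strip=True)
    price = block.find(id="productPrice").get_text(strip=True)

    # The description is the only non-blank text node that is a direct
    # child of the div; text inside the h5 tags is nested deeper.
    desc = " ".join(
        child.strip()
        for child in block.children
        if isinstance(child, NavigableString) and child.strip()
    )
    return name, price, desc


if __name__ == "__main__":
    for value in sift_info(HTML):
        print(value)
```

Since nothing is removed from the tree, the result does not depend on what extract() leaves behind, and line.get_text() can still be called afterwards if the full text is needed.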
