BeautifulSoup : Weird behavior with <p>

I have the following HTML content:

content  = """
<div>

  <div> <div>div A</div> </div>
  <p>P A</p>

  <div> <div>div B</div> </div>   
  <p> P B1</p>
  <p> P B2</p>

  <div> <div>div C</div> </div>
  <p> P C1 <div>NODE</div> </p>

</div>
"""

It can be pictured like this (not sure if it helps, but I like diagrams): [image]

If I use the following code:

import bs4

soup = bs4.BeautifulSoup(content, "lxml")
firstDiv = soup.div
# recursive=False returns only the direct children of the outer <div>
allElem = firstDiv.find_all(recursive=False)
for i, el in enumerate(allElem):
    print("element ", i, " : ", el)

I get this:

element  0  :  <div> <div>div A</div> </div>
element  1  :  <p>P A</p>
element  2  :  <div> <div>div B</div> </div>
element  3  :  <p> P B1</p>
element  4  :  <p> P B2</p>
element  5  :  <div> <div>div C</div> </div>
element  6  :  <p> P C1 </p>
element  7  :  <div>NODE</div>

As you can see, unlike elements 0, 2 and 5, element 6 doesn't contain its child. If I change its <p> to <b> or <div>, it behaves as expected. Why is <p> treated differently? I still have this problem (if it is one) after upgrading from 4.3.2 to 4.4.6.

p elements can only contain phrasing content, so what you have is actually invalid HTML. Here's an example of how it's parsed:

For example, a form element isn't allowed inside phrasing content, because when parsed as HTML, a form element's start tag will imply a p element's end tag. Thus, the following markup results in two paragraphs, not one:

 <p>Welcome. <form><label>Name:</label> <input></form> 

It is parsed exactly like the following:

 <p>Welcome. </p><form><label>Name:</label> <input></form> 
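
To make the rule concrete, here is a minimal sketch that re-serializes markup and closes an open <p> as soon as a block-level start tag appears. It uses only the standard library. The names ImpliedPCloser, close_implied_p and CLOSES_P are hypothetical, the tag set is a deliberately incomplete subset, and silently dropping a stray </p> is an assumption that mirrors the lxml output in the question, not a full implementation of the HTML parsing algorithm.

```python
from html.parser import HTMLParser

# Assumption: a partial list of start tags that imply </p>; the real list is longer.
CLOSES_P = {"address", "blockquote", "div", "dl", "fieldset", "form",
            "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ol", "p", "pre",
            "section", "table", "ul"}


class ImpliedPCloser(HTMLParser):
    """Hypothetical helper: echoes the markup, closing an open <p> early."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.p_open = False

    def handle_starttag(self, tag, attrs):
        # A block-level start tag implies the end of the current paragraph.
        if self.p_open and tag in CLOSES_P:
            self.out.append("</p>")
            self.p_open = False
        if tag == "p":
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            if not self.p_open:
                return  # stray </p> after an implied close: dropped in this sketch
            self.p_open = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)


def close_implied_p(markup):
    parser = ImpliedPCloser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(close_implied_p("<p>Welcome. <form><label>Name:</label> <input></form>"))
# prints: <p>Welcome. </p><form><label>Name:</label> <input></form>
```

Feeding it the line from the question, <p> P C1 <div>NODE</div> </p>, gives a <p> followed by a sibling <div>, which is the shape of elements 6 and 7 in the output above.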

You can confirm that this is how browsers parse your HTML (pictured is Chrome 64):

[image: Chrome parsing invalid HTML]

lxml handles this correctly, as does html5lib. html.parser doesn't implement much of the HTML5 spec and doesn't care about these quirks.
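
One way to see why html.parser behaves differently is to look at the standard-library module that BeautifulSoup's "html.parser" builder sits on. It only reports the tags as they are written and never inserts an implied </p>. The sketch below logs the tag events for the problematic line; EventLogger is a hypothetical name used for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records start/end tag events in source order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed("<p> P C1 <div>NODE</div> </p>")
logger.close()
print(logger.events)
# prints: [('start', 'p'), ('start', 'div'), ('end', 'div'), ('end', 'p')]
```

The <div> events fall between the start and end of <p>, so a tree built directly from these events keeps the <div> nested inside the paragraph.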

I suggest you stick to lxml and html5lib if you don't want to be frustrated in the future by these parsing differences. It's annoying when what you see in your browser's DOM inspector differs from how your code parses it.
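
If you want to check which behavior you get on your own machine, a small comparison like the following may help. It is only a sketch: it assumes bs4 is installed, skips any parser that is missing, and the helper name div_inside_p is hypothetical.

```python
import bs4

snippet = "<p> P C1 <div>NODE</div> </p>"


def div_inside_p(parser_name):
    """Return True if the first <p> still contains the <div> after parsing."""
    soup = bs4.BeautifulSoup(snippet, parser_name)
    return soup.p.find("div") is not None


results = {}
for name in ("html.parser", "lxml", "html5lib"):
    try:
        results[name] = div_inside_p(name)
    except bs4.FeatureNotFound:
        results[name] = None  # this parser is not installed

for name, nested in results.items():
    print(name, "->", nested)
```

Based on the behavior described above, html.parser should report True (the <div> stays inside the <p>), while lxml and html5lib should report False.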
