BeautifulSoup: findAll doesn't find the tags

Sorry for posting so many questions, but I have no idea what to do about this bug. When testing this page with a simple search for p tags:

ab=soup.find("article", {"itemprop":"articleBody"})
p=ab.findAll("p")
print(len(p))  #gives 1

There are many p tags, but I get only the first. I tried to copy-paste the whole <article itemprop="articleBody"> html text into a string and passed it to a new BeautifulSoup object. Searching that object for p gave all the desired tags (14).

Why doesn't the usual approach work? Are the p tags loaded dynamically here (even though the HTML looks pretty normal)?

Your code returns only one p because the article element, as parsed by soup, contains only one paragraph. Print what was parsed to see this:

ab = soup.find("article", {"itemprop": "articleBody"})
print(ab)

The output is:

<article class="content link-underline relative body-copy" data-js="content" itemprop="articleBody">
<p>Not every update about a superhero movie is worthy of great attention. Take, for example, <a href="http://www.slashfilm.com/aquaman-setting/">the revelation</a> that not all of <em>Aquaman</em> will take place underwater</p></article>

Because you are searching within the article tag, and the parsed article is closed right after the first paragraph, the search stops there. It therefore returns 1 as the length of p, which is correct for your current code.
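One way to check whether the missing paragraphs were lost during parsing rather than loaded dynamically is to compare three counts: p tags in the raw markup, in the whole parsed document, and inside the article. This is only a rough sketch; the helper name paragraph_counts is hypothetical, and it assumes html is a str (for example req.text rather than req.content).

```python
import re

from bs4 import BeautifulSoup


def paragraph_counts(html, parser="html.parser"):
    """Count <p> tags in the raw text, the parsed document, and the article.

    Hypothetical helper for diagnosis; `html` is assumed to be a str.
    """
    soup = BeautifulSoup(html, parser)
    article = soup.find("article", {"itemprop": "articleBody"})
    return {
        # Rough count of opening <p> tags in the unparsed markup.
        "raw": len(re.findall(r"<p[\s>]", html, flags=re.IGNORECASE)),
        # All <p> tags anywhere in the parsed tree.
        "document": len(soup.find_all("p")),
        # Only the <p> tags the parser placed inside the article.
        "article": len(article.find_all("p")) if article is not None else 0,
    }
```

If "raw" and "document" are both large while "article" is 1, the paragraphs are present in the response and the parser most likely placed them outside the article element. If "raw" itself is small, the content is probably added later by JavaScript.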

The issue is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1

In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml get all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article HTML through the W3C validator (validator.w3.org) produces a lot of output, in particular:

[screenshot of the validator output]
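To repeat this comparison on another page, the same selector can be run against each parser in a loop. The following is a minimal sketch, not a definitive recipe: the function name count_by_parser is hypothetical, and lxml and html5lib are optional packages, so a parser that is not installed is reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_by_parser(html, selector, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of elements matching selector}.

    Hypothetical helper; a parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser is unavailable.
            counts[name] = None
            continue
        counts[name] = len(soup.select(selector))
    return counts
```

For example, count_by_parser(req.content, "article[itemprop=articleBody] p") would produce the three counts from the session above in one call. When the counts disagree, the markup is probably malformed, and switching to lxml or html5lib is the simplest fix.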
