BeautifulSoup: findAll doesn't find the tags

Question

I'm sorry about the many questions I post, but I have no idea what to do about this bug. When testing this page with a simple search for p tags:

ab=soup.find("article", {"itemprop":"articleBody"})
p=ab.findAll("p")
print(len(p))  #gives 1

There are many p tags, but I get only the first. I tried to copy-paste the whole <article itemprop="articleBody"> html text into a string and passed it to a new BeautifulSoup object. Searching that object for p gave all the desired tags (14).

Why doesn't the usual approach work? Are the p tags loaded dynamically here? (The HTML code looks pretty normal.)

Answer 1

Your code returns only one p because the parsed soup contains only one paragraph inside the article. Print the element to see what was actually parsed:

ab = soup.find("article", {"itemprop": "articleBody"})
print(ab)

the output is

<article class="content link-underline relative body-copy" data-js="content" itemprop="articleBody">
<p>Not every update about a superhero movie is worthy of great attention. Take, for example, <a href="http://www.slashfilm.com/aquaman-setting/">the revelation</a> that not all of <em>Aquaman</em> will take place underwater</p></article>

Since you are searching inside the article tag, and the search ends at the closing article tag, which in the parsed output comes right after the first paragraph, len(p) is 1. That is correct for your current code.
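As a rough sketch of the check described above, one can compare how many p tags appear in the raw source with how many ended up inside the parsed article. The HTML string here is a hypothetical, well-formed stand-in (the real page is not reproduced), so both counts agree; on the page from the question the parsed count would come out lower.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; not the real article HTML.
html = """
<article itemprop="articleBody">
  <p>First paragraph.</p>
  <p class="lead">Second paragraph.</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", {"itemprop": "articleBody"})

# Number of <p> tags the parser actually placed inside <article>.
parsed_count = len(article.find_all("p"))

# Rough count of opening <p> tags in the raw text (regex, not a real parse).
raw_count = len(re.findall(r"<p[\s>]", html))

print(parsed_count, raw_count)
if parsed_count < raw_count:
    print("Fewer <p> tags were parsed than appear in the source.")
```

If the two numbers differ, the tree the parser built is not the tree you see in the browser's source view.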

Answer 2

The issue is the parser:

In [21]: req = requests.get("http://www.wired.com/2016/08/cape-watch-99/")

In [22]: soup = BeautifulSoup(req.content, "lxml")

In [23]: len(soup.select("article[itemprop=articleBody] p"))
Out[23]: 26

In [24]: soup = BeautifulSoup(req.content, "html.parser")

In [25]: len(soup.select("article[itemprop=articleBody] p"))
Out[25]: 1

In [26]: soup = BeautifulSoup(req.content, "html5lib")

In [27]: len(soup.select("article[itemprop=articleBody] p"))
Out[27]: 26

You can see that html5lib and lxml find all the p tags, but the standard html.parser does not handle the broken HTML as well. Running the article HTML through validator.w3 produces a lot of output, in particular:

solution1
1 2016-08-19 03:17:58

solution2
1 ACCEPTED 2016-08-19 10:23:04