
Need help understanding "recursive" with BeautifulSoup in python

I've been struggling for several hours to understand "recursive" with BeautifulSoup in Python. Please help me out! I've read the official documentation and lots of questions, but I still don't get it.

from bs4 import BeautifulSoup
s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, 'html.parser')
  1. print(soup.find("p", recursive=False)) gives None

Is it because we can't find anything anymore outside of <div></div> ?

  2. print(soup.find("p").find(recursive=False)) gives <strong>A</strong>

If what I thought in the first question was correct, I would have guessed this gives <p>B</p>, because we can't go any deeper. But why does the result start from <strong>? Why not <p>?

Also, how can I extract <p>B</p> ?

When you use recursive=False , it means to only search immediate children of the element that you're calling .find() or .find_all() on. The only immediate child of the top-level soup object is the <div> element. Since it's not a <p> element, it doesn't match the name given, so nothing is found.

In your second example, you first use a recursive search to find the <p> element. Then you call .find() with no name, so it matches any element name. Since you specified recursive=False, it only considers immediate children of <p>. The first child element is <strong>A</strong>, and that's what is returned.
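The two cases above can be reproduced with the snippet from the question. This is a minimal sketch; the variable names (top_level_p, p_from_div, first_child_of_p) are only illustrative, and the middle call is added to show that the same non-recursive search succeeds once it starts from the <div>.

```python
from bs4 import BeautifulSoup

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")

# The soup object's only direct child is <div>, so a non-recursive
# search for <p> at the top level finds nothing.
top_level_p = soup.find("p", recursive=False)

# Starting from <div> instead, <p> is a direct child and is found.
p_from_div = soup.div.find("p", recursive=False)

# With no name given, find() matches any tag; the first direct
# child tag of <p> is <strong>.
first_child_of_p = soup.find("p").find(recursive=False)

print(top_level_p)
print(p_from_div)
print(first_child_of_p)
```

The search always starts at the element you call .find() on, so changing the starting element changes what counts as an "immediate child".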

recursive=False restricts the search to the direct children of the tag you call it on. For example:

<li>
    <p>1</p>
    <p>2</p>
    <div>
      <p>3</p>
    </div>
</li>

li = soup.find('li')

Now,

print(li.findChildren("p"))

prints [<p>1</p>, <p>2</p>, <p>3</p>]

print(li.findChildren("p", recursive=False))

prints [<p>1</p>, <p>2</p>]
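For reference, here is a self-contained version of the example above. It is a sketch that assumes the markup is held in a string named html (a name chosen here for illustration) and uses find_all, of which findChildren is an older alias.

```python
from bs4 import BeautifulSoup

# The markup from the example above; "html" is just an illustrative name.
html = """
<li>
    <p>1</p>
    <p>2</p>
    <div>
      <p>3</p>
    </div>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# Default (recursive=True): every <p> anywhere below <li>.
all_p = li.find_all("p")

# recursive=False: only <p> tags that are direct children of <li>.
direct_p = li.find_all("p", recursive=False)

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
```

The <p>3</p> element is skipped in the second call because it sits inside the nested <div>, one level too deep.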


In order to get <p>B</p> from <div>C<p><strong>A</strong>B</p></div> :

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, 'html.parser')
soup.strong.decompose()
print(soup.p)

prints <p>B</p>


Explanation:

print(soup.strong)

prints <strong>A</strong>

soup.strong.decompose()

removes <strong>A</strong> from the tree (see decompose() in the Beautiful Soup documentation)

print(soup.p)

prints <p>B</p>
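Note that decompose() modifies the parsed tree in place. If the original document is still needed, one possible variation (not part of the answer above) is to work on a copy of the <p> tag, or to read only the text that is a direct child of <p>. The names p_copy and direct_text below are illustrative.

```python
import copy

from bs4 import BeautifulSoup

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")

# Copy the <p> tag so the original tree stays untouched,
# then remove <strong> from the copy only.
p_copy = copy.copy(soup.p)
p_copy.strong.decompose()
print(p_copy)

# Alternatively, fetch just the text node that is a direct child of <p>.
direct_text = soup.p.find(string=True, recursive=False)
print(direct_text)

# The original document is unchanged.
print(soup)
```

Which variant fits depends on whether you need a <p> tag back or only its own text.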

HTML documents are nested: tags contain other tags and text. In the document you've provided ('s'), the structure looks like:

div
   text node "C"
   p
     strong
        text node "A"
     text node "B"

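The recursive argument tells Beautiful Soup how deep to search below the node you call find() on: with the default recursive=True it checks all descendants (children, grandchildren, and so on); with recursive=False it checks only the direct children.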

  1. The soup object has only one child, the root node (div). Because you told Beautiful Soup NOT to search recursively, it checks that div itself but does not look at the div's children, so it returns None: there is no 'p' element at the root level.

  2. This is actually two 'find' calls chained together. The first 'find' looks for a 'p' (recursively, since the default for recursive is True) and finds the 'div > p', as we'd expect. You then call 'find' AGAIN on that result with no tag name, so it matches any tag. The 'p' has two children, the 'strong' tag and the text node 'B', but 'find' with no arguments matches only tags, so the 'strong' tag is what is returned.
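The difference between direct children and all descendants can be inspected directly. A small sketch using the .contents and .descendants attributes on the question's document:

```python
from bs4 import BeautifulSoup

s = "<div>C<p><strong>A</strong>B</p></div>"
soup = BeautifulSoup(s, "html.parser")
p = soup.p

# Direct children of <p>: the <strong> tag and the text node "B".
# This is the set a recursive=False search looks at.
children = [str(node) for node in p.contents]

# All descendants of <p>: also includes the text "A" inside <strong>.
# This is the set a default (recursive=True) search looks at.
descendants = [str(node) for node in p.descendants]

print(children)
print(descendants)
```

Text nodes appear in both lists, which is why a name-less find() skipping them matters for the result in the second case.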
