
Beautiful Soup does not get full div

BeautifulSoup does something weird and I can't figure out why.

import requests
from bs4 import BeautifulSoup

url = "nsfw"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cards = soup.find_all("div", {"class": "card-body"})
cards.pop(0)
cards.pop(0)
cards.pop(0)  # i really like to pop
texte = []
print(soup)
for i, card in enumerate(cards):
    texte.append(card.text)
    if i == len(cards)-1:
        print(card)

What I expect is that it gets the divs and puts the text of each div into the list. And it does work, for the first 8 of the 9 divs. The 9th div is extremely shortened. Result of the print:

<div class="card-body" id="card_Part_9"><p class="storytext"><span class="brk2_firstwords">“Door’s open,” Brendan shouted.</span></p>
    <p class="storytext">Jeffrey</p></div>    

But on the website itself it doesn't end there. Here is a screenshot: https://i.imgur.com/CmvYzfJ.png

Why does this happen? What can I do to prevent this? I have already tried to change the parser, but that does not change the result. The site does not use Javascript to load content.

Structure when opening with a browser: https://pastebin.com/N2bPYFBD

But when I print(soup) I get:

<p class="storytext">Jeffrey</p></div></div></div></div></div></div></div></body></html> entered the apartment
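One possible reading of that output is that the markup closes its elements before the story text ends, so the rest of the text lands outside every tag. Here is a minimal sketch for checking that on the raw HTML, using only the standard library. The names `StrayTextFinder`, `find_stray_text` and `sample` are made up for illustration, and the sample string is a shortened stand-in for the real page, not its actual source.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrayTextFinder(HTMLParser):
    """Collect text that shows up after every open element has been closed."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.stray_text = []  # (position, text) pairs found outside all elements

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close back to the most recent matching open tag; ignore unmatched closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Non-blank text with an empty stack sits outside every element.
        if not self.open_tags and data.strip():
            self.stray_text.append((self.getpos(), data.strip()))


def find_stray_text(html):
    """Return a list of ((line, column), text) for text outside all elements."""
    finder = StrayTextFinder()
    finder.feed(html)
    finder.close()
    return finder.stray_text


# Hypothetical, shortened stand-in for the page (not the real source).
sample = ('<html><body><div class="card-body"><p>Jeffrey</p></div>'
          '</body></html> entered the apartment')
print(find_stray_text(sample))
```

On the real page you would pass `r.text` instead of `sample`. If this reports text, the problem is in the HTML the server sends, and the fix is to choose a parser that repairs it the way a browser does, or to clean the markup first.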

Thought I could post my scribble as well:

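from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('six-pack-thingy')
# find_elements_by_class_name was removed in Selenium 4; use find_elements(By.CLASS_NAME, ...)
elems = driver.find_elements(By.CLASS_NAME, 'card-body')

# Skip the first three cards, as in the question
texte = [t.text for t in elems[3:]]
driver.quit()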

You will have to get some webdriver to run selenium, though. Are you familiar with that?

It seems html.parser mishandles this page's markup. The lxml parser works for me:

import requests
from bs4 import BeautifulSoup

url = "six-pack-thingy"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
cards = soup.find_all("div", {"class": "card-body"})
texte = [card.text for card in cards[3:]]
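Since the question says that changing the parser made no difference while lxml works here, it may help to run the same HTML through every installed parser and compare the results. This is a rough sketch, not a definitive test: `last_card_text_by_parser` and `sample` are made-up names, the sample is a shortened stand-in for the real page, and parsers that are not installed are reported as `None`.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def last_card_text_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the text of the last card-body div.

    The value is None when that parser is not installed.
    """
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            results[parser] = None
            continue
        cards = soup.find_all("div", {"class": "card-body"})
        results[parser] = cards[-1].get_text() if cards else ""
    return results


# Hypothetical, shortened stand-in for the page (not the real source).
sample = ('<html><body><div class="card-body"><p>Jeffrey</p></div>'
          '</body></html> entered the apartment')

for name, text in last_card_text_by_parser(sample).items():
    print(name, repr(text))
```

With the real page you would pass `r.text`. If every parser returns the same shortened text, the parser is probably not the culprit and the raw response is worth inspecting directly.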
