BeautifulSoup does something weird and I can't figure out why.
import requests
from bs4 import BeautifulSoup
url = "nsfw"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cards = soup.find_all("div", {"class": "card-body"})
cards.pop(0)
cards.pop(0)
cards.pop(0) # i really like to pop
texte = []
print(soup)
for i, card in enumerate(cards):
    texte.append(card.text)
    if i == len(cards)-1:
        print(card)
What I expect is that it collects the divs and puts the text of each div into the list. And it does work, for the first 8 of the 9 divs. The 9th div is extremely shortened. Result of the print:
<div class="card-body" id="card_Part_9"><p class="storytext"><span class="brk2_firstwords">“Door’s open,” Brendan shouted.</span></p>
<p class="storytext">Jeffrey</p></div>
But on the website itself it doesn't end there. Here is a screenshot: https://i.imgur.com/CmvYzfJ.png
Why does this happen? What can I do to prevent this? I have already tried changing the parser, but that does not change the result. The site does not use JavaScript to load content.
Structure when opening with a browser: https://pastebin.com/N2bPYFBD
But when I print(soup) I get:
<p class="storytext">Jeffrey</p></div></div></div></div></div></div></div></body></html> entered the apartment
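One way to narrow this down is to check whether the missing text is already absent from the raw response or only disappears after parsing. Below is a minimal sketch, assuming the two strings would come from `r.text` and `card.text` in the code above; the helper name `where_is_text_lost` and its return labels are made up for illustration and are not part of BeautifulSoup.

```python
def where_is_text_lost(raw_html: str, extracted_text: str, needle: str) -> str:
    """Report the stage at which `needle` disappears (hypothetical helper)."""
    if needle not in raw_html:
        # The server never sent it, or sent it encoded differently.
        return "not in response"
    if needle not in extracted_text:
        # The download was fine, so the text was dropped or moved while parsing.
        return "lost during parsing"
    return "present"


# Tiny stand-in for the situation above: the raw markup still contains the
# sentence, but the text extracted from the card stops early.
raw = '<div class="card-body"><p>Jeffrey</p></div> entered the apartment'
print(where_is_text_lost(raw, "Jeffrey", "entered the apartment"))
```

With the real page the call would look like `where_is_text_lost(r.text, cards[-1].text, "entered the apartment")`. A plain-ASCII needle is probably the safer choice, since curly quotes or entities may be encoded differently in the raw HTML than in the extracted text.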
Thought I could post my scribble as well:
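from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('six-pack-thingy')
# find_elements_by_class_name was removed in Selenium 4; use a By locator instead
elems = driver.find_elements(By.CLASS_NAME, 'card-body')
texte = [t.text for t in elems[3:]]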
You will have to get some webdriver to run selenium, though. Are you familiar with that?
Seems like html.parser messes up the DOM. The lxml parser works for me:
import requests
from bs4 import BeautifulSoup
url = "six-pack-thingy"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
cards = soup.find_all("div", {"class": "card-body"})
texte = [card.text for card in cards[3:]]
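Since the parsers can disagree on broken markup, it may help to run the same HTML through every parser that happens to be installed and compare what each one extracts. This is only a sketch: `compare_parsers` is a hypothetical helper name, and the sample markup is a stand-in for `r.text`. It relies on bs4 raising `FeatureNotFound` when a requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the list of card-body texts it yields.

    A parser that is not installed maps to None (hypothetical helper).
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # bs4 could not find a tree builder for this parser name.
            results[name] = None
            continue
        cards = soup.find_all("div", class_="card-body")
        results[name] = [card.get_text() for card in cards]
    return results


# Stand-in markup; with the real page you would pass r.text instead.
sample = '<div class="card-body"><p>a</p><p>b</p></div>'
for name, texts in compare_parsers(sample).items():
    print(name, texts)
```

If one parser returns a noticeably shorter last entry than the others, that points at the parser rather than the download.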