How to scrape this using bs4

I need to get <a class="last" aria-label="Last Page" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a> from this website: https://webtoon-tr.com/webtoon/

But when I try to scrape it with this code:

from bs4 import BeautifulSoup
import requests

url = "https://webtoon-tr.com/webtoon/"
html = requests.get(url).content
soup = BeautifulSoup(html,"html.parser")

last = soup.find_all("a",{"class":"last"})
print(last)

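It only returns an empty list, and when I try to scrape all "a" tags, it returns just 2 results that are completely unrelated to what I am looking for.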

Can anyone help me? I would really appreciate it.
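An empty result like this often means the server sent back something other than the real page, so it can help to inspect the response before blaming the selector. The sketch below is only a heuristic: the function name `looks_like_challenge`, the status codes, and the marker strings are assumptions about what an anti-bot interstitial tends to look like, not an official API, and the sample headers and body are made up for illustration.

```python
def looks_like_challenge(status_code, headers, body):
    """Heuristic guess: is this an anti-bot interstitial instead of the real page?

    The markers below are assumptions, not a documented contract.
    """
    server = headers.get("Server", "").lower()
    text = body.lower()
    blocked_status = status_code in (403, 503)
    challenge_text = "just a moment" in text or "cf-chl" in text
    return server.startswith("cloudflare") and (blocked_status or challenge_text)


# Made-up sample values standing in for a real response.
sample_headers = {"Server": "cloudflare"}
challenge_body = "<html><head><title>Just a moment...</title></head></html>"
normal_body = '<html><body><a class="last" href="/webtoon/page/122/">Son</a></body></html>'

print(looks_like_challenge(403, sample_headers, challenge_body))
print(looks_like_challenge(200, sample_headers, normal_body))
```

With requests, the same check could be called as looks_like_challenge(resp.status_code, resp.headers, resp.text) on the response object. If it reports a challenge page, changing the parser or the selector will not help, because the target tag is simply not in the HTML that was received.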

Try using the requests_html library.

from bs4 import BeautifulSoup
import requests_html

url = "https://webtoon-tr.com/webtoon/"

s = requests_html.HTMLSession()

html = s.get(url)
soup = BeautifulSoup(html.content, "lxml")

last = soup.find_all("a", {"class":"last"})
print(last)

Output:

[<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>]

The website is protected by Cloudflare. requests, cloudscraper, and requests_html did not work for me; only selenium did:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup


chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

browser.get("https://webtoon-tr.com/webtoon/")
soup = BeautifulSoup(browser.page_source, 'html5lib')
browser.quit()
link = soup.select_one('a.last')
print(link)

This returns:

<a aria-label="Last Page" class="last" href="https://webtoon-tr.com/webtoon/page/122/">Son »</a>
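Once the tag is available, the usual next step is to turn its href into a page count. The following is a minimal sketch using only the standard library; the helper names `last_page_number` and `page_urls` are hypothetical, and the /page/N/ pattern for every page (including page 1) is an assumption inferred from the single URL shown above.

```python
import re
from urllib.parse import urlparse


def last_page_number(href):
    """Return the trailing page number from a URL like .../page/122/, or None."""
    path = urlparse(href).path
    match = re.search(r"/page/(\d+)/?$", path)
    return int(match.group(1)) if match else None


def page_urls(base_url, last_page):
    """Build listing URLs for pages 1..last_page (assumes a /page/N/ scheme)."""
    base = base_url.rstrip("/")
    return [f"{base}/page/{n}/" for n in range(1, last_page + 1)]


href = "https://webtoon-tr.com/webtoon/page/122/"
last = last_page_number(href)
urls = page_urls("https://webtoon-tr.com/webtoon/", last)
print(last)
print(urls[0], urls[-1])
```

With the selenium code above, the input would be link["href"] from the selected tag. Checking for None before building the URL list is worthwhile, since the pagination link may be missing if the page did not load as expected.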

