BeautifulSoup parser can't access html elements

I am trying to scrape the hrefs of all the listings. I am fairly new to BeautifulSoup, though I have done a bit of scraping before. But I can't for the life of me extract them. See my code below. The container has length zero when I run this script.

I also tried to select the price (soup.findAll("span", {"class": "amount"})), but that returns nothing either. Any advice most welcome :)

import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = 'https://www.takealot.com/computers/laptops-10130'   
headers = {}
headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)

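# resp.read() returns bytes; decode them rather than wrapping in str(),
# which would produce the "b'...'" repr instead of the HTML text.
respData = resp.read().decode('utf-8', errors='replace')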

soup = BeautifulSoup(respData, 'html.parser')

container = soup.find_all("div", {"class": "p-data left"})

The page is rendered with JavaScript, so the HTML that urllib downloads does not yet contain the listing elements. There are several ways to render the page and then scrape it.
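A quick way to confirm this diagnosis is to check whether the raw HTML contains the target elements at all. The sketch below uses only the standard library. The helper name count_elements and both sample HTML strings are made up for illustration and do not reflect the real takealot.com markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry all of the given classes."""

    def __init__(self, tag, classes):
        super().__init__()
        self.tag = tag
        self.classes = set(classes.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        present = set((dict(attrs).get("class") or "").split())
        if self.classes <= present:
            self.count += 1


def count_elements(html, tag, classes):
    """Hypothetical helper: how many <tag class="..."> elements are in html?"""
    counter = ClassCounter(tag, classes)
    counter.feed(html)
    return counter.count


# Made-up sample of a JavaScript-rendered page before any script has run:
# an empty mount point plus a script tag.
static_html = """
<html><body>
  <div id="app"></div>
  <script src="/bundle.js"></script>
</body></html>
"""

# Made-up sample of the same page after a browser has executed the script.
rendered_html = """
<html><body>
  <div id="app">
    <div class="p-data left"><span class="amount">3,999</span></div>
    <div class="p-data left"><span class="amount">4,499</span></div>
  </div>
</body></html>
"""

print(count_elements(static_html, "div", "p-data left"))    # 0
print(count_elements(rendered_html, "div", "p-data left"))  # 2
```

If the count is zero for the HTML fetched with urllib but non-zero for the source taken from a real browser, the elements are being added by JavaScript.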

I can scrape it with Selenium. First install Selenium:

sudo pip3 install selenium

Then get a driver from https://sites.google.com/a/chromium.org/chromedriver/downloads. If you are on Windows or Mac, you can use a headless version of Chrome, "Chrome Canary".

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.takealot.com/computers/laptops-10130'
browser.get(url)
respData = browser.page_source
browser.quit()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

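Alternatively use PyQt5. Note that this relies on the QtWebKit module, which was dropped from Qt as of 5.6, so it needs a PyQt5 build that still includes it: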

from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
from bs4 import BeautifulSoup
import sys


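class Render(QWebPage):
    """Load a URL in a QWebPage and block until loading has finished."""

    def __init__(self, url):
        # A QApplication must exist before any QWebPage is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the event loop; _loadFinished stops it once the page has loaded.
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so the caller can read the rendered HTML from it.
        self.frame = self.mainFrame()
        self.app.quit()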

url = 'https://www.takealot.com/computers/laptops-10130'
r = Render(url)
respData = r.frame.toHtml()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

Alternatively use dryscrape:

from bs4 import BeautifulSoup
import dryscrape

url = 'https://www.takealot.com/computers/laptops-10130'
session = dryscrape.Session()
session.visit(url)
respData = session.body()
soup = BeautifulSoup(respData, 'html.parser')
containers = soup.find_all("div", {"class": "p-data left"})
for container in containers:
    print(container.text)
    print(container.find("span", {"class": "amount"}).text)

Output in all cases:

Dell Inspiron 3162 Intel Celeron 11.6" Wifi Notebook (Various Colours)11.6 Inch Display; Wifi Only (Red; White & Blue Available)R 3,999R 4,999i20% OffeB 39,990Discovery Miles 39,990On Credit: R 372 / monthi
3,999
HP 250 G5 Celeron N3060 Notebook - Dark ash silverNBHPW4M70EAR 4,499R 4,999ieB 44,990Discovery Miles 44,990On Credit: R 419 / monthiIn StockShippingThis item is in stock in our CPT warehouse and can be shipped from there. You can also collect it yourself from our warehouse during the week or over weekends.CPT | ShippingThis item is in stock in our JHB warehouse and can be shipped from there. No collection facilities available, sorry!JHBWhen do I get it?
4,499
Asus Vivobook ...
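The loops above print each container's text and price, while the question asked for the hrefs. Below is a minimal sketch of pulling the link out of each container, assuming each container holds an <a> tag with an href attribute. The sample HTML and the function name extract_listings are invented for illustration and are not copied from the site. The sketch also guards against a container without an amount span, since .find() returns None in that case.

```python
from bs4 import BeautifulSoup

# Invented sample standing in for the rendered page source (respData).
sample_html = """
<div class="p-data left">
  <a href="/laptop-a/PLID1">Laptop A</a>
  <span class="amount">3,999</span>
</div>
<div class="p-data left">
  <a href="/laptop-b/PLID2">Laptop B</a>
</div>
"""


def extract_listings(html):
    """Return a list of (href, price) tuples; price is None when missing."""
    soup = BeautifulSoup(html, "html.parser")
    listings = []
    for container in soup.find_all("div", {"class": "p-data left"}):
        link = container.find("a", href=True)  # first <a> that has an href
        if link is None:
            continue
        amount = container.find("span", {"class": "amount"})
        price = amount.text.strip() if amount is not None else None
        listings.append((link["href"], price))
    return listings


for href, price in extract_listings(sample_html):
    print(href, price)
```

In the real scripts, extract_listings(respData) would take the place of the printing loop. The hrefs on the site may be relative, in which case urllib.parse.urljoin can turn them into absolute URLs.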

However, when testing with your URL I observed that the results were not reproducible every time; occasionally I got no content in "containers" even after the page had rendered.
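One possible cause, offered as a guess rather than a confirmed diagnosis, is that the page source was read before the scripts had finished inserting the listings. A minimal polling sketch under that assumption follows. wait_for is a hypothetical helper; in the Selenium version, fetch would be something like lambda: browser.page_source, with browser.quit() moved to after the wait.

```python
import time


def wait_for(fetch, is_ready, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until is_ready(result) is true or timeout expires.

    Returns the last fetched result, whether or not it passed is_ready.
    """
    deadline = time.monotonic() + timeout
    result = fetch()
    while not is_ready(result) and time.monotonic() < deadline:
        time.sleep(interval)
        result = fetch()
    return result


# Stand-in for a page that only contains the listings on the third read.
snapshots = iter(["<div id='app'></div>",
                  "<div id='app'></div>",
                  "<div class='p-data left'>Laptop</div>"])

html = wait_for(lambda: next(snapshots),
                lambda source: "p-data left" in source,
                timeout=5.0, interval=0.01)
print("p-data left" in html)  # True
```

Selenium also ships its own explicit-wait mechanism (WebDriverWait), which serves the same purpose when a browser driver is already in use.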
