Scraping JavaScript pages with PyQt5 and QWebEngineView

Question

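I am trying to render a JavaScript-driven web page into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. By the way, I am new to Python and basically at the cut/paste/experiment stage. I got past the installation and indentation issues, but now I'm stuck.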

In the test code below, print(sample_html) works and returns the original HTML of the target page, but print(render(sample_html)) always returns the word "None".

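Interestingly, if you run it against amazon.com, they detect that it is not a real browser and return HTML with a warning about automated access. The other test pages, however, return real HTML that should render, except that it doesn't.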

How do I fix the result always being "None"?

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    
    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()
            
            return Render(source_html).html

import requests
#url = 'http://webscraping.com'  
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))
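
A note on the "None": in the listing above, the `return Render(source_html).html` line is indented inside `callable`, so `render` defines the class and then falls off the end without ever instantiating it. The following is a minimal sketch of that effect with no Qt involved; `render_broken` and `render_fixed` are hypothetical names, and the stand-in `Render` class only stores a string rather than rendering anything.

```python
def render_broken(source_html):
    """Mirrors the layout above: the return sits inside a method."""

    class Render:
        def __init__(self, html):
            self.html = None
            self.callable(html)

        def callable(self, data):
            self.html = data

            # One indentation level too deep: this line belongs to
            # callable(), so render_broken() has no return of its own
            # and Render is never instantiated.
            return Render(source_html).html


def render_fixed(source_html):
    """Same stand-in class, with the return dedented to function level."""

    class Render:
        def __init__(self, html):
            self.html = None
            self.callable(html)

        def callable(self, data):
            self.html = data

    return Render(source_html).html


print(render_broken("<p>hi</p>"))  # None
print(render_fixed("<p>hi</p>"))   # <p>hi</p>
```

A function that reaches its end without a `return` statement yields `None` in Python, which matches the symptom described.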

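Edit: Thanks for the answers, which I have incorporated into the code. Now, however, it raises an error and the script hangs until I kill the Python launcher, which then causes a segfault.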

Here is the revised code:

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            # self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

# url = 'http://webscraping.com'
# url='http://www.amazon.com'
url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

This raises these errors:

$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None   (hangs here until force-quit python launcher)
Segmentation fault: 11
$

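I have started reading up on Python classes to fully understand what I'm doing (always a good thing). I suspect that something in my environment may be the problem (OS X Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?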
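
When an environment is under suspicion, one way to see which versions the interpreter actually picks up is a small check like the one below. This is only a sketch: `describe_environment` is a hypothetical helper name, and it assumes the installed PyQt5 exposes `QT_VERSION_STR` and `PYQT_VERSION_STR` in `PyQt5.QtCore` and `SIP_VERSION_STR` in the `sip` module. Import failures are recorded rather than raised, so it also runs where PyQt5 is missing or broken.

```python
import platform
import sys


def describe_environment():
    """Collect version strings for the interpreter and, if present, PyQt5/sip."""
    info = {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
    }

    try:
        from PyQt5.QtCore import PYQT_VERSION_STR, QT_VERSION_STR
        info["qt"] = QT_VERSION_STR
        info["pyqt"] = PYQT_VERSION_STR
    except ImportError as exc:
        # PyQt5 is absent or cannot be loaded by this interpreter.
        info["pyqt_error"] = str(exc)

    try:
        try:
            from PyQt5 import sip  # location used by newer PyQt5 builds
        except ImportError:
            import sip  # top-level module used by older builds
        info["sip"] = sip.SIP_VERSION_STR
    except ImportError as exc:
        info["sip_error"] = str(exc)

    return info


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print(key, "=", value)
```

Running this with the same `python3` that runs the scraper shows whether that interpreter sees the same Qt and sip builds that were installed.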

Answer 1

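The problem was the environment. I had installed Python 3.4.3, Qt5.4.1, and sip-4.16.6 manually and must have messed something up. After installing Anaconda, the script started working. Thanks again.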
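
For reference, with a sound environment the same idea can also be written without subclassing QWebEngineView. The following is a sketch under assumptions, not the code from the question and not tested against the versions listed above: `render_url` and `timeout_ms` are hypothetical names, it uses a bare `QWebEnginePage` instead of a view, and it adds a timeout so a failed load does not hang the script.

```python
import sys


def render_url(url, timeout_ms=30000):
    """Load url in QtWebEngine and return the rendered HTML, or None on failure."""
    # Imported lazily, as in the question, so the module imports without PyQt5.
    from PyQt5.QtCore import QTimer, QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    # Reuse an existing application object if there is one.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    result = {"html": None}

    def on_html(html):
        # toHtml() is asynchronous; store the data, then leave the event loop.
        result["html"] = html
        app.quit()

    def on_load_finished(ok):
        if ok:
            page.toHtml(on_html)
        else:
            app.quit()

    page.loadFinished.connect(on_load_finished)
    # Safety net (an assumption of this sketch): stop waiting after timeout_ms.
    QTimer.singleShot(timeout_ms, app.quit)
    page.load(QUrl(url))
    app.exec_()
    return result["html"]
```

Calling `print(render_url("https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"))` would mirror the test in the question. Keeping the `QApplication` outside the page object avoids creating it inside a half-initialized widget's `__init__`, which is the main structural difference from the original.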

Solution 1
0 2017-07-24 14:14:22
