How to download from a JavaScript-rendered webpage?

How can I download from links on a JavaScript-rendered webpage? Python is the preferred language.

So far, I've tried using the Python bindings for Selenium on a headless server. This approach is terribly slow, fraught with errors, and incapable of reliably determining download progress or success. Additionally, the headless server interferes with my clipboard (which is a problem). I used Firefox as it can be configured to download to a default directory, but I don't think the Chrome situation is any better.

Alternatively, I've tried using WebKit.

def render(url):
    """Fully render a webpage (JavaScript and all) and return the HTML."""

    import subprocess
    from textwrap import dedent

    script = dedent("""\
    import sys
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtWebKit import QWebPage

    class Render(QWebPage):

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            self.frame = self.mainFrame()
            self.app.quit()

    render = Render(sys.argv[1])
    print render.frame.toHtml().toAscii()""").encode()

    process = subprocess.Popen(['python2', '-', url],
                               stderr=subprocess.PIPE,
                               stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE)

    # pipe script into Python's stdin
    return process.communicate(script)[0].decode('latin1')

This would be great if not for the fact that I need the download to happen in the same session. Is there any way to preserve the session used to render the page? PyQt4 and WebKit are just a bunch of shared libraries. I'm not sure how to tear up the guts of them or whether such a thing is even possible.

Right now I'm just doing the following:

import requests

# get_url() and download() are my own helpers (not shown here).
with requests.Session() as session:
    html = session.get(url).text
    link = get_url(html)
    download(link, session=session)

Without getting into the details, get_url(html, url) simply extracts the JavaScript from the page, hacks away any calls to the DOM, then executes it in node. Really nasty stuff...
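For reference, a download helper of the kind called above might look roughly like the sketch below. It is hypothetical, not the actual code from this question: the `dest` and `progress` parameters are assumptions added for illustration, and `session` is only assumed to behave like a `requests.Session`. It streams the body in chunks, reports progress as it goes, and treats a byte count that disagrees with `Content-Length` as a failure, which is the kind of success check that is hard to get out of a browser-driven download.

```python
import os


def download(link, session, dest, chunk_size=64 * 1024, progress=None):
    """Stream `link` to `dest` through `session` and return the bytes written.

    `session` is assumed to expose get(url, stream=True) like requests.Session.
    `progress`, if given, is called as progress(bytes_so_far, expected_total).
    """
    response = session.get(link, stream=True)
    response.raise_for_status()  # fail early on HTTP error statuses

    # Content-Length can be missing, and it describes the encoded body, so it
    # is only used as a check when no Content-Encoding is applied.
    header = response.headers.get("Content-Length")
    expected = int(header) if header is not None else None
    if response.headers.get("Content-Encoding"):
        expected = None

    written = 0
    partial = dest + ".part"  # write to a temporary name first
    with open(partial, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            handle.write(chunk)
            written += len(chunk)
            if progress is not None:
                progress(written, expected)

    if expected is not None and written != expected:
        os.remove(partial)
        raise IOError("incomplete download: %d of %d bytes" % (written, expected))

    os.replace(partial, dest)  # only a complete file gets the final name
    return written
```

Writing to a `.part` file and renaming at the end means a file under the final name is always a complete download.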

Is there any way I can safely render a webpage and keep the session?

I'm also open to doing it completely in node if Python is not appropriate or the JavaScript alternative is much more elegant. It looks like perhaps node-dom might suffice? I'm not familiar enough with it to tell, but I'm interested in any suggestions.

If a direct command-line option is suitable for you instead of going through Python and/or Selenium, Google Chrome can be run in headless mode. It will do all the JavaScript rendering before dumping the DOM.

/usr/local/bin/google-chrome \
  --headless \
  --virtual-time-budget=10000 \
  --timeout=10000 \
  --run-all-compositor-stages-before-draw \
  --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36" \
  --disable-gpu \
  --dump-dom "https://example.com/index.html" > rendered.html
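Since Python is the preferred language in the question, the same command can be driven from a script. The following is a minimal sketch under a few assumptions: the function names `build_chrome_command` and `render_with_chrome` are made up for illustration, the binary is assumed to be reachable as `google-chrome` (pass the full path otherwise), and only a subset of the flags above is used. Note that this does not by itself carry cookies over from a `requests.Session`.

```python
import shutil
import subprocess


def build_chrome_command(url, chrome="google-chrome", budget_ms=10000):
    """Return the argument list for a headless --dump-dom run."""
    return [
        chrome,
        "--headless",
        "--disable-gpu",
        "--virtual-time-budget=%d" % budget_ms,
        "--dump-dom",
        url,
    ]


def render_with_chrome(url, chrome="google-chrome", budget_ms=10000, timeout=60):
    """Run headless Chrome and return the dumped DOM as text."""
    if shutil.which(chrome) is None:
        raise FileNotFoundError("Chrome binary not found: %s" % chrome)
    completed = subprocess.run(
        build_chrome_command(url, chrome, budget_ms),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,  # kill the browser if it hangs
        check=True,       # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.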

PyQt5 in Python 2 or 3 does the trick in this case. Note that the function is overly complex so as to support earlier versions of PyQt5 that use WebKit as well as later versions that use WebEngine.

import sys


def render(source_html):
    """Return rendered HTML."""
    try:
        from PyQt5.QtCore import QEventLoop
        from PyQt5.QtWebEngineWidgets import QWebEngineView
        from PyQt5.QtWidgets import QApplication

        class Render(QWebEngineView):
            """Render HTML with PyQt5 WebEngine."""

            def __init__(self, html):
                self.html = None
                self.app = QApplication(sys.argv)
                QWebEngineView.__init__(self)
                self.loadFinished.connect(self._loadFinished)
                self.setHtml(html)
                while self.html is None:
                    self.app.processEvents(
                        QEventLoop.ExcludeUserInputEvents |
                        QEventLoop.ExcludeSocketNotifiers |
                        QEventLoop.WaitForMoreEvents)
                self.app.quit()

            def _callable(self, data):
                self.html = data

            def _loadFinished(self, result):
                self.page().toHtml(self._callable)
    except ImportError:
        from PyQt5.QtWebKitWidgets import QWebPage
        from PyQt5.QtWidgets import QApplication

        class Render(QWebPage):
            """Render HTML with PyQt5 WebKit."""

            def __init__(self, html):
                self.html = None
                self.app = QApplication(sys.argv)
                QWebPage.__init__(self)
                self.loadFinished.connect(self._loadFinished)
                self.mainFrame().setHtml(html)
                self.app.exec_()

            def _loadFinished(self, result):
                self.html = self.mainFrame().toHtml()
                self.app.quit()

    return Render(source_html).html

Or PyQt4 in Python 2.

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):

    """Fully render HTML, JavaScript and all."""

    def __init__(self, html):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().setHtml(html)
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

render = Render(html)
result = str(render.frame.toHtml().toAscii())
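Both variants take the page source rather than a URL, so the page can be fetched with the same `requests.Session` that later performs the download. A rough sketch of how the pieces might fit together follows. The function name `fetch_render_download` and the `extract_link` parameter are hypothetical; `render` stands for either variant above wrapped as a function that maps HTML to rendered HTML, and `download` for a helper like the one in the question.

```python
def fetch_render_download(session, url, render, extract_link, download):
    """Fetch, render, extract the link, and download with one shared session.

    All four collaborators are passed in, so nothing here depends on a
    particular rendering backend.
    """
    response = session.get(url)  # this request carries the session's cookies
    response.raise_for_status()
    rendered = render(response.text)  # run the page's JavaScript on the source
    link = extract_link(rendered)     # e.g. read an href out of the result
    if link is None:
        raise ValueError("no download link found in rendered page")
    return download(link, session=session)  # same cookies as the first request
```

One caveat, as far as I can tell: `setHtml` is called without a base URL here, and the Qt engine does not share cookies with `requests`, so scripts that load further resources through relative URLs or need the session at render time may not behave as they would in a browser.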
