Scraping a JavaScript page with PyQt5 and QWebEngineView

I'm trying to render a JavaScript-driven webpage into populated HTML for scraping. Researching different solutions (Selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it working. BTW, I am new to Python, basically at the cut/paste/experiment stage. I got past installation and indentation issues, but I'm stuck now.

In the test code below, print(sample_html) works and prints the original HTML of the target page, but print(render(sample_html)) always prints 'None'.

Interestingly, if you run this on amazon.com, they detect that it is not a real browser and return HTML with a warning about automated access. The other test pages, however, return real HTML that should render, but it doesn't.

How do I troubleshoot the result always being "None"?

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    
    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()
            
            return Render(source_html).html

import requests
#url = 'http://webscraping.com'  
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))
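
Judging from the indentation in the listing above, a likely cause of the 'None' result is that the line `return Render(source_html).html` sits inside `callable`, after `self.app.quit()`. That leaves `render` itself with no return statement, so Python returns None implicitly. A minimal sketch without Qt reproduces the effect; the names `render_broken`, `render_fixed` and `Holder` are hypothetical and only stand in for the real classes:

```python
def render_broken(value):
    """The return is nested inside a method, so this function never returns a value."""

    class Holder:
        def __init__(self, v):
            self.html = v

        def callback(self, data):
            self.html = data
            # Mis-indented: this return belongs to callback(), not render_broken().
            return Holder(value).html


def render_fixed(value):
    """Same structure, with the return dedented to function level."""

    class Holder:
        def __init__(self, v):
            self.html = v

        def callback(self, data):
            self.html = data

    # At function level, so the caller receives the stored attribute.
    return Holder(value).html


print(render_broken("<html></html>"))  # None
print(render_fixed("<html></html>"))   # <html></html>
```

In the Qt version the same dedent applies: the return should line up with the `class Render` statement inside `render`.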

EDIT: Thanks for the responses, which I incorporated into the code. But now it raises an error, and the script hangs until I kill the Python launcher, which then causes a segfault.

This is the revised code:

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            # self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

# url = 'http://webscraping.com'
# url='http://www.amazon.com'
url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

Which throws these errors:

$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None   (hangs here until force-quit python launcher)
Segmentation fault: 11
$

I have already started reading up on Python classes to fully understand what I'm doing (always a good thing). I'm thinking something in my environment could be the problem (OS X Yosemite, Python 3.4.3, Qt 5.4.1, sip 4.16.6). Any other suggestions?

The problem was the environment. I had manually installed Python 3.4.3, Qt5.4.1, and sip-4.16.6 and must have mucked something up. After installing Anaconda, the script started working. Thanks again.
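
Since the fix turned out to be environmental, it can help to check which interpreter and Qt bindings Python actually picks up before debugging the script itself. The following is only a sketch: `report_versions` is a hypothetical helper, and it assumes a standard PyQt5 build that exposes `QT_VERSION_STR` and `PYQT_VERSION_STR` in `PyQt5.QtCore`. If PyQt5 cannot be imported, the error message is recorded instead of raising.

```python
import sys


def report_versions():
    """Collect the interpreter version and, if importable, the Qt/PyQt versions."""
    versions = {"python": sys.version.split()[0]}
    try:
        from PyQt5.QtCore import PYQT_VERSION_STR, QT_VERSION_STR
    except ImportError as exc:
        # A broken or missing install shows up here rather than deep inside the script.
        versions["pyqt5_error"] = str(exc)
    else:
        versions["qt"] = QT_VERSION_STR
        versions["pyqt"] = PYQT_VERSION_STR
    return versions


for name, value in report_versions().items():
    print(name, value)
```

Comparing this output between a hand-built install and an Anaconda environment should show whether the two are really loading the same Qt and PyQt versions.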