
Using Python's QtWebKit to render a JavaScript-based page gives "QThread: Destroyed while thread is still running"

I'm trying to write a BeautifulSoup crawler for a web page that is loaded via JavaScript, which BeautifulSoup cannot parse on its own. To work around this, I followed a tutorial on rendering the page with QtWebKit and then extracting all of the hrefs from the resulting HTML with BeautifulSoup.

However, the page I'm scraping is very large, and before the script has finished getting these links it throws the error "QThread: Destroyed while thread is still running". Numerous people have posted questions about this error and received answers, but those were all for much more complex projects with PyQt at the core of the application. The responses assume familiarity with the library, and I'm having real trouble applying them to my case.

It seems like I need to keep the thread from being garbage collected by saving it in a variable, but the correct way to do this eludes me.

Here is my code:

import sys
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from bs4 import BeautifulSoup

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)

    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://www.lolesports.com/en_US/msi/msi_2016/schedule/default'  
r = Render(url)  
result = r.frame.toHtml()

soup = BeautifulSoup(result, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

There is nothing obviously wrong with the script, and apart from a few harmless Qt warning messages, it seems to work more or less as expected.

The Qt message you are seeing does not point to a specific fault, and it does not necessarily indicate a fatal error. I suspect the script is actually working correctly for you, and you may just be misinterpreting the output.

If you want to get rid of all the Qt messages, put the following line at the top of the script (just below the imports):

qInstallMsgHandler(lambda *args: None)
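
If silencing every message feels too blunt, a handler can drop only the known-noisy ones and pass the rest through. The following is a minimal sketch, not a definitive implementation: make_handler, install, and IGNORED_PREFIXES are hypothetical names chosen for illustration, and the prefixes are simply taken from the output shown in this answer. It assumes the handler is called with a message type and the message text, and that the text may arrive as either str or bytes.

```python
import sys

# Prefixes of messages treated as harmless noise (an assumption based on
# the warnings that appear in the output of this answer).
IGNORED_PREFIXES = (
    "QFont::setPixelSize",
    "QNetworkReplyImplPrivate::error",
)


def make_handler(ignored_prefixes=IGNORED_PREFIXES, sink=None):
    """Build a (msg_type, message) callback that drops known-noisy messages.

    If `sink` is a list, kept messages are appended to it; otherwise they
    are written to stderr.
    """
    def handler(msg_type, message):
        # The text may be bytes depending on the binding and Python version.
        if isinstance(message, bytes):
            message = message.decode("utf-8", "replace")
        if message.startswith(tuple(ignored_prefixes)):
            return  # known noise: ignore it
        if sink is not None:
            sink.append(message)
        else:
            sys.stderr.write(message + "\n")
    return handler


def install():
    """Register the filtering handler with Qt (requires PyQt4)."""
    from PyQt4.QtCore import qInstallMsgHandler
    qInstallMsgHandler(make_handler())
```

Calling install() just below the imports would take the place of the one-line lambda; anything not matching the ignored prefixes, such as the QThread message, would still be printed.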

PS: Here is the exact output I get:

QFont::setPixelSize: Pixel size <= 0 (0)
QFont::setPixelSize: Pixel size <= 0 (0)
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
QFont::setPixelSize: Pixel size <= 0 (0)
QFont::setPixelSize: Pixel size <= 0 (0)
http://na.leagueoflegends.com/en/
http://na.leagueoflegends.com/en/
http://na.leagueoflegends.com/en/news/
http://gameinfo.na.leagueoflegends.com/en/game-info/
http://nexus.leagueoflegends.com/
http://www.lolesports.com/en_US
http://boards.na.leagueoflegends.com/en/
http://ulol.leagueoflegends.com/
https://support.riotgames.com/hc/en-us
https://na.merch.riotgames.com/en/
http://na.leagueoflegends.com/en/news/
http://gameinfo.na.leagueoflegends.com/en/game-info/
http://nexus.leagueoflegends.com/
http://www.lolesports.com/en_US
http://boards.na.leagueoflegends.com/en/
http://ulol.leagueoflegends.com/
https://support.riotgames.com/hc/en-us
https://na.merch.riotgames.com/en/
http://signup.na.leagueoflegends.com/en
None
http://www.lolesports.com/en_US
http://eu.lolesports.com/en
http://eu.lolesports.com/pl
http://eu.lolesports.com/en
http://eu.lolesports.com/fr
http://eu.lolesports.com/es
http://eu.lolesports.com/de
http://lan.lolesports.com/
http://las.lolesports.com/
http://lolesports.com.br/
http://www.lolespor.com/
http://oce.lolesports.com/
javascript:;
javascript:;
javascript:;
javascript:;
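
The list above contains a None entry (an anchor with no href), several javascript:; placeholders, and repeated URLs. If only distinct, real links are wanted, the hrefs can be filtered before printing. This is a small sketch under that assumption; clean_links is a hypothetical helper name, not part of BeautifulSoup.

```python
def clean_links(hrefs):
    """Drop missing hrefs, javascript: pseudo-links, and duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # anchor without an href (None) or an empty string
        if href.strip().lower().startswith("javascript:"):
            continue  # placeholder link, not a real URL
        if href in seen:
            continue  # already reported
        seen.add(href)
        cleaned.append(href)
    return cleaned
```

In the original script this could be used as clean_links(link.get('href') for link in soup.find_all('a')), with the loop printing the returned list instead.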


 