Using Python's QtWebkit to render Javascript-based page, get QThread: Destroyed while thread is still running

I'm trying to write a BeautifulSoup crawler for a web page whose content is loaded via JavaScript, which BeautifulSoup cannot parse on its own. To work around this, I followed a tutorial on rendering the page with QtWebKit, and then used BeautifulSoup to extract all of the hrefs from the resulting HTML.

However, the page being scraped is very large, and before the script has finished getting these links it throws the error "QThread: Destroyed while thread is still running". Numerous people have posted questions about this error and received answers, but those were all for much more complex projects with PyQt at the core of the application. The answers assume familiarity with the library, and I'm having real trouble applying them to my case.

It seems like I need to keep the thread from being garbage collected by saving it in a variable, but the correct way to do this eludes me.

Here is my code:

import sys
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from bs4 import BeautifulSoup

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)

    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://www.lolesports.com/en_US/msi/msi_2016/schedule/default'  
r = Render(url)  
result = r.frame.toHtml()

soup = BeautifulSoup(result, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))

There is nothing obviously wrong with the script, and apart from a few harmless Qt warning messages, it seems to work more or less as expected.

The Qt message you are seeing does not point to any specific problem, and it doesn't necessarily indicate a fatal error condition. So I suspect the script is actually working correctly for you, and you may just be misinterpreting the output.

If you want to get rid of all the Qt messages, put the following line at the top of the script (just below the imports):

qInstallMsgHandler(lambda *args: None)
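
If silencing everything feels too broad, a more selective handler is one possible variation. The sketch below is not part of the original script: `make_message_handler`, `IGNORED_PREFIXES` and the `sink` parameter are hypothetical names, and the prefixes are simply taken from the warnings shown in the output further down. It assumes the handler is called with a message type and the message text, and it tolerates the text arriving as bytes.

```python
import sys

# Message prefixes treated as noise (an assumption, based on the warnings
# that appear in the output of this particular script).
IGNORED_PREFIXES = (
    "QFont::setPixelSize",
    "QNetworkReplyImplPrivate::error",
)


def make_message_handler(ignored_prefixes=IGNORED_PREFIXES, sink=None):
    """Return a handler(msg_type, message) that drops known-noisy messages."""
    prefixes = tuple(ignored_prefixes)

    def handler(msg_type, message):
        # The text may arrive as bytes depending on the Python/PyQt4 combination.
        if isinstance(message, bytes):
            message = message.decode("utf-8", "replace")
        if message.startswith(prefixes):
            return  # known noise: ignore it
        if sink is not None:
            sink(message)
        else:
            sys.stderr.write(message + "\n")

    return handler


# Install the handler only when PyQt4 is importable, so the filtering logic
# can also be exercised without Qt.
try:
    from PyQt4.QtCore import qInstallMsgHandler
except ImportError:
    qInstallMsgHandler = None

message_handler = make_message_handler()
if qInstallMsgHandler is not None:
    qInstallMsgHandler(message_handler)
```

With this in place, the font and network warnings would be dropped, while anything else (including the "QThread: Destroyed while thread is still running" message) would still reach stderr.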

PS: Here is the exact output I get:

QFont::setPixelSize: Pixel size <= 0 (0)
QFont::setPixelSize: Pixel size <= 0 (0)
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
QFont::setPixelSize: Pixel size <= 0 (0)
QFont::setPixelSize: Pixel size <= 0 (0)
http://na.leagueoflegends.com/en/
http://na.leagueoflegends.com/en/
http://na.leagueoflegends.com/en/news/
http://gameinfo.na.leagueoflegends.com/en/game-info/
http://nexus.leagueoflegends.com/
http://www.lolesports.com/en_US
http://boards.na.leagueoflegends.com/en/
http://ulol.leagueoflegends.com/
https://support.riotgames.com/hc/en-us
https://na.merch.riotgames.com/en/
http://na.leagueoflegends.com/en/news/
http://gameinfo.na.leagueoflegends.com/en/game-info/
http://nexus.leagueoflegends.com/
http://www.lolesports.com/en_US
http://boards.na.leagueoflegends.com/en/
http://ulol.leagueoflegends.com/
https://support.riotgames.com/hc/en-us
https://na.merch.riotgames.com/en/
http://signup.na.leagueoflegends.com/en
None
http://www.lolesports.com/en_US
http://eu.lolesports.com/en
http://eu.lolesports.com/pl
http://eu.lolesports.com/en
http://eu.lolesports.com/fr
http://eu.lolesports.com/es
http://eu.lolesports.com/de
http://lan.lolesports.com/
http://las.lolesports.com/
http://lolesports.com.br/
http://www.lolespor.com/
http://oce.lolesports.com/
javascript:;
javascript:;
javascript:;
javascript:;
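
The `None` and `javascript:;` entries above come from anchors that have no usable href, and several links appear more than once. If only real, unique URLs are wanted, the list could be post-processed along these lines. This is a minimal sketch using only the standard library (Python 3); `clean_links` is a hypothetical helper, and dropping duplicates and non-HTTP schemes is an assumption about what the crawler needs.

```python
from urllib.parse import urljoin, urlparse


def clean_links(hrefs, base_url):
    """Return unique absolute http(s) URLs, preserving first-seen order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        # Resolve relative links against the URL of the scraped page.
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips javascript:, mailto:, and similar
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Usage with the script from the question (soup and url as defined there):
#   hrefs = (a.get('href') for a in soup.find_all('a'))
#   for link in clean_links(hrefs, url):
#       print(link)
```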
