
PyQt4 Scrapy Implementation

Using Scrapy, I ran into a problem with pages rendered by JavaScript. For example, on this thread from the Forum Franchise site, http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, when I scrape the source HTML I can't find any of the posts, because they seem to be "attached" after the page is rendered (probably via JavaScript).

So I searched the web for a way around this and came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/

I am completely new to PyQt, but I was hoping to take a shortcut and copy-paste some code.

This worked perfectly when I tried scraping a single page. But when I implemented it inside Scrapy, I get the following errors:

QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()

If I scrape only a single page, no error occurs; but when I set the crawler to recursive mode, then at the second link I get an error saying that python.exe has stopped working, along with the errors above.

I searched for what the cause might be, and read somewhere that the QApplication object should only be instantiated once.

Could someone please tell me what the correct implementation would be?

The spider:

# -*- coding: utf-8 -*-
import scrapy
import sys, traceback
from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils


class IdeefranchiseSpider(CrawlSpider):
    name = "ideefranchise"
    allowed_domains = ["idee-franchise.com"]
    start_urls = (
        'http://www.idee-franchise.com/forum/',
        # 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
    )

    rules = [
        Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
    ]

    def parse_thread(self, response):
        print "Parsing Thread", response.url
        thread = ThreadItem()
        thread['url'] = response.url
        thread['domain'] = self.allowed_domains[0]
        thread['title'] = self.get_thread_title(response)
        thread['forumname'] = self.get_thread_forum_name(response)
        thread['posts'] = self.get_thread_posts(response)
        yield thread

        # paginate if possible
        next_page = response.css('fieldset.display-options > a::attr("href")')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_thread)

    def get_thread_posts(self, response):
        # using PYQTRenderor to reload page. I think this is where the problem
        # occurs, when i initiate the PYQTPageRenderor object. 
        soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))

        # sleep so that PYQT can render page
        # time.sleep(5)

        # comments
        posts = []
        for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
            try:
                post = PostItem()
                post['profile'] = item.select("p.author > strong > a")[0].get_text()
                details = item.select('dl.postprofile > dd')
                post['date'] = details[2].get_text()
                post['content'] = item.select('div.content')[0].get_text()

                # appending the comment
                posts.append(post)
            except:
                e = sys.exc_info()[0]
                self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
                traceback.print_exc(file=sys.stdout)
        return posts

The PyQt implementation:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage 

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


class PYQTPageRenderor(object):
    def __init__(self, url):
        self.url = url

    def get_html(self):
        r = Render(self.url)
        return unicode(r.frame.toHtml())

The correct implementation, if you want to do this yourself, is to create a downloader middleware that handles requests using PyQt. It will be instantiated only once by Scrapy.

It shouldn't be that complicated; just:

  1. Create the QTDownloader class in your project's middleware.py file.

  2. The constructor should create the QApplication object.

  3. The process_request method should do the URL loading and the HTML fetching. Note that you return a Response object with the HTML string.

  4. You can do the appropriate cleanup in a _cleanup method of your class.

  5. Finally, activate the middleware by adding it to the DOWNLOADER_MIDDLEWARES variable in your project's settings.py file.
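Putting those steps together, a minimal sketch of such a middleware might look like the following. The class name QTDownloader and the _cleanup hook come from the outline above; everything else (the module layout, deferring the PyQt4 imports into the methods, the rendering logic reused from the question's Render class) is an illustrative assumption, not an existing API:

```python
import sys


class QTDownloader(object):
    # Hypothetical downloader middleware (a sketch, not Scrapy's API
    # surface): Scrapy instantiates middleware classes once, so exactly
    # one QApplication is ever created per process.
    def __init__(self):
        # PyQt4 imports are deferred into the methods so the module can
        # be imported (e.g. by Scrapy's settings machinery) even where
        # PyQt4 is not installed.
        from PyQt4.QtGui import QApplication
        self.app = QApplication(sys.argv)

    def process_request(self, request, spider):
        # Render the page with QWebPage and hand Scrapy a ready-made
        # response, short-circuiting the default downloader.
        from PyQt4.QtCore import QUrl
        from PyQt4.QtWebKit import QWebPage
        from scrapy.http import HtmlResponse

        page = QWebPage()
        page.loadFinished.connect(self.app.quit)
        page.mainFrame().load(QUrl(request.url))
        self.app.exec_()  # blocks until loadFinished fires

        # PyQt4 returns a QString; unicode() is the Python 2 conversion
        # used elsewhere in this post.
        body = unicode(page.mainFrame().toHtml()).encode('utf-8')
        return HtmlResponse(url=request.url, body=body, encoding='utf-8')

    def _cleanup(self):
        # Release the QApplication when the crawler shuts down.
        self.app.quit()
```

To activate it (step 5), you would add an entry such as 'crawler.middlewares.QTDownloader': 543 to DOWNLOADER_MIDDLEWARES in settings.py; the module path and the priority number are assumptions based on the project name used in the question.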

If you don't want to write your own solution, you could use an existing middleware that does the downloading with Selenium, such as scrapy-webdriver. If you don't want a visible browser, you can instruct it to use PhantomJS.

EDIT1: So, as pointed out by Rejected, the proper way to do this is to use a download handler. The idea is similar, but the downloading should happen in a download_request method, and it should be enabled by adding it to DOWNLOAD_HANDLERS. Take the WebdriverDownloadHandler as an example.
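For the download-handler variant, enabling it is again a settings.py change: handlers are registered per URL scheme. A sketch, assuming a hypothetical handler class at crawler.handlers.QTDownloadHandler (scrapy-webdriver's WebdriverDownloadHandler is wired up the same way):

```python
# settings.py -- register a custom download handler per URL scheme.
# The module path crawler.handlers.QTDownloadHandler is hypothetical;
# the handler class itself would implement download_request.
DOWNLOAD_HANDLERS = {
    'http': 'crawler.handlers.QTDownloadHandler',
    'https': 'crawler.handlers.QTDownloadHandler',
}
```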


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact: yoyou2525@163.com.

 