
Can scrapy be used to scrape dynamic content from websites that are using AJAX?

I have recently been learning Python and am dipping my hand into building a web-scraper. It's nothing fancy at all; its only purpose is to get the data off of a betting website and put that data into Excel.

Most of the issues are solvable and I'm having a good little mess around. However, I'm hitting a massive hurdle over one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers are pushing through all the interesting data that I need.

Now my experience with dynamic web content is low, so this is something I'm having trouble getting my head around.

I think Java or Javascript is the key; this pops up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the scrapy library with Python 2.7.

I do apologize if this question is too open-ended. In short, my question is: how can scrapy be used to scrape this dynamic data so that I can use it? So that I can scrape this betting odds data in real time?

Here is a simple example of scrapy with an AJAX request. Let's look at the site rubin-kazan.ru.

All messages are loaded with an AJAX request. My goal is to fetch these messages with all their attributes (author, date, ...):

[screenshot]

When I analyze the source code of the page I can't see all these messages, because the web page uses AJAX technology. But I can use Firebug from Mozilla Firefox (or an equivalent tool in other browsers) to analyze the HTTP requests that generate the messages on the web page:

[screenshot]

It doesn't reload the whole page, but only the parts of the page that contain messages. For this purpose I click an arbitrary page number at the bottom:

[screenshot]

And I observe the HTTP request that is responsible for the message body:

[screenshot]

Afterwards, I analyze the headers of the request (I should note that I will extract this URL from the var section of the source page; see the code below):

[screenshot]

And the form data of the request (the HTTP method is "POST"):

[screenshot]

And the content of the response, which is JSON:

[screenshot]

It presents all the information I'm looking for.

Now I must implement all this knowledge in scrapy. Let's define the spider for this purpose:

import re
import json

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest


class spider(BaseSpider):
    name = 'RubiGuesst'
    start_urls = ['http://www.rubin-kazan.ru/guestbook.html']

    def parse(self, response):
        # The URL of the AJAX endpoint is stored in a javascript variable on the page.
        url_list_gb_messages = re.search(r'url_list_gb_messages="(.*)"', response.body).group(1)
        yield FormRequest('http://www.rubin-kazan.ru' + url_list_gb_messages, callback=self.RubiGuessItem,
                          formdata={'page': '1', 'uid': ''})  # request the first page of messages

    def RubiGuessItem(self, response):
        # The response body is JSON with the messages and all their attributes.
        json_file = json.loads(response.body)
In the parse function I have the response to the first request. In RubiGuessItem I have the JSON with all the information.

Webkit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu -> Tools -> Developer Tools. The Network tab allows you to see all information about every request and response:

[screenshot]

At the bottom of the picture you can see that I've filtered the requests down to XHR - these are requests made by javascript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

After analyzing requests and responses you can simulate these requests from your web-crawler and extract valuable data. In many cases it will be easier to get your data this way than by parsing the HTML, because that data does not contain presentation logic and is formatted to be accessed by javascript code.
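
For example, a minimal sketch of replaying such an XHR request from a spider might look like the following; the endpoint, parameters and JSON field names are placeholders rather than a real API:

import json

import scrapy


class XhrReplaySpider(scrapy.Spider):
    name = 'xhr_replay'
    # Hypothetical endpoint discovered in the browser's Network/XHR tab.
    start_urls = ['http://example.com/ajax/odds?page=1']

    def parse(self, response):
        # The endpoint returns JSON, so there is no HTML to parse.
        data = json.loads(response.body)
        for row in data.get('items', []):  # 'items' is an assumed field name
            yield {'name': row.get('name'), 'price': row.get('price')}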

Firefox has a similar extension, called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of webkit.

Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery craziness).

However, if you use Scrapy along with the web testing framework Selenium, then you are able to crawl anything displayed in a normal web browser.

Some things to note:

  • You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also, this is just a template crawler. You could get much crazier and more advanced with things, but I just wanted to show the basic idea. As the code stands now, you will be doing two requests for any given url: one request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request, but I did not bother to implement that, and by doing two requests you get to crawl the page with Scrapy too.

  • This is quite powerful, because now you have the entire rendered DOM available for you to crawl, and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course, but depending on how much you need the rendered DOM it might be worth the wait.

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: wynbennett
# date  : Jun 21, 2011
import time

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.item import Item
from selenium import selenium


class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors
        CrawlSpider.__del__(self)

    def parse_page(self, response):
        item = Item()
        hxs = HtmlXPathSelector(response)
        # Do some XPath selection with Scrapy
        hxs.select('//div').extract()
        sel = self.selenium
        sel.open(response.url)
        # Wait for javascript to load in Selenium
        time.sleep(2.5)
        # Do some crawling of javascript-created content with Selenium
        sel.get_text("//div")
        yield item

Reference: http://snipplr.com/view/66998/

Another solution would be to implement a download handler or download handler middleware (see the scrapy docs for more information on downloader middleware). The following is an example class using selenium with a headless phantomjs webdriver:

1) Define the class within the middlewares.py script.

from selenium import webdriver
from scrapy.http import HtmlResponse


class JsDownload(object):

    @check_spider_middleware  # optional decorator defined in the "Optional Addon" section below
    def process_request(self, request, spider):
        # Render the page with headless PhantomJS and hand the HTML back to scrapy.
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs.exe')
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))

2) Add the JsDownload() class to the DOWNLOADER_MIDDLEWARES variable within settings.py:

DOWNLOADER_MIDDLEWARES = {'MyProj.middlewares.JsDownload': 500}

3) Integrate the HtmlResponse within your_spider.py. Decoding the response body will get you the desired output.

from scrapy.contrib.spiders import CrawlSpider

from MyProj.items import CrawlerItem  # item class assumed to be defined in items.py


class Spider(CrawlSpider):
    # define unique name of spider
    name = "spider"

    start_urls = ["https://www.url.de"]

    def parse(self, response):
        # initialize items
        item = CrawlerItem()

        # store data as items
        item["js_enabled"] = response.body.decode("utf-8")
        yield item

Optional Addon:
I wanted the ability to tell different spiders which middleware to use, so I implemented this wrapper:

import functools

from scrapy import log


def check_spider_middleware(method):
    @functools.wraps(method)
    def wrapper(self, request, spider):
        msg = '%%s %s middleware step' % (self.__class__.__name__,)
        if self.__class__ in spider.middleware:
            spider.log(msg % 'executing', level=log.DEBUG)
            return method(self, request, spider)
        else:
            spider.log(msg % 'skipping', level=log.DEBUG)
            return None

    return wrapper

For the wrapper to work, all spiders must have at minimum:

middleware = set([])

To include a middleware:

middleware = set([MyProj.middleware.ModuleName.ClassName])
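
Putting the pieces together, a spider that opts in to the JsDownload middleware could look roughly like this sketch; the MyProj.middlewares import path is an assumption about your project layout:

import scrapy

from MyProj.middlewares import JsDownload  # hypothetical project/module path


class JsEnabledSpider(scrapy.Spider):
    name = "js_enabled"
    start_urls = ["https://www.url.de"]

    # opt this spider into the PhantomJS middleware defined above
    middleware = set([JsDownload])

    def parse(self, response):
        # response.body now contains the javascript-rendered HTML
        yield {"html": response.body.decode("utf-8")}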

Advantage:
The main advantage of implementing it this way rather than in the spider is that you only end up making one request. In AT's solution, for example, the download handler processes the request and then hands off the response to the spider. The spider then makes a brand-new request in its parse_page function -- that's two requests for the same content.

I was using a custom downloader middleware, but wasn't very happy with it, as I didn't manage to make the cache work with it.

A better approach was to implement a custom download handler.

There is a working example here. It looks like this:

# encoding: utf-8
from __future__ import unicode_literals

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.responsetypes import responsetypes
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver
from six.moves import queue
from twisted.internet import defer, threads
from twisted.python.failure import Failure


class PhantomJSDownloadHandler(object):

    def __init__(self, settings):
        self.options = settings.get('PHANTOMJS_OPTIONS', {})

        max_run = settings.get('PHANTOMJS_MAXRUN', 10)
        self.sem = defer.DeferredSemaphore(max_run)
        self.queue = queue.LifoQueue(max_run)

        SignalManager(dispatcher.Any).connect(self._close, signal=signals.spider_closed)

    def download_request(self, request, spider):
        """use semaphore to guard a phantomjs pool"""
        return self.sem.run(self._wait_request, request, spider)

    def _wait_request(self, request, spider):
        try:
            driver = self.queue.get_nowait()
        except queue.Empty:
            driver = webdriver.PhantomJS(**self.options)

        driver.get(request.url)
        # ghostdriver won't respond to window switches until the page is loaded
        dfd = threads.deferToThread(lambda: driver.switch_to.window(driver.current_window_handle))
        dfd.addCallback(self._response, driver, spider)
        return dfd

    def _response(self, _, driver, spider):
        body = driver.execute_script("return document.documentElement.innerHTML")
        if body.startswith("<head></head>"):  # cannot access response header in Selenium
            body = driver.execute_script("return document.documentElement.textContent")
        url = driver.current_url
        respcls = responsetypes.from_args(url=url, body=body[:100].encode('utf8'))
        resp = respcls(url=url, body=body, encoding="utf-8")

        response_failed = getattr(spider, "response_failed", None)
        if response_failed and callable(response_failed) and response_failed(resp, driver):
            driver.close()
            return defer.fail(Failure())
        else:
            self.queue.put(driver)
            return defer.succeed(resp)

    def _close(self):
        while not self.queue.empty():
            driver = self.queue.get_nowait()
            driver.close()

Suppose your scraper is called "scraper". If you put the mentioned code inside a file called handlers.py at the root of the "scraper" folder, then you could add this to your settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'scraper.handlers.PhantomJSDownloadHandler',
    'https': 'scraper.handlers.PhantomJSDownloadHandler',
}

And voilà, you get the JS-parsed DOM, with scrapy's cache, retries, etc.

how can scrapy be used to scrape this dynamic data so that I can use it?

I wonder why no one has posted the solution using Scrapy only.

Check out the blog post from the Scrapy team, SCRAPING INFINITE SCROLLING PAGES. The example scrapes the http://spidyquotes.herokuapp.com/scroll website, which uses infinite scrolling.

The idea is to use the Developer Tools of your browser to notice the AJAX requests, and then, based on that information, create the requests for Scrapy.

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

Sometimes the data is generated by an external URL - an API that is called with the POST method and returns an HTML response.

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'  
    def start_requests(self):
        url = 'https://howlongtobeat.com/search_results?page=1'
        payload = "queryString=&t=games&sorthead=popular&sortd=0&plat=&length_type=main&length_min=&length_max=&v=&f=&g=&detail=&randomize=0"
        headers = {
            "content-type":"application/x-www-form-urlencoded",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }

        yield scrapy.Request(url,method='POST', body=payload,headers=headers,callback=self.parse)

    def parse(self, response):
        cards = response.css('div[class="search_list_details"]')

        for card in cards: 
            game_name = card.css('a[class=text_white]::attr(title)').get()
            yield {
                "game_name":game_name
            }
           

if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Yes, Scrapy can scrape dynamic websites, i.e. websites that are rendered through JavaScript.

There are two approaches to scraping these kinds of websites.

First,

you can use splash to render the Javascript code and then parse the rendered HTML. You can find the docs and the project here: Scrapy splash, git.
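
For a rough idea of this approach, a spider using scrapy-splash might look like the sketch below; it assumes a Splash instance is running (e.g. at http://localhost:8050), that the scrapy-splash middlewares are enabled in settings.py as described in its README, and that the URL and selectors are placeholders:

import scrapy
from scrapy_splash import SplashRequest


class SplashedSpider(scrapy.Spider):
    name = 'splashed'

    def start_requests(self):
        # Render the page in Splash before it reaches the parse callback.
        yield SplashRequest('https://example.com/odds', self.parse,
                            args={'wait': 2})  # give the page's javascript time to run

    def parse(self, response):
        # response.body is now the javascript-rendered HTML
        for row in response.css('table tr'):
            yield {'row_text': row.css('::text').getall()}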

Second,

as everyone is stating, by monitoring the network calls, yes, you can find the API call that fetches the data, and mocking that call in your scrapy spider might help you get the desired data.

I handle the ajax requests by using Selenium and the Firefox web driver. It is not that fast if you need the crawler as a daemon, but much better than any manual solution. I wrote a short tutorial here for reference.
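
As a loose sketch of this approach (not the code from the linked tutorial), driving Firefox from a spider callback might look like this, assuming Selenium and the Firefox driver (geckodriver) are installed; the URL is a placeholder:

import time

import scrapy
from selenium import webdriver


class FirefoxAjaxSpider(scrapy.Spider):
    name = 'firefox_ajax'
    start_urls = ['https://example.com/page-with-ajax']

    def parse(self, response):
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            time.sleep(3)  # crude wait for the AJAX content to finish loading
            rendered_html = driver.page_source
        finally:
            driver.quit()
        yield {'html': rendered_html}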

There are a few more modern alternatives in 2022 that I think should be mentioned, and I would like to list some pros and cons of the methods discussed in the more popular answers to this question.

  1. The top answer and several others discuss using the browser's dev tools or packet-capturing software to try to identify patterns in the response url's, and to re-construct them for use as scrapy.Request objects.

    • Pros: This is still the best option in my opinion, and when it is available it is quick and often even simpler than the traditional approach, i.e. extracting content from the HTML using xpath and css selectors.

    • Cons: Unfortunately this is only available on a fraction of dynamic sites, and frequently websites have security measures in place that make this strategy difficult to use.

  2. Using Selenium Webdriver is the other approach mentioned a lot in previous answers.

    • Pros: It's easy to implement and integrate into the scrapy workflow. Additionally, there are a ton of examples, and it requires very little configuration if you use 3rd-party extensions like scrapy-selenium.

    • Cons: It's slow. One of scrapy's key features is its asynchronous workflow, which makes it easy to crawl dozens or even hundreds of pages in seconds. Using selenium cuts this down significantly.

There are two new methods that are definitely worth consideration: scrapy-splash and scrapy-playwright.

scrapy-splash:

  • A scrapy plugin that integrates splash, a javascript rendering service created and maintained by the developers of scrapy, into the scrapy workflow. The plugin can be installed from pypi with pip3 install scrapy-splash, while splash needs to run in its own process and is easiest to run from a docker container. A minimal settings sketch follows below.
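
As an illustration, and assuming Docker is available, wiring Splash into a project might look roughly like this; the values follow the defaults suggested in the scrapy-splash README:

# settings.py - minimal scrapy-splash wiring (a sketch, not a drop-in config)
# Splash itself is started separately, e.g.: docker run -p 8050:8050 scrapinghub/splash

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'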

scrapy-playwright:

  • Playwright is a browser automation tool, kind of like selenium, but without the crippling decrease in speed that comes with using selenium. Playwright has no issues fitting into the asynchronous scrapy workflow, making sending requests just as quick as using scrapy alone. It is also much easier to install and integrate than selenium. The scrapy-playwright plugin is maintained by the developers of scrapy as well, and after installing it from pypi with pip3 install scrapy-playwright, setup is as easy as running playwright install in the terminal. A minimal configuration sketch follows below.
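
As a rough sketch, enabling the plugin and requesting a rendered page might look like this; the settings names follow the scrapy-playwright README, and the target URL is a placeholder:

import scrapy


class PlaywrightSpider(scrapy.Spider):
    name = 'playwright_example'

    custom_settings = {
        # hand http/https downloads over to playwright
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        # meta={'playwright': True} asks playwright to render this request
        yield scrapy.Request('https://example.com', meta={'playwright': True})

    def parse(self, response):
        # response.text is the javascript-rendered HTML
        yield {'title': response.css('title::text').get()}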

More details and many examples can be found on each plugin's github page: https://github.com/scrapy-plugins/scrapy-playwright and https://github.com/scrapy-plugins/scrapy-splash.

P.S. Both projects tend to work better in a linux environment in my experience. For windows users, I recommend using them with the Windows Subsystem for Linux (WSL).
