Python Scrapy returns 200 but closes the spider without scraping anything

Question

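I'm new to Scrapy and trying to scrape some simple HTML tables. I found a site that has two different tables with the same schema on the same page, but scraping seems to work for one of them and not for the other. Here is the link: https://fbref.com/en/comps/12/stats/La-Liga-Stats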

My code that works (for the first table, the one at the top):

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard_squads"]//tbody/tr'):
            yield {
                'players': row.xpath('td[2]//text()').extract_first(),
                'possession': row.xpath('td[3]//text()').extract_first(),
                'played': row.xpath('td[4]//text()').extract_first(),
                'starts': row.xpath('td[5]//text()').extract_first(),
                'minutes': row.xpath('td[6]//text()').extract_first(),
                'goals': row.xpath('td[7]//text()').extract_first(),
                'assists': row.xpath('td[8]//text()').extract_first(),
                'penalties': row.xpath('td[9]//text()').extract_first(),
            }

Now, for some reason, when I try to scrape the table below it (using the corresponding XPath selector), it returns nothing:

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard"]//tbody/tr'):
            yield {
                'player': row.xpath('td[2]//text()').extract_first(),
                'nation': row.xpath('td[3]//text()').extract_first(),
                'pos': row.xpath('td[4]//text()').extract_first(),
                'squad': row.xpath('td[5]//text()').extract_first(),
                'age': row.xpath('td[6]//text()').extract_first(),
                'born': row.xpath('td[7]//text()').extract_first(),
                '90s': row.xpath('td[8]//text()').extract_first(),
                'att': row.xpath('td[9]//text()').extract_first(),
            }

Here is the log from the terminal when I run scrapy crawl stats:

2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/robots.txt> (referer: None)
2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/en/comps/12/stats/La-Liga-Stats> (referer: None)
2020-07-23 17:35:34 [scrapy.core.engine] INFO: Closing spider (finished)

What could be causing this? As far as I can tell, the tables have the same structure.

Answer 1

The problem is that id="stats_standard" does not exist in the live HTML of the page source (view-source:https://fbref.com/en/comps/12/stats/La-Liga-Stats). The table is present there only as commented-out code.

Try response.css('.placeholder::text').getall(). You will need to parse the result with a regular expression, or you can use the Selector class from scrapy to parse it:

from scrapy import Selector

# your_raw_html: the commented-out HTML string extracted from the page
Selector(text=your_raw_html)

Solution 1
1 2020-07-23 09:54:19