Unable to get data from all pages in Scrapy

Question

I am unable to fetch all pages using the code below. It only returns data up to page 90 and then raises an attribute error. I am using the next-button URL to move to the next page, but after page 90 it gives the error shown below.

Running this code:

import scrapy
import re

class PaginationSpider(scrapy.Spider):
    name = 'pagination'
    allowed_domains = ['www.farfetch.com']
    start_urls = ['https://www.farfetch.com/de/shopping/men/shoes-2/items.aspx?page=1']

    total_pages_pattern = r'"totalPages":(\d+)'
    current_page_pattern = r"page=(\d+)"

    def parse(self, response):
        
        number_of_pages= int(re.search(self.total_pages_pattern, str(response.body)).group(1))
        current_page = int(re.search(self.current_page_pattern, response.url).group(1))
        
        for brand in response.xpath("//h3[@itemprop='brand']//text()"):

            yield {
                "brand":brand.get()
            }

        if current_page <= number_of_pages:

            next_page = "https://www.farfetch.com/de/shopping/men/shoes-2/items.aspx?page=" + str(current_page+1)
            
            print("Current_page:" + str(current_page))

            yield response.follow(url=next_page, callback=self.parse)

Error: [error image]

Answer 1

    current_page = int(re.search(self.current_page_pattern, response.url).group(1))

The re.search() function returns a Match object if the pattern matches the string. If there is no match, it returns None. So, when the pattern doesn't match, you are calling .group(1) on None.

That's why you are getting an AttributeError.

I didn't execute your code, but you can probably solve it by adding an if statement that checks the result of re.search() before calling .group(1).
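To see the failure in isolation, here is a minimal sketch that uses only the standard library. The URL without a `page=` parameter is a made-up input chosen to force a non-match; which request actually fails in the spider was not verified here.

```python
import re

# Same pattern as the spider's current_page_pattern.
pattern = r"page=(\d+)"

# Hypothetical URL with no "page=" parameter, used only to force a non-match.
url_without_page = "https://www.farfetch.com/de/shopping/men/shoes-2/items.aspx"

match = re.search(pattern, url_without_page)
print(match)  # None, because the pattern did not match

try:
    match.group(1)  # same call as in the spider, but on None
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)  # 'NoneType' object has no attribute 'group'
```

The same thing happens with the `total_pages_pattern` search if the response body does not contain `"totalPages":`.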
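A sketch of what that check could look like, written as a standalone helper so it can be tried without running the spider. The name `next_page_url` and the sample inputs are illustrative assumptions, not part of the original code. It also compares with `<` rather than `<=`, on the assumption that requesting page `totalPages + 1` is not intended.

```python
import re

TOTAL_PAGES_PATTERN = r'"totalPages":(\d+)'
CURRENT_PAGE_PATTERN = r"page=(\d+)"
BASE_URL = "https://www.farfetch.com/de/shopping/men/shoes-2/items.aspx?page="


def next_page_url(url, body_text):
    """Return the URL of the next page, or None when pagination should stop.

    Hypothetical helper: `url` is the response URL, `body_text` the page source.
    """
    total_match = re.search(TOTAL_PAGES_PATTERN, body_text)
    current_match = re.search(CURRENT_PAGE_PATTERN, url)

    # Either search may return None, so check before calling .group().
    if total_match is None or current_match is None:
        return None

    total_pages = int(total_match.group(1))
    current_page = int(current_match.group(1))

    # Strict comparison: do not request a page past the last one.
    if current_page < total_pages:
        return BASE_URL + str(current_page + 1)
    return None


# Made-up inputs for illustration only.
print(next_page_url(BASE_URL + "1", '{"totalPages":90}'))
print(next_page_url(BASE_URL + "90", '{"totalPages":90}'))
```

Inside `parse`, this could be called as `next_url = next_page_url(response.url, response.text)`, followed by `yield response.follow(next_url, callback=self.parse)` only when `next_url` is not None.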

Solution 1: 2020-10-16 15:04:27
