Scrapy XPath iteration (works in the shell)

Question

I am trying to scrape some information from the UK Companies House website using Scrapy. I connected to the site through the Scrapy shell with the command

 scrapy shell https://beta.companieshouse.gov.uk/search?q=a

and with

response.xpath('//*[@id="results"]').extract()

I managed to get the results back.

I tried to put this into a program so I could export the results to CSV or JSON, but I am having trouble getting it to work. This is what I have:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "gov2"

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)

It is very simple, but I have tried a lot. Any insight would be appreciated!

Answer 1

These lines of code are the problem:

def start_requests(self):
    start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

The start_requests method should return an iterable of Request objects; yours returns None.

The default start_requests creates this iterable from the URLs specified in start_urls, so simply defining start_urls as a class variable (outside of any function) and not overriding start_requests will work as you want.

Answer 2

Try this:

import scrapy


class QuotesSpider(scrapy.Spider):

    name = "gov2"
    start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

    def parse(self, response):
        products = response.xpath('//*[@id="results"]').extract()
        print(products)

Answer 1: score 2, accepted, 2019-03-13 19:55:34

Answer 2: score 0, 2019-03-13 20:08:40
