I am trying to scrape some information from the UK Companies House website using Scrapy. I connected to the site through the Scrapy shell with the command

    scrapy shell https://beta.companieshouse.gov.uk/search?q=a

and with

    response.xpath('//*[@id="results"]').extract()

I managed to get the results back.

I then tried to put this into a spider so that I could export the results to CSV or JSON, but I am having trouble getting it to work. This is what I have so far:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "gov2"

        def start_requests(self):
            start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

        def parse(self, response):
            products = response.xpath('//*[@id="results"]').extract()
            print(products)

It looks very simple, but I have tried a lot of variations without success. Any insight would be appreciated!

These lines of code are the problem:

    def start_requests(self):
        start_urls = ['https://beta.companieshouse.gov.uk/search?q=a']

The start_requests method should return an iterable of Requests; yours returns None.

The default start_requests creates this iterable from the URLs specified in start_urls, so simply defining that as a class variable (outside of any function) and not overriding start_requests will work as you want.

Try this instead:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "gov2"
        # Class-level attribute: the default start_requests builds Requests from it.
        start_urls = ["https://beta.companieshouse.gov.uk/search?q=a"]

        def parse(self, response):
            products = response.xpath('//*[@id="results"]').extract()
            print(products)