
Selecting elements from a Scrapy Spider response

Could anybody help me figure out how to extract just the links from this page scraped using Scrapy?

I have amended the spider code as follows, but am struggling to figure out how to use the Scrapy selectors to yield only the links I want.

import scrapy

class RMWSpider(scrapy.Spider):
    name = "RMW"

    def start_requests(self):
        urls = [
            'http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        links = response.css("ul").getall()
        for link in links:
            yield {
                'link': link.css('a')
            }

Ideally, I want a .json file with a list of links of the search results. Any more general tips on how to understand the use of selectors in Scrapy would also be really helpful.

Would appreciate any help anyone can offer as always. Thanks!

I think this is what you need:

URL_SELECTOR = "a::attr(href)"
urls = your_response.css(URL_SELECTOR).extract()  # .extract() is the older alias of .getall()

You should definitely look at the Scrapy documentation; it has a whole section on selectors: Scrapy selectors

What I found very useful at the beginning is the Scrapy shell (Scrapy shell doc), where you can test commands interactively and see the outputs :)

Hope this solves your problem.
