
Selecting elements from a Scrapy Spider response

Could anybody help me figure out how to extract just the links from this page scraped using Scrapy?

I have amended the spider code as follows, but am struggling to figure out how to use the Scrapy selectors to yield only the links I want.

import scrapy

class RMWSpider(scrapy.Spider):
    name = "RMW"

    def start_requests(self):
        urls = [
            'http://search.people.com.cn/cnpeople/search.do?pageNum=1&keyword=%C8%F0%B5%E4&siteName=news&facetFlag=true&nodeType=belongsId&nodeId=0'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        links = response.css("ul").getall()
        for link in links:
            yield {
                'link': link.css('a')
            }

Ideally, I want a .json file with a list of links of the search results. Any more general tips on how to understand the use of selectors in Scrapy would also be really helpful.

Would appreciate any help anyone can offer as always. Thanks!

I think this is what you need:

URL_SELECTOR = "a::attr(href)"
urls = your_response.css(URL_SELECTOR).extract()  # .extract() is the older alias of .getall()

You should definitely look at the Scrapy documentation; it has a whole section on selectors: Scrapy selectors

What I found very useful at the beginning is the Scrapy shell (Scrapy shell doc), where you can test commands interactively and see the outputs :)

Hope this solves your problem.
