
Scrapy returning None on querying by xpath

Hi, so I am using Scrapy to scrape the website https://www.centralbankofindia.co.in. I am getting a response, but when I try to extract the address by XPath I get None.

    start_urls = [
        "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={}".format(i)
        for i in range(0, 5)
    ]
    brand_name = "Central Bank of India"
    spider_type = "chain"

    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[1]/td[2]/div/span[2]
    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[2]/td[2]/div/span[2]
    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[3]/td[2]/div/span[2]
    def parse(self, response, **kwargs):
        """Parse response."""
        # print(response.text)
        for id in range(1, 11):
            address = self.get_text(
                response,
                f'//*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[{id}]/td[2]/div/span[2]',
            )
            print(address)

    def get_text(self, response, path):
        return response.xpath(path).extract_first()

The span element for the address on the website doesn't have a unique id; is that what is causing the problem?

I think you created too complex an XPath. You should skip some of the elements and use // instead.

Some browsers may show tbody in DevTools, but it may not exist in the HTML that Scrapy gets from the server, so it is better to always skip it.

And you can use extract() to get all rows at once instead of looping over tr[{id}] with extract_first().

This XPath works for me.

all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()

for address in all_items:
    print(address)

BTW: I used text() in the XPath to get the address without HTML tags.


Full working code.

You can put it all in one file and run it as python script.py without creating a project.

It saves the results in output.csv.

In start_urls I set only the link to the first page, because parse() searches the HTML for the link to the next page — this way it can get all pages instead of just range(0, 5).

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):
    
    start_urls = [
        # f"https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={i}"
        # for i in range(0, 5)
        
        # only first page - links to other pages it will find in HTML
        "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page=0"
    ]
    
    name = "Central Bank of India"
    
    def parse(self, response):
        print(f'url: {response.url}')
        
        all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()
        
        for address in all_items:
            print(address)
            yield {'address': address}

        # get link to next page
        
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        
        if next_page:
            print(f'Next Page: {next_page}')
            yield response.follow(next_page)
            
# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(MySpider)
c.start()
