Scrapy returning None on querying by XPath
Hi, I am using Scrapy to scrape the website https://www.centralbankofindia.co.in. I am getting a response, but when I try to find the address by XPath I get None.
    start_urls = [
        "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={}".format(
            i
        )
        for i in range(0, 5)
    ]
    brand_name = "Central Bank of India"
    spider_type = "chain"

    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[1]/td[2]/div/span[2]
    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[2]/td[2]/div/span[2]
    # //*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[3]/td[2]/div/span[2]

    def parse(self, response, **kwargs):
        """Parse response."""
        # print(response.text)
        for id in range(1, 11):
            address = self.get_text(
                response,
                f'//*[@id="block-cbi-content"]/div/div/div/div[3]/div/table/tbody/tr[{id}]/td[2]/div/span[2]',
            )
            print(address)

    def get_text(self, response, path):
        sol = response.xpath(path).extract_first()
        return sol
The span class for the address on the website doesn't have a unique id. Is that what is causing the problem?
I think you created too complex an XPath expression. You should skip some elements and use // instead.
Some browsers may show tbody in DevTools, but it may not exist in the HTML which scrapy gets from the server, so it is better to always skip it.
And you could use extract() to get all matches at once, instead of looping over tr[{id}] and calling extract_first() for each row.
This XPath works for me:

    all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()
    for address in all_items:
        print(address)
BTW: I used text() in the XPath to get the address without HTML tags.
Full working code. You can put it all in one file and run it as python script.py, without creating a project.
It saves the results in output.csv.
In start_urls I set only the link to the first page, because parse() searches for the link to the next page in the HTML. This way it can get all pages instead of only range(0, 5).
    #!/usr/bin/env python3

    import scrapy


    class MySpider(scrapy.Spider):

        start_urls = [
            # f"https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page={i}"
            # for i in range(0, 5)
            # only first page - links to other pages it will find in HTML
            "https://www.centralbankofindia.co.in/en/branch-locator?field_state_target_id=All&combine=&page=0"
        ]

        name = "Central Bank of India"

        def parse(self, response):
            print(f'url: {response.url}')

            all_items = response.xpath('//*[@id="block-cbi-content"]//td[2]//span[2]/text()').extract()
            for address in all_items:
                print(address)
                yield {'address': address}

            # get link to next page
            next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
            if next_page:
                print(f'Next Page: {next_page}')
                yield response.follow(next_page)


    # --- run without project and save in `output.csv` ---

    from scrapy.crawler import CrawlerProcess

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        # save in file CSV, JSON or XML
        'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
    })
    c.crawl(MySpider)
    c.start()