Correcting an XPath in Scrapy

import scrapy
from scrapy.http import Request

class PushpaSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://ap-rk.com/advokat-panfilov-vladimir-vladimirovich-moskva-otzyvy-telefon-adres-chasy-raboty-foto/']

    def parse(self, response):
        website=response.xpath("//td//strong[contains(.,'Официальный сайт:')]/following-sibling::td/text()").get()
        yield{
            'website':website
        }

I am trying to extract the official website link, but this XPath returns nothing. This is the page: https://ap-rk.com/advokat-panfilov-vladimir-vladimirovich-moskva-otzyvy-telefon-adres-chasy-raboty-foto/

Nobody can stop you from using a complex solution like Scrapy when the same result can be achieved in 2 lines of code (well, 3):

import pandas as pd

dfs = pd.read_html('https://ap-rk.com/advokat-panfilov-vladimir-vladimirovich-moskva-otzyvy-telefon-adres-chasy-raboty-foto/')
dfs[0]

This returns:

   0                   1
0  NaN                 Телефон: +7 (495) 646-0697
1  Рабочий адрес:      Москва
2  NaN                 г. Москва, ул. Бутырский вал, дом № 68, офис №403
3  Специализация:      корпоративное право предпринимательское право
4  Об адвокате:        NaN
5  Информация:         Адвокатская палата: Республики Башкортостан Но...
6  Электронная почта:  order@chelovekizakon.ru
7  Официальный сайт:   https://chelovekizakon.ru
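
To pull just the website out of that table, one option is to filter on the label column. This is a minimal sketch, assuming the frame has the labels in column 0 and the values in column 1 as in the output above. A few rows are re-created by hand so the snippet runs without network access, and `lookup` is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

# Hand-built stand-in for dfs[0], copying a few rows from the output above.
df = pd.DataFrame(
    [
        ["Рабочий адрес:", "Москва"],
        ["Электронная почта:", "order@chelovekizakon.ru"],
        ["Официальный сайт:", "https://chelovekizakon.ru"],
    ]
)


def lookup(frame, label):
    """Return the value in column 1 for the first row whose column 0 equals label."""
    matches = frame.loc[frame[0] == label, 1]
    return matches.iloc[0] if not matches.empty else None


website = lookup(df, "Официальный сайт:")
print(website)
```

With the real page, `df` would be `dfs[0]` from the `pd.read_html` call instead of the hand-built frame.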

EDIT: as my attempt to suggest a less complex solution made the OP angry, here is the correct XPath to locate the URL in question:

"//td/strong[text()='Официальный сайт:']/parent::td//following-sibling::td"
