
XPath: how to format the path

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from a webpage:

from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print(HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0])

but I can't. No matter how I format the XPath, it doesn't work.

I used a combination of the requests and Beautiful Soup libraries. Both are wonderful, and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.

So for your specific example, I can do

from bs4 import BeautifulSoup
import requests

URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)

soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)

It will print out

/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

In the spirit of @pmuntima's answer, if you already know the image sits at index 14 of the sourced images (the 15th, since Python counts from zero), but want to stay with lxml, then you can:

print(HTMLn.xpath('//img/@data-src')[14])

To get that particular image. It similarly reports:

/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

If you want to do your indexing in XPath (possibly more efficient in very large result sets), note that XPath positions start at 1, so Python index 14 corresponds to XPath position 15:

print(HTMLn.xpath('(//img/@data-src)[15]')[0])

It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
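
To make the indexing rules concrete, here is a minimal sketch on made-up markup (the SAMPLE string and its file names are assumptions for illustration, not the real page). It contrasts 0-based Python indexing, 1-based XPath positions, and what happens when the parentheses are left out:

```python
from lxml import html

# Hypothetical markup standing in for the real page.
SAMPLE = """
<html><body>
  <div><img data-src="/a.jpg"><img data-src="/b.jpg"></div>
  <div><img data-src="/c.jpg"></div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Python list indexing is 0-based: index 2 is the third match.
all_srcs = doc.xpath('//img/@data-src')
third_by_python = all_srcs[2]

# XPath positions are 1-based. The parentheses make the predicate
# apply to the whole result set, so [3] also means the third match.
third_by_xpath = doc.xpath('(//img/@data-src)[3]')[0]

# Without parentheses the predicate is evaluated per parent element,
# so this selects the second <img> child of each parent that has one.
second_in_each_parent = doc.xpath('//img[2]/@data-src')

print(third_by_python)
print(third_by_xpath)
print(list(second_in_each_parent))
```

Both of the first two lookups return the same value, which is why a Python index and the matching XPath position always differ by one.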

Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.


Update: So why does the XPath given by the browser's inspect tools not lead to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request does not run JS, so no such updates happen. Different content needs a different address, at least when the address is as static and fragile as an absolute path.

Part of the update here seems to be taking src URIs, which initially point to an "I'm loading!" gif, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.

So you need two changes:

  1. a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
  2. to fetch the URIs you want from data-src, not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
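
As a rough sketch of the second point, the helper below prefers data-src and falls back to src when no data-src is present. The markup, the file names, and the image_url helper are all hypothetical; the real page may well structure its lazy-loaded images differently:

```python
from lxml import html

# Hypothetical markup: one lazy-loaded image and one ordinary image.
SAMPLE = """
<html><body>
  <img src="/loader.gif" data-src="/real-1.jpg" alt="first">
  <img src="/static.jpg" alt="second">
</body></html>
"""

def image_url(img):
    """Return data-src when present, otherwise fall back to src."""
    return img.get('data-src') or img.get('src')

doc = html.fromstring(SAMPLE)
urls = [image_url(img) for img in doc.xpath('//img')]
print(urls)
```

The fallback keeps the code working for images that are not lazy-loaded, without ever returning the placeholder gif for those that are.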

If you know text associated with the target image, that can be the trick. For example:

search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print(HTMLn.xpath(path)[0])

This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
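
One possible way to package this lookup, sketched on invented markup (SAMPLE, the product names, and find_image_by_alt are assumptions for illustration). It passes the phrase as an XPath variable, which lxml accepts as a keyword argument, so quotes in the phrase cannot break the expression, and it returns None instead of raising IndexError when nothing matches:

```python
from lxml import html

# Hypothetical markup standing in for the product listing.
SAMPLE = """
<html><body>
  <a class="product-icon"><img src="/loader.gif" data-src="/deck-1.jpg" alt="DECK ALPHA 8.0"></a>
  <a class="product-icon"><img src="/loader.gif" data-src="/deck-2.jpg" alt="DECK BETA POSTER 8.25"></a>
</body></html>
"""

def find_image_by_alt(doc, phrase):
    """Return data-src of the first <img> whose alt contains phrase, else None."""
    # $phrase is an XPath variable bound through the keyword argument.
    matches = doc.xpath('//img[contains(@alt, $phrase)]/@data-src', phrase=phrase)
    return matches[0] if matches else None

doc = html.fromstring(SAMPLE)
found = find_image_by_alt(doc, 'BETA POSTER')
missing = find_image_by_alt(doc, 'GAMMA')
print(found)
print(missing)
```

Note that contains() is case-sensitive, so the phrase has to match the casing used in the alt attribute.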


 