
XPath from Chrome results in an empty list in scrapy

I'm inspecting a page with Chrome DevTools and have the XPath of an element on the page. I disable javascript deliberately so DOM doesn't get changed. However, the XPath Chrome gives for the element results in [] in scrapy, although the element, of course, exists. What might be the problem?

In particular, the XPath //*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div/table/tbody/tr[2]/td[1]/span for this page: http://cheaptool.ru/product/sadovyj-pylesos-billy-goat-lb351/ (the element is the price, 29 990).

$ scrapy shell 'http://cheaptool.ru/product/sadovyj-pylesos-billy-goat-lb351'

In [2]: xp1 = '//*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div/table/tbody/tr[2]/td[1]/span'

In [3]: aaa = response.xpath(xp1)

In [4]: aaa
Out[4]: []

UPDATE: It turned out that the resulting HTML has no tbody. Why did Chrome show it in the XPath? How can I get an XPath that matches the real HTML?

"I disable javascript deliberately so DOM doesn't get changed"

Besides JavaScript, the DOM can also change because browsers usually have algorithms that repair the HTML source so that it can be rendered reasonably well.

"@user3616725, the question is not what to use, but why doesn't it work"

A common case is the one you discovered while I was writing this answer: Chrome adds the <tbody> tag automatically. See the following discussion for an explanation of this behavior:

Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?

"It turned out that the resulting HTML has no tbody. Why did Chrome show it in the XPath? How can I get an XPath that matches the real HTML?"

The HTML as rendered by Chrome does have <tbody>, which is why Chrome shows it in the XPath. Chrome DevTools works against the final DOM, which may differ from the actual HTML source, so you can't rely on an XPath copied from Chrome working unchanged in Scrapy.
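One common workaround is to drop the tbody steps from the copied expression before handing it to Scrapy. Below is a minimal sketch, assuming the raw page source really contains no <tbody> tags. The helper name strip_tbody is made up for illustration and is not part of Scrapy.

```python
import re

def strip_tbody(xpath):
    """Remove `tbody` location steps (with an optional numeric predicate).

    Hypothetical helper, not a Scrapy API. Only appropriate when the raw
    HTML contains no <tbody> elements.
    """
    # Matches "/tbody" or "/tbody[1]" when followed by "/" or the end of the string.
    return re.sub(r'/tbody(\[\d+\])?(?=/|$)', '', xpath)

chrome_xpath = ('//*[@id="prddeatailed_container"]/table[1]/tbody/tr[1]/td/div'
                '/table/tbody/tr[2]/td[1]/span')
scrapy_xpath = strip_tbody(chrome_xpath)
print(scrapy_xpath)
```

In the scrapy shell, response.xpath(strip_tbody(xp1)) would then be the expression to try. Whether it matches still depends on the rest of the path agreeing with the raw source.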

As for tbody: a lot of HTML in the wild leaves tbody out of its tables, and Chrome usually fixes this by adding tbody automatically. If you print the response HTML, you won't find any tbody.
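To check this without reading the source by eye, you can count the tags as they literally appear in the HTML. The sketch below uses Python's standard html.parser, which tokenizes the markup without repairing it. The names TagCounter and count_tag and the sample fragment are made up for illustration.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts start tags exactly as they appear in the source (no DOM repair)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.counts[tag] = self.counts.get(tag, 0) + 1

def count_tag(html, tag):
    """Return how many times <tag> opens in the given HTML string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)

# Made-up fragment standing in for the real page source.
sample = '<table><tr><td><span>29 990</span></td></tr></table>'
print(count_tag(sample, 'tbody'))  # 0
print(count_tag(sample, 'tr'))     # 1
```

In the scrapy shell, the same check would be count_tag(response.text, 'tbody'). A result of 0 suggests that any tbody step in a copied XPath came from the browser rather than from the server.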
