Why does this xpath fail using lxml in python?

Here is an example web page I am trying to get data from: http://www.makospearguns.com/product-p/mcffgb.htm

The XPath was taken from Chrome developer tools, and FirePath in Firefox is also able to find it, but using lxml it just returns an empty list for 'text'.

from lxml import html
import requests

site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

page = requests.get(site_url)
tree = html.fromstring(page.text) 
text = tree.xpath(xpath)

Printing out the tree text with

print(tree.text_content().encode('utf-8'))

shows that the data is there, but it seems the XPath isn't working to find it. Is there something I am missing? Most other sites I have tried work fine using lxml and the XPath taken from Chrome dev tools, but for a few I have found it gives empty lists.

1. Browsers frequently change the HTML

Browsers quite frequently change the HTML served to them to make it "valid". For example, if you serve a browser this invalid HTML:

<table>
  <p>bad paragraph</p>
  <tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>

To render it, the browser is helpful and tries to make it valid HTML, and may convert this to:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

The above is changed because paragraphs (<p>) cannot be inside a <table>, and <tbody> elements are recommended. Which changes are applied to the source can vary wildly by browser. Some will put invalid elements before tables, some after, some inside cells, etc.
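As a small illustration of why this matters for lxml, the sketch below parses a made-up one-cell table (the HTML string and variable names are assumptions, not taken from the page in question). lxml parses the markup as served and does not add the <tbody> a browser would, so a path that names tbody finds nothing while the same path without it matches.

```python
from lxml import html

# Raw HTML as a server might send it: <tr> is a direct child of <table>.
raw = "<html><body><table><tr><td>cell text</td></tr></table></body></html>"
tree = html.fromstring(raw)

# Browser-style path that assumes the <tbody> a browser would insert.
with_tbody = tree.xpath("//table/tbody/tr/td/text()")

# Path written against the raw source, without <tbody>.
without_tbody = tree.xpath("//table/tr/td/text()")

print(with_tbody)     # the tbody step has nothing to match in the raw source
print(without_tbody)  # the cell text is found
```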

2. XPaths aren't fixed, they are flexible in pointing to elements.

Using this 'fixed' HTML:

<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>

If we try to target the text of the <td> cell, all of the following will give you approximately the right information:

//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()

And the list goes on...

However, in general a browser will give you the most precise (and least flexible) XPath, one that lists every element from the DOM. In this case:

/table[1]/tbody[1]/tr[1]/td[1]/text()
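To see this flexibility in practice, here is a minimal sketch that evaluates several paths against the 'fixed' HTML with lxml. The list of candidate paths and the variable names are illustrative choices. It also checks one detail worth knowing: XPath positions start at 1, so a [0] predicate never matches anything.

```python
from lxml import html

# The browser-"fixed" HTML from above, wrapped in a full document.
fixed = """<html><body>
<p>bad paragraph</p>
<table>
  <tbody>
    <tr>
      <td>Note that cells and rows can be unclosed (and valid) in HTML</td>
    </tr>
  </tbody>
</table>
</body></html>"""
tree = html.fromstring(fixed)

# Paths of increasing strictness that all point at the same cell text.
candidates = [
    "//td/text()",
    "//tr/td/text()",
    "//tbody/tr/td/text()",
    "//table[1]/tbody[1]/tr[1]/td[1]/text()",
]
results = {xp: tree.xpath(xp) for xp in candidates}
for xp, found in results.items():
    print(xp, "->", found)

# Positions in XPath are 1-based, so index 0 selects nothing.
zero_indexed = tree.xpath("//table[0]/tbody[0]/tr[0]/td[0]/text()")
print(zero_indexed)
```

The looser a path is, the better it survives structural differences between the raw source and the browser's DOM, at the cost of possibly matching more nodes than intended on a larger page.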

3. Conclusion: Browser-given XPaths are usually unhelpful

This is why the XPaths produced by developer tools will frequently be wrong when you try to use them against the raw HTML.

The solution: always refer to the raw HTML and use a flexible but precise XPath.

Examine the actual HTML that holds the price:

<table border="0" cellspacing="0" cellpadding="0">
    <tr>
        <td>
            <font class="pricecolor colors_productprice">
                <div class="product_productprice">
                    <b>
                        <font class="text colors_text">Price:</font>
                        <span itemprop="price">$149.95</span>
                    </b>
                </div>
            </font>
            <br/>
            <input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
        </td>
    </tr>
</table>

If you want the price, there is actually only one place to look!

//span[@itemprop="price"]/text()

And this will return:

$149.95
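A minimal sketch of that lookup with lxml, run against a shortened copy of the snippet above instead of the live page, so it does not depend on the site still serving the same markup. The variable names are illustrative.

```python
from lxml import html

# Shortened copy of the price markup shown above.
snippet = """<html><body>
<table border="0" cellspacing="0" cellpadding="0">
    <tr>
        <td>
            <font class="pricecolor colors_productprice">
                <div class="product_productprice">
                    <b>
                        <font class="text colors_text">Price:</font>
                        <span itemprop="price">$149.95</span>
                    </b>
                </div>
            </font>
        </td>
    </tr>
</table>
</body></html>"""

tree = html.fromstring(snippet)

# Anchor on the itemprop attribute instead of the table nesting,
# so the result does not depend on whether <tbody> is present.
prices = tree.xpath('//span[@itemprop="price"]/text()')
print(prices)
```

For the live page, the same XPath would be applied to the tree built from requests.get(site_url).text, as in the question's code.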

The xpath is simply wrong

Here is a snippet from the page:

<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
      <img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
      <table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
        <tr>
          <td colspan="2" class="vCSS_breadcrumb_td"><b>
&nbsp; 
<a href="http://www.makospearguns.com/">Home</a> > 

You can see that the element with the id "v65-product-parent" is of type table and has a subelement tr.

There can be only one element with such an id (otherwise it would be broken XML).

The XPath expects tbody as a child of the given element (table), and there is none in the whole page.

This can be tested by

>>> "tbody" in page.text
False

How did Chrome come to that XPath?

If you simply download this page by

$ wget http://www.makospearguns.com/product-p/mcffgb.htm

and review its content, you will see it does not contain a single element named tbody.

But if you use Chrome Developer Tools, you find some.

How do they get there?

This often happens if JavaScript comes into play and generates some page content in the browser. But as LegoStormtroopr noted, this is not our case; this time it is the browser itself that modifies the document to make it correct.

How to get the content of a page dynamically modified within the browser?

You have to give some sort of browser a chance. E.g. if you use selenium, you would get it.

byselenium.py

from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

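# Let a real browser load the page so its HTML corrections are applied
browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source  # the DOM as serialized by the browser
browser.quit()
print("test tbody", "tbody" in html_source)

# Parse the browser-corrected HTML, where the tbody elements now exist
tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)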

which prints

$ python byselenium.py
test tbody True
['$149.95']

Conclusions

Selenium is great when it comes to changes made within the browser. However, it is a rather heavy tool, and if you can do it a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution, working on the simply fetched web page.

I had a similar issue (Chrome inserting tbody elements when you do Copy as XPath). As others answered, you have to look at the actual page source, though the browser-given XPath is a good place to start. I've found that removing tbody tags often fixes it, and to test this I wrote a small Python utility script to test XPaths:

#!/usr/bin/env python
import sys, requests
from lxml import html
if len(sys.argv) < 3:
    print 'Usage: ' + sys.argv[0] + ' url xpath'
    sys.exit(1)
else:
    url = sys.argv[1]
    xp = sys.argv[2]

# Fetch the raw page (no browser corrections) and evaluate the XPath on it
page = requests.get(url)
tree = html.fromstring(page.text)
nodes = tree.xpath(xp)

if len(nodes) == 0:
    print 'XPath did not match any nodes'
else:
    # tree.xpath(xp) produces a list, so always just take first item
    print nodes[0].text_content().encode('ascii', 'ignore')

(That's Python 2.7, in case the non-function "print" didn't give it away.)
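The "remove the tbody tags" step can also be automated. Below is a rough Python 3 sketch of a helper, hypothetically named strip_tbody, that drops every tbody step from a browser-copied XPath string. It is a heuristic only: it assumes the raw source contains no real tbody elements, and the example path is a shortened version of the one from the question.

```python
import re


def strip_tbody(xpath):
    """Remove every /tbody step (with an optional [n] index) from an XPath string."""
    return re.sub(r"/tbody(?:\[\d+\])?(?=/|$)", "", xpath)


# Shortened version of the Chrome-generated path from the question.
browser_xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td'
cleaned = strip_tbody(browser_xpath)
print(cleaned)
```

The cleaned path can then be passed to tree.xpath() on the raw page. If the page source does contain genuine tbody elements, the original path should be tried first.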
