
Get minimal XPath of an element

I'm trying to write a function that returns the XPath of an element. Unfortunately, it returns an absolute XPath, which is not good enough.

I want the XPath to be as short as possible (or rather "smarter", not necessarily minimal). For example, if the element has an id, the XPath should be based on that id. If its parent has an id, the function should return the parent's id-based XPath concatenated with /child.

I want to reuse this XPath many times, and an absolute XPath is very fragile when the page changes.

Is this possible with the lxml module or another module?

For example, the XPath Helper Wizard extension does this better.

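import requests
from lxml import etree

def _load_root(url):
    # Parse the raw bytes and let the HTML parser detect the encoding.
    r = requests.get(url)
    return etree.fromstring(r.content, etree.HTMLParser())

def get_xpath_by_text(text, url):
    root = _load_root(url)
    # xpath() returns a list of matching elements, not a single element.
    matches = root.xpath('.//*[contains(text(),"{}")]'.format(text))
    # getpath() is a method of the ElementTree, not of an element.
    tree = root.getroottree()
    for e in matches:
        print(tree.getpath(e))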

This prints an absolute path such as:

/html/body/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]/div[2]/div[2]/div[1]/div/div[1]/div[2]/div[2]/div[2]/div[1]/div[1]/table/tr[6]/td[2]/div[1]

Do you know how to do that?

You are asking for two contradictory things, as far as I can see: a minimal XPath, and an XPath that is stable against changes to the document.

The minimal XPath for an element is typically something like (//*)[134] , but this is very sensitive to document changes.
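To make the fragility concrete, here is a minimal sketch of how such a positional XPath could be computed with lxml. The function name positional_xpath is hypothetical, and the sketch assumes the target element belongs to the same tree as root.

```python
from lxml import etree

def positional_xpath(root, target):
    """Return an XPath of the form (//*)[n] for target (hypothetical helper)."""
    # //* selects every element of the document in document order.
    elements = root.xpath("//*")
    for index, element in enumerate(elements, start=1):
        if element is target:
            # XPath positions are 1-based.
            return "(//*)[{}]".format(index)
    raise ValueError("target is not part of the tree")
```

Any element inserted or removed earlier in the document shifts n, which is why this form breaks so easily when the page changes.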

You can get an XPath relative to the nearest ancestor that has an id attribute using a recursive algorithm like:

function minimalXPath(Node node) {
  if (exists(node/@id))
    then "id('" + node/@id + "')"
  else if (node is root)
    then ""
  else minimalXPath(node.getParent()) + "/" + node.getName() +
    "[" + node.getSiblingPosition() + "]"
}
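The pseudocode above could be sketched in Python with lxml roughly as follows. This is one possible reading, not a definitive implementation. It makes several assumptions: it emits //*[@id="..."] instead of id(...), because the id() function depends on the parser knowing which attributes are IDs; id values contain no double quotes; element names carry no namespaces (as is the case for documents parsed with etree.HTMLParser); and the position is counted among siblings with the same name. The function name minimal_xpath is hypothetical.

```python
from lxml import etree

def minimal_xpath(node):
    """Return an XPath anchored at the nearest ancestor-or-self that has an id."""
    node_id = node.get("id")
    if node_id is not None:
        # Assumption: the id value contains no double quote.
        return '//*[@id="{}"]'.format(node_id)
    parent = node.getparent()
    if parent is None:
        # Reached the root element without finding an id.
        return "/" + node.tag
    # 1-based position among preceding siblings that share this tag name.
    position = 1 + sum(1 for _ in node.itersiblings(tag=node.tag, preceding=True))
    return "{}/{}[{}]".format(minimal_xpath(parent), node.tag, position)

# Usage on a small in-memory document.
doc = etree.fromstring(
    '<html><body><div id="main"><table><tr><td>a</td><td>b</td></tr></table></div>'
    '<p>x</p></body></html>'
)
second_td = doc.xpath("//td")[1]
print(minimal_xpath(second_td))  # //*[@id="main"]/table[1]/tr[1]/td[2]
```

The result is only as stable as the id it is anchored to and the structure below that anchor. If ids are generated dynamically or repeated on the page, the path can still break or match the wrong element.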
