
Get xpath for nodes that form a string in HTML document

Problem: I want to find the XPath of the node(s) that form a text string in an HTML document. The language used is Python, with lxml to parse the document.

To illustrate the idea, consider the document:

<HTML>
<HEAD>
<TITLE>sample document</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<HR>
<a href="http://google.com">Goog</a>
<H1>This is one header</H1>
<H2>This is a another Header</H2>
<P>Travel from
<P>
<B>SFO to JFK</B>
<BR>
<B><I>on May 2, 2015 at 2:00 pm. For details go to confirm.com </I></B>
<HR>
<div style="color:#0000FF">
<h3>Traveler <b> name </b> is
<p> John Doe </p>
</div>
.....

Now, given the strings "SFO to JFK on May 2, 2015" and "Traveler name is John Doe", how can I get the XPath of the first node in the set of nodes that form the string? (If that is difficult, even the full set of nodes will do.)

Sample outputs:

"SFO to JFK on May 2, 2015" -> /html/body/p/p/b
"Traveler name is John Doe" -> /html/body/p/p/div/h3

As a follow-up: if we have a regex instead of the strings above, what would be the approach to solve the problem?

Note: In terms of the Python implementation, I was approaching the problem as in the snippet below.

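import lxml.html as lh
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`
from lxml import etree

# html_document holds the HTML source as a string
elem_tree = lh.parse(StringIO(html_document))
# _the_xpath_here is a placeholder for the expression being sought
xpath = etree.XPath(_the_xpath_here)
list_of_nodes = xpath(elem_tree)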

You can try this approach:

import lxml.html as lh
from lxml import etree

elem_tree = lh.parse("Q12.html")
input_string = ["SFO to JFK on May 2, 2015", "Traveler name is John Doe"]

for i in input_string:
    xpath = "//*[contains(normalize-space(.), '{0}') and not(.//*[contains(normalize-space(.), '{0}')])]/*"
    node = elem_tree.xpath(xpath.format(i))[0]

    print('{0} -> {1}'.format(i, elem_tree.getpath(node)))

    #Output:
    #SFO to JFK on May 2, 2015 -> /html/body/p[2]/b[1]
    #Traveler name is John Doe -> /html/body/div/h3

Brief explanation:

  • contains(normalize-space(.), '{0}') : keeps elements whose whitespace-normalized text contains the search text (one of the entries in input_string).

  • not(.//*[contains(normalize-space(.), '{0}')]) : keeps an element only if none of its descendants contains the text. In other words, it selects the innermost element containing the text.

  • getpath() : "returns a structural, absolute XPath expression to find the element."
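The three pieces above can be tried on a small document. The following is a minimal sketch, not the exact code from the answer: the inline HTML is a simplified, well-formed stand-in for the page in the question, and the search text is passed as an XPath variable ($s, bound through a keyword argument) instead of str.format, which sidesteps quoting problems when the text contains an apostrophe.

```python
import lxml.html as lh

# Simplified, well-formed stand-in for the page in the question (an assumption).
html_document = """
<html><body>
  <div>
    <p>Flight <b>SFO to JFK</b> on <i>May 2, 2015</i></p>
  </div>
  <p>Unrelated text</p>
</body></html>
"""

tree = lh.document_fromstring(html_document).getroottree()
target = "SFO to JFK on May 2, 2015"

# Innermost element whose normalized text contains the target:
# it matches, but none of its descendants does.
innermost = (
    "//*[contains(normalize-space(.), $s)"
    " and not(.//*[contains(normalize-space(.), $s)])]"
)

nodes = tree.xpath(innermost, s=target)
for node in nodes:
    print(node.tag, tree.getpath(node))

# Appending /* (as in the answer) steps down to the children of that element.
children = tree.xpath(innermost + "/*", s=target)
print([child.tag for child in children])
```

Here the string is spread over the text of the p element and its b and i children, so the predicate settles on the p element; html, body and div also contain the text but are rejected because a descendant contains it too.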

UPDATE:

Replace the trailing /* in the xpath variable string with this:

/descendant-or-self::*[contains('{0}', text()) or contains(text(), '{0}')]
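As an illustration of the replacement step, here is a hedged sketch on a simplified inline document (an assumption, not the page from the question). It again binds the search text to an XPath variable $s rather than using str.format. Note that in XPath 1.0, text() inside contains() is converted to the string value of the element's first text node only.

```python
import lxml.html as lh

# Simplified stand-in document (an assumption, not the original page).
html_document = """
<html><body>
  <div>
    <p>Flight <b>SFO to JFK</b> on <i>May 2, 2015</i></p>
  </div>
</body></html>
"""

tree = lh.document_fromstring(html_document).getroottree()
target = "SFO to JFK on May 2, 2015"

# Innermost container of the text, then every element at or below it
# whose first text node is part of the target (or contains the target).
xpath = (
    "//*[contains(normalize-space(.), $s)"
    " and not(.//*[contains(normalize-space(.), $s)])]"
    "/descendant-or-self::*[contains($s, text()) or contains(text(), $s)]"
)

matches = tree.xpath(xpath, s=target)
first_path = tree.getpath(matches[0])
print(first_path)
print([m.tag for m in matches])
```

The p element itself is dropped because its first text node ("Flight ") is not part of the target, leaving the b and i elements in document order. One caveat of this expression: an element with no text node at all yields an empty string for text(), and contains($s, '') is true, so empty elements such as br would also pass the filter.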

The updated expression worked for the HTML structure posted in the question as well as the one you linked in the comment below. However, handling general cases whose structure differs from these two samples is beyond the scope of the XPath query in this answer.
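For the regex follow-up in the question, one possibility (not covered by the answer above, so treat it as a sketch) is lxml's support for the EXSLT regular-expression functions: re:test() can take the place of contains() in the same "innermost element" predicate. The inline document and the pattern below are illustrative assumptions.

```python
import lxml.html as lh

# Namespace of the EXSLT regular-expression functions supported by lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Simplified stand-in document (an assumption, not the original page).
html_document = """
<html><body>
  <div>
    <p>Flight <b>SFO to JFK</b> on <i>May 2, 2015</i></p>
  </div>
</body></html>
"""

tree = lh.document_fromstring(html_document).getroottree()

# Hypothetical pattern: any "<Month> <day>, <year>" date after the route.
pattern = r"SFO to JFK on \w+ \d{1,2}, \d{4}"

# Same innermost-element idea, with re:test() instead of contains().
xpath = (
    "//*[re:test(normalize-space(.), $pat)"
    " and not(.//*[re:test(normalize-space(.), $pat)])]"
)

matches = tree.xpath(xpath, namespaces=EXSLT_RE, pat=pattern)
for node in matches:
    print(node.tag, tree.getpath(node))
```

The descendant-or-self refinement from the UPDATE relies on substring checks against the literal search text, so it does not carry over to a pattern directly; picking the first contributing node for a regex match would need extra logic in Python.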
