How to find text across HTML tag boundaries (with XPath pointers as result)?

I have HTML like this:

<div>Lorem ipsum <b>dolor sit</b> amet.</div>

How can I find a plain-text match for my search string ipsum dolor in this HTML? I need XPath pointers to the start and end nodes of the match, plus character indexes pointing inside those nodes. I use Nokogiri to work with the DOM, but any Ruby solution is fine.

Difficulties:

  • I can't just walk the DOM with node.traverse {|node| … } and run a plain text search on each text node I come across, because my search string can cross tag boundaries.

  • I can't do a plain text search after converting the HTML to plain text, because I need the XPath pointers as the result.
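Both difficulties can be seen in a small plain-Ruby sketch. The array below is typed in by hand to stand for the three text nodes of the example HTML; the variable names are made up for illustration.

```ruby
# The three text nodes of the example HTML, in document order (typed in by hand).
text_nodes = ['Lorem ipsum ', 'dolor sit', ' amet.']
quote = 'ipsum dolor'

# Searching each text node separately misses the match ...
found_in_single_node = text_nodes.any? { |text| text.include?(quote) }

# ... while the concatenated plain text contains it, but carries no node information.
found_in_plain_text = text_nodes.join.include?(quote)

puts found_in_single_node  # false
puts found_in_plain_text   # true
```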

I could implement it myself with basic tree traversal, but before I do, I'm asking whether there is a Nokogiri function or trick that makes this easier.

You can do the following:

doc.search('div').find{|div| div.text[/ipsum dolor/]}

In the end, we used the code below. It is shown for the example from the question, but it also works in the general case of HTML tags nested to arbitrary depth, which is what we need.

In addition, we implemented it so that it tolerates runs of excess (≥2) whitespace characters. That is why we have to search for the end of the match and can't simply add the length of the search string (the quote) to the start position: the number of whitespace characters in the search string and in the matched text may differ.
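To illustrate the whitespace point, here is a minimal plain-Ruby sketch. The sample text with its extra whitespace is invented for this example; the regex is built the same way as quote_query in the code further down.

```ruby
# Sample text with extra whitespace between the two words (invented for this sketch).
text  = "Lorem ipsum  \n dolor sit amet."
quote = 'ipsum dolor'

# Let any run of whitespace stand in for each gap between the words of the quote.
quote_query = quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
match = text.match(/#{quote_query}/i)

start_index = match.begin(0)
end_index   = match.end(0) # exclusive end of the text that actually matched

matched_length = end_index - start_index

puts quote.size      # 11
puts matched_length  # 14
```

Adding quote.size to start_index would end the match three characters too early here, which is why the end position is taken from the match itself.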

require 'nokogiri'

doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'


# Find search string in document text, "plain text in plain text".

quote_query = 
  quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
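quote_regex = /#{quote_query}/i
start_index = doc.text.index(quote_regex)
# Without a match, start_index is nil and the arithmetic below would fail.
raise ArgumentError, 'quote not found in document text' unless start_index
end_index = start_index + doc.text[quote_regex].size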


# Find XPath values and character indexes for start and stop of search match.
# For that, walk through all text nodes and count characters until reaching 
# the start and end positions of the search match.

start_xpath, start_offset, end_xpath, end_offset = nil
i = 0

doc.xpath('.//text() | text()').each do |x|
  offset = 0
  x.text.split('').each do
    if i == start_index
      e = x.previous
      sum = 0
      while e
        sum+= e.text.size
        e = e.previous
      end
      start_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      start_offset = offset+sum
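    end
    # Separate `if` rather than `elsif`: a one-character match starts and
    # ends on the same character, so both branches must be able to run.
    if i+1 == end_index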
      e = x.previous
      sum = 0
      while e
        sum+= e.text.size
        e = e.previous
      end
      end_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      end_offset = offset+1+sum
    end
    offset+=1
    i+=1
  end
end

At this point, we have the desired XPath values for the start and end of the search match, plus character offsets pointing to the exact characters inside the elements those XPaths designate. We get:

puts start_xpath
  /div
puts start_offset
  6
puts end_xpath
  /div/b
puts end_offset
  5
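The character-counting walk can also be sketched without Nokogiri, which makes the index arithmetic easier to check. This is only an illustration under assumptions: the [path, text] pairs are typed in by hand (with Nokogiri they would come from each text node's path and text), locate is a hypothetical helper, and the offsets are relative to the text node itself rather than to its parent element as in the code above. For this example the numbers happen to coincide.

```ruby
# Text nodes in document order as [path, text] pairs (illustrative, typed in by hand).
TEXT_NODES = [
  ['/div/text()[1]', 'Lorem ipsum '],
  ['/div/b/text()',  'dolor sit'],
  ['/div/text()[2]', ' amet.']
].freeze

# Hypothetical helper: map an index into the concatenated plain text to the
# text node containing that character and the offset inside that node.
def locate(nodes, index)
  consumed = 0
  nodes.each do |path, text|
    return [path, index - consumed] if index < consumed + text.size
    consumed += text.size
  end
  nil # index lies beyond the end of the text
end

plain = TEXT_NODES.map { |_, text| text }.join
match = plain.match(/ipsum[[:space:]]+dolor/i)

start_path, start_offset = locate(TEXT_NODES, match.begin(0))
# Locate the last matched character, then add 1 to get an exclusive end offset.
end_path, last_offset = locate(TEXT_NODES, match.end(0) - 1)
end_offset = last_offset + 1

puts start_path    # /div/text()[1]
puts start_offset  # 6
puts end_path      # /div/b/text()
puts end_offset    # 5
```

Working from cumulative node lengths avoids the per-character loop; the trade-off is that the sibling-sum step from the original code would still be needed to express offsets relative to the parent element.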
