How to find text across HTML tag boundaries (with XPath pointers as result)?

Question

I have HTML like this:

<div>Lorem ipsum <b>dolor sit</b> amet.</div>

How can I find a plain text based match for my search string ipsum dolor in this HTML? I need the start and end XPath node pointers for the match, plus character indexes to point inside these start and stop nodes. I use Nokogiri to work with the DOM, but any solution for Ruby is fine.

Difficulty:

I can't simply node.traverse {|node| … } through the DOM and do a plain text search whenever a text node comes along, because my search string can cross tag boundaries.
I can't do a plain text search after converting the HTML to plain text, because I need the XPath indexes as result.

I could implement it myself with basic tree traversal, but before I do I'm asking if there is a Nokogiri function or trick to do it more comfortably.

Answer 1

You can do the following:

doc.search('div').find{|div| div.text[/ipsum dolor/]}

Answer 2

In the end, we used the following code. It is shown for the example given in the question, but it also works in the general case of arbitrarily deep HTML tag nesting, which is what we need.

In addition, we implemented it so that it tolerates runs of two or more whitespace characters. That is why we have to search for the end of the match instead of adding the length of the search string (the quote) to the start position: the number of whitespace characters in the search string and in the matched text might differ.
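The whitespace point can be seen in isolation with a small sketch that needs only plain Ruby. The sample text and the helper name `flexible_pattern` are made up for illustration and are not part of Nokogiri; the pattern is built the same way as in the code that follows.

```ruby
# Build a case-insensitive pattern in which every whitespace run of the quote
# matches any whitespace run in the text.
# `flexible_pattern` is a hypothetical helper name, not a library function.
def flexible_pattern(quote)
  words = quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }
  Regexp.new(words.join('[[:space:]]+'), Regexp::IGNORECASE)
end

text  = "Lorem ipsum \n  dolor sit amet." # made-up sample with extra whitespace
quote = 'ipsum dolor'

match = text.match(flexible_pattern(quote))
start_index = match.begin(0) # character index where the match starts
end_index   = match.end(0)   # character index just past the match

puts start_index             # 6
puts end_index - start_index # 14, the length of the matched text
puts quote.size              # 11, the length of the quote
```

The matched text is three characters longer than the quote here, so `start_index + quote.size` would stop short of the real end of the match.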

require 'nokogiri'

doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'


# Find search string in document text, "plain text in plain text".

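quote_query =
  quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
match = doc.text.match(/#{quote_query}/i)
# Fail with a clear message instead of a NoMethodError on nil.
raise ArgumentError, 'quote not found in document' unless match
start_index = match.begin(0) # character index of the first matched character
end_index = match.end(0)     # character index just past the match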


# Find XPath values and character indexes for start and stop of search match.
# For that, walk through all text nodes and count characters until reaching 
# the start and end positions of the search match.

start_xpath, start_offset, end_xpath, end_offset = nil
i = 0

doc.xpath('.//text() | text()').each do |x|
  # Number of characters in the preceding siblings of this text node, so that
  # offsets count from the start of the parent element's text.
  sum = 0
  e = x.previous
  while e
    sum += e.text.size
    e = e.previous
  end
  # XPath of the parent element: strip the leading "?" and the text() step.
  xpath = x.path.gsub(/^\?/, '').gsub(%r{/text\(\).*$}, '')

  x.text.each_char.with_index do |_, offset|
    if i == start_index
      start_xpath = xpath
      start_offset = offset + sum
    end
    # A separate `if` (not `elsif`) so that a one-character match gets an end too.
    if i + 1 == end_index
      end_xpath = xpath
      end_offset = offset + 1 + sum
    end
    i += 1
  end
end

At this point, we have the desired XPath values for the start and end of the search match, plus character offsets that point to the exact character inside each XPath-designated element. We get:

puts start_xpath
  /div
puts start_offset
  6
puts end_xpath
  /div/b
puts end_offset
  5
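The counting idea behind the loop can also be sketched without Nokogiri. In the following, `SEGMENTS` is a hand-written stand-in for the text nodes of the example in the question (parent path plus text), and `locate` is a hypothetical helper name; neither comes from the answer's code or from any library.

```ruby
# Each pair stands in for one text node: [path of its parent element, text].
# The values are typed in by hand for the example in the question; in the
# real code they would come from the Nokogiri text nodes.
SEGMENTS = [
  ['/div',   'Lorem ipsum '],
  ['/div/b', 'dolor sit'],
  ['/div',   ' amet.']
].freeze

# Map an index into the concatenated text to [path, offset inside that segment].
def locate(segments, index)
  consumed = 0
  segments.each do |path, text|
    return [path, index - consumed] if index < consumed + text.size
    consumed += text.size
  end
  nil # index lies beyond the end of the text
end

start_index = 6  # where "ipsum dolor" starts in "Lorem ipsum dolor sit amet."
end_index   = 17 # index just past the match

start_path, start_offset = locate(SEGMENTS, start_index)
# Find the last matched character, then move the end pointer one past it.
end_path, last_char_offset = locate(SEGMENTS, end_index - 1)
end_offset = last_char_offset + 1

p [start_path, start_offset]
p [end_path, end_offset]
```

Unlike the code above, this sketch reports offsets relative to the segment itself and does not add the text length of preceding siblings. For the example in the question both approaches give the same numbers (`/div` with offset 6 and `/div/b` with offset 5).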

2 answers

solution1
1 2017-09-08 02:06:56

solution2
0 ACCEPTED 2017-09-11 17:03:29