How to correctly parse this bad html in Nokogiri?

Question

I'm trying to parse this HTML with Nokogiri:

<div class="times">
<span style="color:"><span style="padding:0 ">&lrm;</span><!--  -->16:45&lrm;</span>
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->19:30&lrm;</span> 
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->22:10&lrm;</span>
</div>

I only want to get the times, inserted in an array.

I set up a gsub like this:

 block.css('div.times span').text.gsub(" ","").gsub("&nbsp","")

But then I end up with a single string and I'm kind of stuck. Is there an efficient way to do this?

Answer 1

The simplest is probably:

block.at('div.times').text.scan(/\d{2}:\d{2}/)
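To see why the scan approach is robust against the stray markup, here is a minimal sketch that runs without Nokogiri. SAMPLE_TEXT is a hypothetical stand-in for what block.at('div.times').text might return once the entities are decoded (U+200E for &lrm;, U+00A0 for &nbsp;), and extract_times is an illustrative helper name, not part of any library.

```ruby
# Hypothetical stand-in for block.at('div.times').text; the real string
# depends on how the page is parsed.
SAMPLE_TEXT = "\n\u200e16:45\u200e\n \u00a0\u200e19:30\u200e \n \u00a0\u200e22:10\u200e\n"

# Collect every HH:MM-looking substring and ignore whatever surrounds it
# (whitespace, non-breaking spaces, left-to-right marks).
def extract_times(text)
  text.scan(/\d{2}:\d{2}/)
end

p extract_times(SAMPLE_TEXT)  # => ["16:45", "19:30", "22:10"]
```

Because the pattern only matches the digits and the colon, nothing has to be stripped beforehand. The trade-off is that a single-digit hour such as 9:30 would not match \d{2}:\d{2}.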

Answer 2

One thing you could do is to leave the whitespace in the string, and then use String#split to convert it to an array:

block.css('div.times span').text.gsub("&nbsp","").split(' ')

In this case you might need to strip out the left-to-right markers as well, and I don't think you need to replace the non-breaking spaces, so you could try this:

block.css('div.times span').text.gsub("\u200e", '').split(' ')

(\u200e is the left-to-right mark, U+200E.)
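The strip-and-split idea can be tried on a plain string, as in the sketch below. It assumes the extracted text contains left-to-right marks and possibly non-breaking spaces that have been decoded to U+00A0. It goes one step further than the snippet above by turning those into ordinary spaces, in case they would otherwise stay glued to the times. NODESET_TEXT and split_times are hypothetical names used only for illustration.

```ruby
LRM  = "\u200e"  # left-to-right mark (&lrm;)
NBSP = "\u00a0"  # non-breaking space (&nbsp;)

# Hypothetical stand-in for block.css('div.times span').text. The nested
# spans are matched as well, so their text shows up a second time.
NODESET_TEXT = "\u200e16:45\u200e\u200e \u00a0\u200e19:30\u200e \u00a0\u200e \u00a0\u200e22:10\u200e \u00a0\u200e"

# Drop the marks, turn non-breaking spaces into plain spaces, then split
# on runs of whitespace.
def split_times(text)
  text.delete(LRM).tr(NBSP, " ").split
end

p split_times(NODESET_TEXT)  # => ["16:45", "19:30", "22:10"]
```

Calling split without arguments discards leading, trailing and repeated whitespace, so no empty strings end up in the array.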

An alternative with Nokogiri is to use XPath instead of CSS, which lets you select just the text nodes you want directly; then use map to convert them to an array of strings:

block.xpath('//div[@class="times"]/span/text()').map(&:text)
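In the HTML from the question, the text nodes selected this way would presumably still end with the left-to-right mark, since &lrm; follows each time inside the outer span. A minimal cleanup sketch, assuming the XPath call returns strings like those in RAW_NODES (a hypothetical stand-in, as is the helper name clean_time_nodes):

```ruby
# Hypothetical result of the XPath query followed by map(&:text).
RAW_NODES = ["16:45\u200e", "19:30\u200e", "22:10\u200e"]

# Remove the left-to-right mark from each string, trim surrounding
# whitespace, and drop anything that ends up empty.
def clean_time_nodes(strings)
  strings.map { |s| s.delete("\u200e").strip }.reject(&:empty?)
end

p clean_time_nodes(RAW_NODES)  # => ["16:45", "19:30", "22:10"]
```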

2 answers

solution1
2 2012-06-30 09:00:13

solution2
1 ACCEPTED 2012-06-30 11:16:22
