
How to correctly parse this bad HTML in Nokogiri?

I'm trying to parse this HTML with Nokogiri:

<div class="times">
<span style="color:"><span style="padding:0 ">&lrm;</span><!--  -->16:45&lrm;</span>
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->19:30&lrm;</span> 
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->22:10&lrm;</span>
</div>

I only want to get the times, inserted in an array.

I set up a gsub like this:

 block.css('div.times span').text.gsub(" ","").gsub("&nbsp","")

But then I end up with a single string and I'm kind of stuck. Is there an efficient way to do this?

The simplest way is probably:

block.at('div.times').text.scan /\d{2}:\d{2}/
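To see why scan copes with the stray characters, here is a minimal sketch that runs on a plain string instead of a parsed document. The variable text is a hypothetical stand-in for what block.at('div.times').text would return, assuming Nokogiri decodes &lrm; to U+200E and &nbsp; to U+00A0:

```ruby
# Hypothetical stand-in for block.at('div.times').text. The entities are
# assumed to arrive already decoded: &lrm; as U+200E, &nbsp; as U+00A0.
text = "\n\u200e16:45\u200e\n \u00a0\u200e19:30\u200e \n \u00a0\u200e22:10\u200e\n"

# scan returns every non-overlapping match of the pattern as an array,
# so anything that is not "two digits, colon, two digits" is ignored.
times = text.scan(/\d{2}:\d{2}/)

p times # prints ["16:45", "19:30", "22:10"]
```

Because the pattern only matches the times themselves, no separate cleanup of spaces or markers is needed.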

One thing you could do is to leave the whitespace in the string, and then use String#split to convert it to an array:

block.css('div.times span').text.gsub("&nbsp","").split(' ')

In this case you might need to strip out the left-to-right markers as well, and I don't think you need to replace the non-breaking spaces, so you could try this:

block.css('div.times span').text.gsub("\u200e", '').split(' ')

(\u200e is the left-to-right mark.)
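If the non-breaking spaces turn out not to be treated as separators by split, one option is to turn both invisible characters into ordinary spaces first. A minimal sketch on a plain string; the variable text is a hypothetical stand-in for the extracted text, assuming the entities have been decoded to U+200E and U+00A0:

```ruby
# Hypothetical stand-in for the text extracted from the spans, with
# &lrm; decoded to U+200E and &nbsp; decoded to U+00A0.
text = "\u200e16:45\u200e \u00a0\u200e19:30\u200e \u00a0\u200e22:10\u200e"

# Replace left-to-right marks and non-breaking spaces with plain spaces,
# then split on runs of whitespace (leading and trailing runs are dropped).
times = text.gsub(/[\u200e\u00a0]/, ' ').split

p times # prints ["16:45", "19:30", "22:10"]
```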

An alternative with Nokogiri is to use XPath instead of CSS, which lets you select just the text nodes you want directly, then use map to convert them to an array of strings:

block.xpath('//div[@class="times"]/span/text()').map(&:text)
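With this markup, each selected text node would likely still end with the left-to-right mark, because &lrm; follows the time inside the same span. A minimal sketch of a cleanup step; the array raw is a hypothetical stand-in for the result of the map call above, assuming the entity is decoded to U+200E:

```ruby
# Hypothetical stand-in for the strings produced by .map(&:text),
# each assumed to carry a trailing left-to-right mark (U+200E).
raw = ["16:45\u200e", "19:30\u200e", "22:10\u200e"]

# Remove the mark from each string; strip also drops any ASCII
# whitespace that might surround the time.
times = raw.map { |t| t.delete("\u200e").strip }

p times # prints ["16:45", "19:30", "22:10"]
```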


 