How to correctly parse this bad HTML in Nokogiri?

I'm trying to parse this HTML with Nokogiri:

<div class="times">
<span style="color:"><span style="padding:0 ">&lrm;</span><!--  -->16:45&lrm;</span>
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->19:30&lrm;</span> 
<span style="color:"><span style="padding:0 "> &nbsp;&lrm;</span><!--  -->22:10&lrm;</span>
</div>

I only want to extract the times and put them in an array.

I tried chaining gsub calls like this:

 block.css('div.times span').text.gsub(" ","").gsub("&nbsp","")

But then I end up with a single string and I'm stuck. Is there an efficient way to do this?

The simplest is probably:

block.at('div.times').text.scan(/\d{2}:\d{2}/)

One thing you could do is leave the whitespace in the string, and then use String#split to convert it to an array:

block.css('div.times span').text.gsub("&nbsp","").split(' ')

In this case you might need to strip out the left-to-right marks as well, and I don't think you need to replace the non-breaking spaces, so you could try this:

block.css('div.times span').text.gsub("\u200e", '').split(' ')
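One caveat worth checking with the split approach: String#split(' ') splits only on ASCII whitespace, so a non-breaking space (U+00A0) may stay attached to a token. The sketch below uses a plain string as an assumed stand-in for what Nokogiri's text would return once the entities are decoded (the exact text depends on the real page), and compares the naive split with one that splits on the POSIX `[[:space:]]` class.

```ruby
# Assumed stand-in for the decoded text:
# &lrm; has become U+200E and &nbsp; has become U+00A0.
text = "\u200e16:45\u200e \u00a0\u200e19:30\u200e \u00a0\u200e22:10\u200e"

# split(' ') ignores U+00A0, so it remains at the start of some tokens.
naive = text.gsub("\u200e", '').split(' ')

# [[:space:]] also matches U+00A0, so those characters act as separators.
# reject(&:empty?) guards against a leading empty token.
times = text.delete("\u200e").split(/[[:space:]]+/).reject(&:empty?)

p naive
p times
```

If the tokens look right but fail an equality check later, a leftover U+00A0 or U+200E is a likely cause; `inspect` makes those characters visible.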

(\u200e is the left-to-right mark, U+200E.)

An alternative with Nokogiri is to use XPath instead of CSS, which lets you select just the text nodes you want directly, then use map to convert them to an array of strings:

block.xpath('//div[@class="times"]/span/text()').map(&:text)
