Ruby, Nokogiri: how do I ensure UTF-8 throughout Nokogiri parsing, the ERB template, and the generated HTML file?

I finally managed to parse parts of a website:

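require 'sinatra'
require 'nokogiri'
require 'open-uri'

get '/' do
  url = '<website>'
  # URI.open (from open-uri) fetches the page; Kernel#open no longer accepts URLs in Ruby 3
  data = Nokogiri::HTML(URI.open(url))
  @rows = data.css("td[valign=top] table tr")
  erb :muster
end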

Now I am trying to extract a certain row in my view, so I put this in my template:

<%= @rows[2] %> 

It does return the markup, but there is a UTF-8 problem. The source contains:

<td class="class_name">&nbsp;</td>

but the rendered output shows:

<td class="class_name">�</td>

How do I ensure UTF-8 during Nokogiri parsing, ERB rendering, and HTML generation?

See: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html#encoding

It looks like, in your case, the document declares that it is encoded as ISO-8859-1:

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

You can pass an explicit encoding as the third argument to force Nokogiri to treat the stream as UTF-8:

# The third argument overrides the charset declared in the document's <meta> tag
data = Nokogiri::HTML(URI.open(url), nil, Encoding::UTF_8.to_s)
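
To see why a charset mismatch produces the � character, here is a minimal sketch that uses only Ruby's core String methods, with no Nokogiri involved. The sample string is hypothetical: it stands in for a cell containing a non-breaking space stored as ISO-8859-1, which is an assumption about what the page in the question actually holds.

```ruby
# Hypothetical sample: a table cell whose content is one non-breaking space,
# stored as the single ISO-8859-1 byte 0xA0.
latin1 = "<td>\xA0</td>".b.force_encoding(Encoding::ISO_8859_1)

# Relabeling the same bytes as UTF-8 converts nothing; a lone 0xA0 is not a
# valid UTF-8 sequence, so a UTF-8 page displays it as the replacement character.
mislabeled = latin1.dup.force_encoding(Encoding::UTF_8)
puts mislabeled.valid_encoding?   # false

# Transcoding rewrites the bytes: U+00A0 becomes the two bytes 0xC2 0xA0.
utf8 = latin1.encode(Encoding::UTF_8)
puts utf8.valid_encoding?         # true
puts utf8.bytes.length - latin1.bytes.length   # 1
```

The same `valid_encoding?` check can be run on whatever string is handed to the ERB template, as a quick way to tell whether the bytes and the page's declared charset agree.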
