Ruby 1.9: invalid byte sequence in UTF-8

Question

I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged.
What would be the best way to actually work with that incoming data? I tried .encode with the :replace and :invalid options set, but no success so far...
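
For reference, the error can be reproduced without any network access. The sketch below assumes the situation is what the question suggests: the fetched bytes end up tagged as UTF-8 while containing a byte that is not valid UTF-8. The sample fragment and the variable names are made up for illustration.

```ruby
# Hypothetical page fragment: 0xE9 is "e acute" in ISO-8859-1, but a lone
# 0xE9 byte is not a valid UTF-8 sequence.
html = "<a href=\"/caf\xE9\">menu</a>".dup.force_encoding('UTF-8')

# valid_encoding? reports whether the bytes match the string's encoding tag.
valid = html.valid_encoding?

# Run the same scan as in the question and capture the error message, if any.
error_message =
  begin
    html.scan(/href="(.*?)"/i)
    nil
  rescue ArgumentError => e
    e.message
  end

puts "valid UTF-8? #{valid}"
puts "scan raised: #{error_message.inspect}"
```

Checking valid_encoding? before scanning is a cheap way to decide whether a page needs one of the clean-up steps from the answers at all.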

Answer 1

In Ruby 1.9.3 it is possible to use String#encode to "ignore" invalid UTF-8 sequences. Here is a snippet that works in both 1.8 (iconv) and 1.9 (String#encode):

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-8', 'UTF-8', :invalid => :replace)
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end

Or, if you have really troublesome input, you can do a double conversion from UTF-8 to UTF-16 and back to UTF-8:

require 'iconv' unless String.method_defined?(:encode)
if String.method_defined?(:encode)
  file_contents.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  file_contents.encode!('UTF-8', 'UTF-16')
else
  ic = Iconv.new('UTF-8', 'UTF-8//IGNORE')
  file_contents = ic.iconv(file_contents)
end
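
As a rough illustration of the String#encode branch of the double conversion, here is a small sketch on a made-up input. The helper name strip_invalid_utf8 and the sample string are assumptions for the example, not part of Ruby or of the original answer.

```ruby
# Round-trips through UTF-16 so that invalid UTF-8 bytes are dropped.
# `strip_invalid_utf8` is a hypothetical helper name, not a Ruby built-in.
def strip_invalid_utf8(bytes)
  text  = bytes.dup.force_encoding('UTF-8')
  utf16 = text.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  utf16.encode('UTF-8', 'UTF-16')
end

# Valid "e acute" (0xC3 0xA9) followed by a stray 0xFF byte.
cleaned = strip_invalid_utf8("caf\xC3\xA9 \xFF menu")

puts cleaned.valid_encoding?
puts cleaned
```

Unlike the one-step variant, this forces a real transcoding pass, at the cost of converting the whole document twice.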

Answer 2

Neither the accepted answer nor the other answer worked for me. I found this post, which suggested:

string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

This fixed the problem for me.
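
One thing worth knowing before adopting this: because the source encoding is given as 'binary', every byte above 0x7F counts as undefined and is replaced, including bytes that were part of valid UTF-8 characters. A minimal sketch on a made-up string:

```ruby
# Valid "e acute" (0xC3 0xA9) followed by a stray 0xFF byte.
raw = "caf\xC3\xA9 \xFF".dup

# Same call as in the answer; the string's own encoding tag is ignored
# because the source encoding is passed explicitly as 'binary'.
cleaned = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

puts cleaned.inspect   # only the ASCII characters are left
```

That is fine for pulling ASCII-only href values out of a page, but it loses all non-ASCII text.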

Answer 3

My current solution is to run:

my_string.unpack("C*").pack("U*")

This will at least get rid of the exceptions, which was my main problem.
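
To see what this does to the data, note that unpack("C*") yields the raw bytes and pack("U*") treats each number as a Unicode code point. The sketch below uses two made-up inputs to show the effect: single-byte Latin-1 text comes out right, while text that was already UTF-8 gets double-encoded.

```ruby
latin1_bytes = "caf\xE9".dup.force_encoding('BINARY')   # "e acute" as one ISO-8859-1 byte
utf8_text    = "caf\u00E9".dup                          # the same word, already valid UTF-8

from_latin1 = latin1_bytes.unpack("C*").pack("U*")
from_utf8   = utf8_text.unpack("C*").pack("U*")

puts from_latin1   # the byte 0xE9 became the code point U+00E9
puts from_utf8     # the bytes 0xC3 0xA9 became the code points U+00C3 and U+00A9
```

So the exceptions go away because the result is always valid UTF-8, but non-ASCII characters in pages that really were UTF-8 will be garbled.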

Answer 4

Try this:

def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end
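
A short usage sketch of the method above, with made-up inputs, shows its two paths: data that is already valid UTF-8 is kept intact, and anything else falls back to the lossy binary conversion.

```ruby
def to_utf8(str)
  str = str.force_encoding('UTF-8')
  return str if str.valid_encoding?
  str.encode("UTF-8", 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Valid UTF-8 bytes that merely carry the wrong (binary) tag: kept as-is.
already_valid = to_utf8("caf\xC3\xA9".dup.force_encoding('BINARY'))

# Same bytes plus a stray 0xFF: the fallback drops every non-ASCII byte.
repaired = to_utf8("caf\xC3\xA9 \xFF".dup)

puts already_valid
puts repaired.inspect
```

Note that force_encoding retags the string passed in, so callers who need the original untouched should pass a copy.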

Answer 5

I recommend that you use an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers parse invalid UTF-8 sequences in UTF-8 HTML documents by simply substituting the "�" symbol (U+FFFD, the replacement character). So once an invalid UTF-8 sequence in the HTML has been parsed, the resulting text is a valid string.

Even inside attribute values you have to decode HTML entities like &amp;.

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression: RegEx match open tags except XHTML self-contained tags
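
The point about entities can be shown with a small sketch. It keeps the regex from the question (so it is not the parser this answer recommends) and uses CGI.unescapeHTML from Ruby's cgi library to decode the captured value; the sample markup is made up.

```ruby
require 'cgi'

# Hypothetical fragment: the "&" in the query string is written as "&amp;".
html = '<a href="/search?q=ruby&amp;page=2">next</a>'

# The regex returns the attribute value exactly as written in the markup.
raw_links = html.scan(/href="(.*?)"/i).flatten

# Decoding the entities gives the URL a browser would actually request.
decoded_links = raw_links.map { |link| CGI.unescapeHTML(link) }

puts raw_links.inspect
puts decoded_links.inspect
```

A real parser handles this step, along with unquoted attributes and other markup variations, without extra code.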

Answer 6

This seems to work:

def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end
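
A quick check of the method above on made-up inputs: String#chars yields each invalid byte as its own one-character string, so filtering on valid_encoding? removes exactly those bytes and leaves valid multibyte characters alone.

```ruby
def sanitize_utf8(string)
  return nil if string.nil?
  return string if string.valid_encoding?
  string.chars.select { |c| c.valid_encoding? }.join
end

# Valid "e acute" (0xC3 0xA9) with a stray 0xFF byte in the middle of the text.
cleaned   = sanitize_utf8("caf\xC3\xA9 \xFFmenu")
untouched = sanitize_utf8("plain ascii")
missing   = sanitize_utf8(nil)

puts cleaned
puts untouched
puts missing.inspect
```

Building an array of every character is slow on large pages, which is why the valid_encoding? shortcut at the top matters for a crawler.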

Answer 7

attachment = file.read

begin
  # Try it as UTF-8 directly
  cleaned = attachment.dup.force_encoding('UTF-8')
  unless cleaned.valid_encoding?
    # Some of it might be an old Windows code page
    cleaned = attachment.encode('UTF-8', 'Windows-1252')
  end
  attachment = cleaned
rescue EncodingError
  # Last resort: read the bytes as ISO-8859-1, which maps every byte to a character
  attachment = attachment.force_encoding('ISO-8859-1').encode('UTF-8')
end
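
The same fallback chain can be wrapped in a method so it is easier to try on sample data. This is a sketch only: decode_attachment is a hypothetical name, and the inputs are made up to hit each branch.

```ruby
# Tries UTF-8 first, then Windows-1252, then ISO-8859-1 as a last resort.
# `decode_attachment` is a hypothetical helper name.
def decode_attachment(bytes)
  cleaned = bytes.dup.force_encoding('UTF-8')
  return cleaned if cleaned.valid_encoding?
  # Raises an EncodingError subclass if a byte has no Windows-1252 mapping.
  bytes.encode('UTF-8', 'Windows-1252')
rescue EncodingError
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

utf8_input    = decode_attachment("caf\xC3\xA9".dup.force_encoding('BINARY'))
windows_input = decode_attachment("\x93hi\x94".dup.force_encoding('BINARY'))
odd_input     = decode_attachment("a\x81b".dup.force_encoding('BINARY'))

puts utf8_input
puts windows_input   # 0x93 and 0x94 are curly quotes in Windows-1252
puts odd_input.valid_encoding?
```

The order matters: UTF-8 is tried first because valid UTF-8 would also decode as Windows-1252 without an error, just with the wrong characters.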

Answer 8

I've encountered a string that mixed English, Russian and some other alphabets, which caused an exception. I only need Russian and English, and this currently works for me:

ec1 = Encoding::Converter.new "UTF-8","Windows-1251",:invalid=>:replace,:undef=>:replace,:replace=>""
ec2 = Encoding::Converter.new "Windows-1251","UTF-8",:invalid=>:replace,:undef=>:replace,:replace=>""
t = ec2.convert ec1.convert t

Answer 9

While Nakilon's solution works, at least as far as getting past the error, in my case I had a weird messed-up character, originating from a Microsoft Excel file converted to CSV, that was registering in Ruby as a (get this) Cyrillic K, which in Ruby showed up as a bolded K. To fix this I used 'iso-8859-1', viz. CSV.parse(f, :encoding => "iso-8859-1"), which turned my freaky-deaky Cyrillic K's into a much more manageable /\xCA/, which I could then remove with string.gsub!(/\xCA/, '').

Answer 10

Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images, which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this depends on which HTTP library you are using. Then, make sure the result is ASCII-only with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used otherwise). If both of those tests pass, it is safe to use scan.
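
The two checks could be combined roughly as follows. This is a sketch under assumptions: safe_to_scan? is a hypothetical helper, and the header value is passed in as a plain string because obtaining it depends on the HTTP library in use.

```ruby
# Returns true only for text/html responses whose body is pure ASCII,
# following the strict rule described in this answer.
# `safe_to_scan?` is a hypothetical helper name.
def safe_to_scan?(content_type, body)
  # Drop parameters such as "; charset=utf-8" and normalise the case.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  media_type == 'text/html' && body.ascii_only?
end

html_ok   = safe_to_scan?("text/html; charset=utf-8", '<a href="/x">x</a>')
image     = safe_to_scan?("image/png", "\x89PNG")
no_header = safe_to_scan?(nil, "abc")
non_ascii = safe_to_scan?("text/html", "caf\xE9")

puts [html_ok, image, no_header, non_ascii].inspect
```

Pages that fail the ASCII check are not necessarily broken; they just need one of the clean-up steps from the other answers before scanning.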

Answer 11

There is also the scrub method to filter out invalid bytes.

string.scrub('')
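
String#scrub is built in from Ruby 2.1 onward, so it is not available on the 1.9 interpreter from the question. A minimal sketch on a made-up string shows both forms: with an argument the invalid bytes are replaced by that string, and without one they become U+FFFD.

```ruby
# Valid "e acute" (0xC3 0xA9) followed by a stray 0xFF byte.
raw = "caf\xC3\xA9 \xFF menu"

removed  = raw.scrub('')   # invalid bytes deleted
replaced = raw.scrub       # invalid bytes become U+FFFD

puts removed
puts replaced
```

Unlike the conversions through 'binary', scrub leaves valid non-ASCII characters untouched.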

Answer 12

If you don't "care" about the data, you can just do something like:

search_params = params[:search].valid_encoding? ? params[:search].gsub(/\W+/, '') : "nothing"

I just used valid_encoding? to get past it. Mine is a search field, and I kept seeing the same weirdness over and over, so I used something like the above just to keep the system from breaking. Since I don't control the user experience to auto-validate before this info is sent (like auto feedback to say "dummy up!"), I can just take it in, strip it out, and return blank results.
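
Outside of a web framework the same guard can be tried on plain strings. In this sketch clean_search_term is a hypothetical helper standing in for the params lookup, and "nothing" is the placeholder value from the snippet above.

```ruby
# Strips non-word characters from valid input and substitutes a placeholder
# for input that is not valid in its encoding.
# `clean_search_term` is a hypothetical helper name.
def clean_search_term(raw)
  raw.valid_encoding? ? raw.gsub(/\W+/, '') : "nothing"
end

normal = clean_search_term("foo bar!")
broken = clean_search_term("foo\xFFbar")

puts normal
puts broken
```

The valid_encoding? check has to come first, since calling gsub with a regex on the broken string is what raises the error.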

12 answers (score, date posted)

Answer 1: 171, 2012-01-15 22:30:33
Answer 2: 79, 2013-08-26 23:02:58
Answer 3: 23, 2012-01-13 20:44:54
Answer 4: 8, 2014-05-12 13:45:08
Answer 5: 4, 2010-06-06 01:36:06
Answer 6: 3, 2013-05-15 12:41:51
Answer 7: 3, 2014-07-24 09:16:05
Answer 8: 2, 2012-01-08 13:51:34
Answer 9: 1, 2012-10-16 03:53:22
Answer 10: 0, 2010-06-06 00:45:59
Answer 11: 0, 2022-10-06 13:45:29
Answer 12: -1, 2013-08-29 14:13:14
