Ruby 2.0.0 String#match ArgumentError: invalid byte sequence in UTF-8

Question

I see this a lot and haven't found a graceful solution. If user input contains invalid byte sequences, I need to handle it without raising an exception. For example:

# @raw_response comes from user and contains invalid UTF-8
# for example: @raw_response = "\xBF"  
regex.match(@raw_response)
ArgumentError: invalid byte sequence in UTF-8

Numerous similar questions have been asked, and the usual answer is to encode or force-encode the string. Neither of these works for me, however:

regex.match(@raw_response.force_encoding("UTF-8"))
ArgumentError: invalid byte sequence in UTF-8

or

regex.match(@raw_response.encode("UTF-8", :invalid=>:replace, :replace=>"?"))
ArgumentError: invalid byte sequence in UTF-8

Is this a bug in Ruby 2.0.0, or am I missing something?

What is strange is that the string appears to be encoded correctly, yet match continues to raise an exception:

@raw_response.encode("UTF-8", :invalid=>:replace, :replace=>"?").encoding
 => #<Encoding:UTF-8>
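
The encoding reported above is only the label attached to the string; it says nothing about whether the bytes are well-formed. A minimal sketch of checking both, where `encoding_report` is a hypothetical helper name and not part of Ruby:

```ruby
# encoding_report is a hypothetical helper, not a Ruby built-in.
# It returns the encoding label together with the result of
# String#valid_encoding?, which actually inspects the bytes.
def encoding_report(str)
  { label: str.encoding.name, valid: str.valid_encoding? }
end

# force_encoding only relabels the string; the bytes are left untouched.
raw = "\xBF".dup.force_encoding("UTF-8")
p encoding_report(raw)
```

For the sample input the label is UTF-8 while `valid_encoding?` is false, which is the situation in which match raises.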

Answer 1

In Ruby 2.0 the encode method is a no-op when encoding a string to its current encoding:

Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.

This changed in 2.1, which also added the scrub method as an easier way to do this.
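
On Ruby 2.1 or later, a minimal sketch of using scrub before matching might look like the following. `safe_match` is a hypothetical helper name, and "?" as the replacement is just one choice:

```ruby
# Requires Ruby 2.1+, where String#scrub was introduced.
# safe_match is a hypothetical helper name, not part of Ruby.
def safe_match(regex, input)
  # scrub returns a copy with each invalid byte sequence replaced by "?",
  # so the regex engine never sees malformed UTF-8.
  regex.match(input.scrub("?"))
end

m = safe_match(/b(.)d/, "ab\xBFd")
puts m[1] # the invalid byte \xBF was replaced, so the capture is "?"
```

Because scrub returns a new string, the original input is left unmodified.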

If you are unable to upgrade to 2.1, you'll have to encode into a different encoding and back in order to remove invalid bytes, something like:

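# Only round-trip strings that actually contain invalid bytes.
unless s.valid_encoding?
  # Going through UTF-16BE forces a real conversion, so :invalid => :replace takes effect.
  s = s.encode("UTF-16BE", :invalid => :replace, :replace => "?").encode("UTF-8")
end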
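
If the same code has to run on both 2.0 and 2.1+, the two approaches can be combined. This is only a sketch: `to_valid_utf8` is a hypothetical helper name, and it assumes the input is already tagged as UTF-8.

```ruby
# to_valid_utf8 is a hypothetical helper name, not part of Ruby.
# Assumption: str is already tagged as UTF-8 (e.g. via force_encoding).
def to_valid_utf8(str, replacement = "?")
  return str if str.valid_encoding?

  if str.respond_to?(:scrub)
    # Ruby 2.1+: scrub replaces invalid byte sequences directly.
    str.scrub(replacement)
  else
    # Ruby 2.0: UTF-8 -> UTF-8 is a no-op, so round-trip through UTF-16BE
    # to force a real conversion that honors :invalid => :replace.
    str.encode("UTF-16BE", :invalid => :replace, :replace => replacement)
       .encode("UTF-8")
  end
end

clean = to_valid_utf8("a\xBFb")
puts clean.valid_encoding?
```

Valid strings are returned unchanged, so the helper is cheap to call on every piece of user input before it reaches a regex.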

Answer 2

Since you're using Rails and not just Ruby, you can also use tidy_bytes. This works with Ruby 2.0 and will probably also give you back sensible data instead of just replacement characters.

Answer 1: score 44, accepted, 2014-06-04 12:47:22

Answer 2: score 6, 2014-11-12 19:14:42
