Ruby 1.9: regular expressions with unknown input encoding

Question

Is there an accepted way to deal with regular expressions in Ruby 1.9 when the encoding of the input is unknown? Suppose my input happens to be UTF-16 encoded:

x  = "foo<p>bar</p>baz"
y  = x.encode('UTF-16LE')
re = /<p>(.*)<\/p>/

x.match(re) 
=> #<MatchData "<p>bar</p>" 1:"bar">

y.match(re)
Encoding::CompatibilityError: incompatible encoding regexp match (US-ASCII regexp with UTF-16LE string)
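
For context on the error above: a regexp literal containing only ASCII characters can be applied to a string in any ASCII-compatible encoding, and UTF-16LE is not ASCII-compatible. A minimal sketch of that distinction follows; the helper name `ascii_regexp_applicable?` is hypothetical and not part of the original code.

```ruby
# Hypothetical helper: can an ASCII-only, non-fixed regexp literal be
# applied to +str+ without transcoding?
def ascii_regexp_applicable?(str)
  str.encoding.ascii_compatible?
end

x = "foo<p>bar</p>baz"
y = x.encode('UTF-16LE')

puts ascii_regexp_applicable?(x)   # true
puts ascii_regexp_applicable?(y)   # false

# An ASCII-compatible encoding other than UTF-8 needs no conversion.
puts x.encode('ISO-8859-1').match(/<p>(.*)<\/p>/)[1]   # bar
```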

My current approach is to use UTF-8 internally and re-encode (a copy of) the input when necessary:

if y.respond_to?(:encode)  # String#encode is missing on Ruby 1.8
  y = y.encode('UTF-8') unless y.encoding == Encoding::UTF_8
end

y.match(/<p>(.*)<\/p>/u)
=> #<MatchData "<p>bar</p>" 1:"bar">

However, this feels a bit awkward to me, so I wanted to ask whether there is a better way.

Answer 1

As far as I know, there is no better method to use. However, may I suggest a slight alteration?

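Rather than changing the encoding of the input, why not change the encoding of the regex? Translating one regex string each time you meet a new encoding is far less work than translating hundreds or thousands of lines of input to match the encoding of your regex.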

# Utility function to make transcoding the regex simpler.
def get_regex(pattern, encoding='ASCII', options=0)
  Regexp.new(pattern.encode(encoding),options)
end



  # Inside code looping through lines of input.
  # The variables 'regex' and 'last_encoding' should be initialized beforehand so
  # that they persist across iterations.
  if line.respond_to?(:encoding)  # Ruby 1.8 compatibility
    if line.encoding != last_encoding
      # 16 is Regexp::FIXEDENCODING, the option bit that //u sets.
      regex = get_regex('<p>(.*)<\/p>', line.encoding, 16)
      last_encoding = line.encoding
    end
  end
  line.match(regex)

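In the pathological case (where the input encoding changes on every line), this would be just as slow, since you re-encode the regex on every pass through the loop. But in 99.9% of situations, where the encoding is constant for an entire file of hundreds or thousands of lines, this results in a vast reduction in re-encoding.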
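
The same idea can be sketched with a Hash that compiles the pattern once per encoding, so even input that alternates between a few encodings reuses the compiled regexps. This is only a sketch under assumptions: `build_regex_cache` is a hypothetical name, and the example exercises ASCII-compatible encodings only. Whether `Regexp.new` behaves as desired for a non-ASCII-compatible encoding such as UTF-16LE on your Ruby version is something to check separately.

```ruby
# Hypothetical helper: returns a Hash that lazily compiles +source+
# for each encoding it is asked about and remembers the result.
def build_regex_cache(source)
  Hash.new do |cache, encoding|
    cache[encoding] = Regexp.new(source.encode(encoding), Regexp::FIXEDENCODING)
  end
end

cache = build_regex_cache('<p>(.*)</p>')

lines = [
  "foo<p>caf\u00e9</p>baz",                       # UTF-8
  "foo<p>caf\u00e9</p>baz".encode('ISO-8859-1')   # Latin-1
]

lines.each do |line|
  m = line.match(cache[line.encoding])
  # Transcode the capture only for display.
  puts "#{line.encoding}: #{m[1].encode('UTF-8')}"
end
```

The input lines themselves are never transcoded here; only the short pattern string is, once per distinct encoding.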

Answer 2

Follow the advice on this page: http://gnuu.org/2009/02/02/ruby-19-common-problems-pt-1-encoding/ and add

# encoding: utf-8

to the top of your rb file.
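
Note that the magic comment sets the encoding of the source file, that is, of the string and regexp literals written in it. It does not transcode strings that arrive from elsewhere. A small sketch of that limit, where `try_match` is a hypothetical helper introduced only for illustration:

```ruby
# encoding: utf-8
# Ruby only honours the magic comment on the first line of a file
# (or the second, after a shebang).

# Hypothetical helper: returns the MatchData, nil, or :incompatible.
def try_match(str, regex)
  str.match(regex)
rescue Encoding::CompatibilityError
  :incompatible
end

re    = /<p>(.*)<\/p>/
utf16 = "foo<p>bar</p>baz".encode('UTF-16LE')

p try_match(utf16, re)                          # :incompatible
puts try_match(utf16.encode('UTF-8'), re)[1]    # bar
```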

Answer 1
Score 9, accepted, 2009-12-22 00:26:05

Answer 2
Score 0, 2010-06-29 12:20:35
