
Converting gsub() pattern from Ruby 1.8 to 2.0

I have a Ruby program that I'm trying to upgrade from Ruby 1.8 to Ruby 2.0.0-p247.

This works just fine in 1.8.7:

 begin
   ARGF.each do |line|
     @line = line  # remember the current line so the rescue clause can print it
     # a collection of peculiarities, appended as they appear in data
     line.gsub!("\x92", "'")
     line.gsub!("\x96", "-")
     puts line
   end
 rescue => e
   $stderr << "exception on line #{$.}:\n"
   $stderr << "#{e.message}:\n"
   $stderr << @line
 end

But under Ruby 2.0, this results in an exception when it encounters a \x96 or \x92 byte in a data file that otherwise appears to contain only ASCII:

 invalid byte sequence in UTF-8

I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc., and am stumped.

Can anybody fill in the missing puzzle piece for me?

Thanks.

=============== additions: 2013-09-25 ============

Changing \x92 to \u2019 did not fix the problem.

The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.

It's not the regex that's throwing the exception, it's the Ruby compiler. \x92 and \x96 are how you would represent ’ and – in the Windows-1252 encoding, but Ruby expects the string to be UTF-8 encoded. You need to get out of the habit of putting raw byte values like \x92 in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019 and \u2013).
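To make the byte-versus-character distinction concrete, here is a small sketch. The variable names are only illustrative, and it assumes the source file is UTF-8 (the default since Ruby 2.0):

```ruby
# A lone 0x92 byte is not a complete UTF-8 sequence.
raw_byte = "\x92"

# The same punctuation written as characters, via Unicode escapes.
right_quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK
en_dash     = "\u2013"  # EN DASH

raw_is_valid   = raw_byte.valid_encoding?     # false: invalid as UTF-8
quote_is_valid = right_quote.valid_encoding?  # true
quote_bytes    = right_quote.bytes.to_a       # three bytes in UTF-8

# Reinterpreting the raw byte as Windows-1252 and transcoding it
# yields the same character as the Unicode escape.
transcoded = raw_byte.dup.force_encoding("Windows-1252").encode("UTF-8")
same_character = (transcoded == right_quote)
```

The point is that "\x92" only means ’ once you say the bytes are Windows-1252, whereas "\u2019" names the character directly.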

It's a Unicode world now; stop thinking of text in terms of bytes and think in terms of characters instead.
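As a rough sketch of what that could look like for the loop in the question: this assumes the data file is really Windows-1252 (the question does not confirm that), and `normalize_line` is a hypothetical helper name, not something from the original program. Each line is first converted to UTF-8, and only then are the characters matched with Unicode escapes:

```ruby
# Hypothetical helper. Assumes the incoming bytes are Windows-1252.
def normalize_line(raw)
  # Label the bytes with their real encoding, then transcode to UTF-8.
  text = raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)

  # Now the patterns can be written as characters rather than raw bytes.
  text.gsub("\u2019", "'").gsub("\u2013", "-")
end

# Simulated input line containing the two problem bytes.
sample = "it\x92s 10 \x96 20\n".b
puts normalize_line(sample)
```

In the original program the same conversion would be applied to each `line` before the substitutions. Calling `ARGF.set_encoding("Windows-1252:UTF-8")` before the loop should have a similar effect, but I have not tested that against your data.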
