Converting gsub() pattern from ruby 1.8 to 2.0

I have a Ruby program that I'm trying to upgrade from Ruby 1.8 to Ruby 2.0.0-p247.

This works just fine in 1.8.7:

 begin
   ARGF.each do |line|
     @line = line  # remember the current line so the rescue block can report it
     # a collection of peculiarities, appended as they appear in data
     line.gsub!("\x92", "'")
     line.gsub!("\x96", "-")
     puts line
   end
 rescue => e
   $stderr << "exception on line #{$.}:\n"
   $stderr << "#{e.message}:\n"
   $stderr << @line
 end

But under Ruby 2.0, this raises an exception when it encounters a 0x92 or 0x96 byte in a data file that otherwise appears to contain plain ASCII:

 invalid byte sequence in UTF-8

I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc. and am stumped.

Can anybody fill in the missing puzzle piece for me?

Thanks.

=============== additions: 2013-09-25 ============

Changing \x92 to \u2019 did not fix the problem.

The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.

It's not the regex that's throwing the exception, it's the Ruby compiler. \x92 and \x96 are how you would represent ’ and – in the Windows-1252 encoding, but Ruby expects the string to be UTF-8 encoded. You need to get out of the habit of putting raw byte values like \x92 in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019 and \u2013).
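To make the difference between a raw byte and a Unicode escape concrete, here is a minimal sketch, assuming Ruby 2.0 or later with the default UTF-8 source encoding. The variable names are only illustrative and do not come from the original program.

```ruby
# A lone 0x92 byte in a literal is tagged UTF-8 by default, but it is not a
# valid UTF-8 sequence on its own.
raw_byte = "\x92"

# The Unicode escape names the character itself (RIGHT SINGLE QUOTATION MARK).
escaped = "\u2019"

puts raw_byte.valid_encoding?   # false
puts escaped.valid_encoding?    # true
puts escaped.bytes.inspect      # [226, 128, 153], i.e. three bytes in UTF-8

# The same 0x92 byte, reinterpreted as Windows-1252 and transcoded to UTF-8,
# becomes the character that the escape names.
converted = raw_byte.dup.force_encoding("Windows-1252").encode("UTF-8")
puts converted == escaped       # true
```

Note that the escape only changes the pattern. If the input lines still carry raw Windows-1252 bytes, they are not valid UTF-8 either, which fits the observation that the error only shows up on lines that actually contain a 92 or 96.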

It's a Unicode world now: stop thinking of text in terms of bytes and think in terms of characters instead.
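One way to act on that advice is to tell Ruby what encoding the data is really in and let it transcode to UTF-8 before any substitution happens. The following is only a sketch under the assumption that the file is Windows-1252. The helper name `cp1252_to_utf8` and the sample line are made up for illustration.

```ruby
# Hypothetical helper (not from the original post): reinterpret raw bytes as
# Windows-1252 and transcode them to UTF-8.
def cp1252_to_utf8(raw)
  # dup so the caller's string is not retagged in place.
  # Bytes that Windows-1252 leaves undefined may raise
  # Encoding::UndefinedConversionError here.
  raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

# Simulated input line containing the 0x92 and 0x96 bytes from the question.
raw = "It\x92s 1 \x96 2\n".b

line = cp1252_to_utf8(raw)
line.gsub!("\u2019", "'")   # right single quotation mark -> ASCII apostrophe
line.gsub!("\u2013", "-")   # en dash -> ASCII hyphen
puts line
```

In the original loop the same effect can probably be had without a helper, by calling `ARGF.set_encoding("Windows-1252:UTF-8")` before `ARGF.each`, or by running the script with `ruby -E Windows-1252:UTF-8`. Either way, the lines arrive as valid UTF-8 and the `\u2019` and `\u2013` patterns have real characters to match.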
