Ruby string encoding in Ruby 1.8.7

Question

I am creating a Ruby string using the Ruby C API (from Objective-C), and it happens to hold Finnish characters.

Once in Ruby, I call a gem that manipulates and truncates the string, but the encoded characters get truncated improperly, very much like in this question:

How to get a Ruby substring of a Unicode string?

An example string is H pääsee syvemmälle A elämään. The umlauts are shown as octal byte escapes such as \303\244 (the two UTF-8 bytes of ä), but when the string is truncated this ends up as a lone \303, which is a problem.

I don't want to hack the gem to get round this issue, because I have tested with the same string created directly in Ruby and it worked fine.

So I assume that I'm passing something in to Ruby incorrectly.

Here is how I turn the NSString into a VALUE to be used in Ruby.

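- (VALUE) toRubyValue {
    // Number of UTF-8 bytes in the string, excluding the NUL terminator.
    size_t data_length = [self lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    // getCString:maxLength:encoding: needs room for the NUL terminator.
    size_t buffer_length = data_length + 1;
    char buf[buffer_length];
    [self getCString:buf maxLength:buffer_length encoding:NSUTF8StringEncoding];
    // Copy exactly data_length bytes into a new Ruby string.
    return rb_str_new(buf, data_length);
}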

I'm on Ruby 1.8.7.

What is the best way to address this problem? I'm happy to do it in either Ruby or C (or Objective-C), but I would rather not use any Ruby gems that have native C extensions.

Answer 1

I don't think you're passing anything incorrectly to Ruby. You are creating a UTF-8 encoded Ruby 1.8 string. Ruby 1.8 has no notion of string encodings, though, and treats strings as arrays of bytes. This means that any Ruby code that slices a string by byte index, without accounting for multi-byte characters, can produce the results you describe. 'Hacking' the gem is really your only option.
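To illustrate the byte-versus-character difference, here is a minimal sketch. The helper name truncate_utf8 is hypothetical and is not taken from the gem in question. It assumes the input is valid UTF-8 and relies only on String#unpack and Array#pack, which are available in Ruby 1.8.7 without any native extensions.

```ruby
# Hypothetical helper (not part of the gem in question): truncate a UTF-8
# string to at most max_chars characters without splitting a multi-byte
# sequence. unpack('U*') decodes UTF-8 bytes into code points, and
# pack('U*') encodes the kept code points back into UTF-8.
def truncate_utf8(str, max_chars)
  str.unpack('U*').first(max_chars).pack('U*')
end

text = "H pääsee syvemmälle A elämään"

# Character-aware cut: keeps both bytes of the first "ä".
safe_cut = truncate_utf8(text, 4)

# Byte-oriented cut, which is effectively what text[0, 4] does on Ruby 1.8:
# it keeps only \303, the first byte of "ä".
byte_cut = text.unpack('C*').first(4).pack('C*')

puts safe_cut.unpack('C*').inspect
puts byte_cut.unpack('C*').inspect
```

The first line printed ends with the byte pair 195, 164 (octal \303\244), while the second ends with a lone 195, which matches the stray \303 from the question. A patch to the gem would need to replace its byte-index truncation with something along these lines.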

And upgrading to 1.9 or even 2.0 is your best way out.
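As a rough sketch of what changes after an upgrade: from Ruby 1.9 on, every string carries an encoding, and indexing a UTF-8 string counts characters rather than bytes. The snippet assumes Ruby 1.9.3 or later (String#byteslice was added in 1.9.3). One caveat on the C side: in 1.9+ rb_str_new produces a binary (ASCII-8BIT) string, so a string built that way would still need to be tagged as UTF-8, for example with force_encoding as shown here.

```ruby
# encoding: utf-8
text = "H pääsee syvemmälle A elämään"

# Character-based slicing: the result ends with a complete "ä".
char_cut = text[0, 4]

# Byte-based slicing is still available, but the result can be detected
# as broken because it ends in the middle of a multi-byte sequence.
byte_cut = text.byteslice(0, 4)

# Simulate a string that arrived untagged (binary), then tag it as UTF-8.
raw = text.dup.force_encoding("ASCII-8BIT")
tagged_cut = raw.force_encoding("UTF-8")[0, 4]

puts char_cut
puts byte_cut.valid_encoding?
puts tagged_cut == char_cut
```

With the string tagged as UTF-8, a gem that truncates with ordinary indexing would cut between characters, not between bytes.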

solution1
1 ACCEPTED 2013-05-13 06:39:24
