How to read an NSInputStream with UTF-8?

I am trying to read a large file on iOS using NSInputStream and split it into lines at newlines (I don't want to use componentsSeparatedByCharactersInSet: because it uses too much memory).

But as not all lines seem to be UTF-8 encoded (they can look just like ASCII, with the same bytes), I often get the warning "Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future."

My question is: is there a way to suppress this warning, e.g., by setting a compiler flag?

Furthermore: is it safe to append/concatenate two buffer reads? Reading from the byte stream, converting each buffer to a string, and then appending the strings could corrupt the result.

Below is an example method demonstrating that the byte-to-string conversion discards both the first and the second half of a two-byte UTF-8 character as invalid.

- (void)NSInputStreamTest {
  uint8_t testString[] = {0xd0, 0x91}; // @"Б"

  // Test 1: Read max 1 byte at a time of UTF-8 string
  uint8_t buf1[1], buf2[1];
  NSString *s1, *s2, *s3;
  NSInteger c1, c2;
  NSInputStream *inStream = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream open];
  c1 = [inStream read:buf1 maxLength:1];
  s1 = [[NSString alloc] initWithBytes:buf1 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %ld byte(s): %@", (long)c1, s1); // NSInteger needs %ld and a cast
  c2 = [inStream read:buf2 maxLength:1];
  s2 = [[NSString alloc] initWithBytes:buf2 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %ld byte(s): %@", (long)c2, s2);
  s3 = [s1 stringByAppendingString:s2];
  NSLog(@"Test 1: Concatenated: %@", s3);
  [inStream close];

  // Test 2: Read max 2 bytes at a time of UTF-8 string
  uint8_t buf4[2];
  NSString *s4;
  NSInteger c4;
  NSInputStream *inStream2 = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream2 open];
  c4 = [inStream2 read:buf4 maxLength:2];
  s4 = [[NSString alloc] initWithBytes:buf4 length:2 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 2: Read %ld byte(s): %@", (long)c4, s4);
  [inStream2 close];
}

Output:

2013-02-10 21:16:23.412 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Concatenated: (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 2: Read 2 byte(s): Б

First of all, in the line s3 = [s1 stringByAppendingString:s2]; you are trying to concatenate two nil values, so the result is nil as well. You may want to concatenate the bytes instead of the strings:

uint8_t buf3[2];
buf3[0] = buf1[0];
buf3[1] = buf2[0];
s3 = [[NSString alloc] initWithBytes:buf3 length:2 encoding:NSUTF8StringEncoding];

Output:

2015-11-06 12:57:40.304 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Concatenated: Б

Second, a UTF-8 character occupies a variable number of bytes: 1 to 6 in the original design, although RFC 3629 restricts valid UTF-8 to at most 4 bytes.

(1 byte)   0aaa aaaa         // if the symbol lies in 0x00 .. 0x7F (ASCII)
(2 bytes)  110x xxxx 10xx xxxx
(3 bytes)  1110 xxxx 10xx xxxx 10xx xxxx
(4 bytes)  1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
(5 bytes)  1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
(6 bytes)  1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
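
The bit patterns above can be turned into a small lookup on the first byte of a sequence. The following is only a sketch in plain C (which can be called from Objective-C as is); the function name utf8_sequence_length is made up for illustration. It assumes the 4-byte limit of RFC 3629, so the 5- and 6-byte forms are reported as invalid, and it only inspects the bit pattern, so it does not reject overlong lead bytes such as 0xC0.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   with `lead`. Returns 0 for a continuation byte (10xx xxxx) and for the
   obsolete 5- and 6-byte lead bytes. */
int utf8_sequence_length(uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1; /* 0aaa aaaa: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110x xxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110 xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 1111 0xxx */
    return 0;                            /* continuation byte or 5/6-byte form */
}
```

For the test string from the question, the first byte 0xd0 yields 2 and the second byte 0x91 yields 0, which is why neither byte can be decoded on its own.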

So, if you intend to read raw bytes from NSInputStream and then translate them into a UTF-8 NSString, you probably want to read byte by byte from NSInputStream until you get a valid string:

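#define MAX_UTF8_BYTES 6
NSString *utf8String = nil;
NSMutableData *_data = [[NSMutableData alloc] init]; // for easy 'appending' of bytes

int bytes_read = 0;
while (!utf8String) {
    if (bytes_read >= MAX_UTF8_BYTES) {
        NSLog(@"Can't decode input byte array into UTF8.");
        return;
    }
    uint8_t byte[1];
    // read:maxLength: returns 0 at end of stream and a negative value on error
    if ([_inputStream read:byte maxLength:1] != 1) {
        NSLog(@"Stream ended or failed in the middle of a character.");
        return;
    }
    [_data appendBytes:byte length:1];
    // [_data bytes] is not NUL-terminated, so decode with an explicit length;
    // the result is nil while the bytes do not yet form valid UTF-8
    utf8String = [[NSString alloc] initWithData:_data encoding:NSUTF8StringEncoding];
    bytes_read++;
}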
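
Reading one byte at a time works but can be slow for a large file. One possible alternative is to read bigger buffers and decode only the part that ends on a character boundary, carrying the leftover bytes into the next read. The sketch below is plain C, and the name utf8_complete_prefix_length is hypothetical. It assumes sequences of at most 4 bytes (RFC 3629) and does not validate the data: malformed input is passed through unchanged so that the decoder can report it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes at the start of buf[0..len) can be
   decoded without cutting a multi-byte character in half. The remaining
   len - result bytes are an incomplete character to prepend to the next read. */
size_t utf8_complete_prefix_length(const uint8_t *buf, size_t len) {
    size_t back;
    /* An incomplete character has at most 3 bytes, so its lead byte is
       one of the last 3 bytes of the buffer. */
    for (back = 1; back <= 3 && back <= len; back++) {
        uint8_t b = buf[len - back];
        size_t need;
        if ((b & 0xC0) == 0x80) continue;      /* continuation byte: keep looking */
        if ((b & 0x80) == 0x00) need = 1;      /* ASCII */
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return len;                       /* invalid lead: let the decoder fail */
        /* `back` bytes of this character are present; is that enough? */
        return (need > back) ? len - back : len;
    }
    return len; /* empty buffer, or only continuation bytes at the end */
}
```

In Objective-C, the first n bytes returned by this function could then be passed to initWithBytes:length:encoding: with NSUTF8StringEncoding, and the remaining bytes copied to the front of the buffer before the next read:maxLength: call.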

ASCII (and hence the newline character) is a subset of UTF-8, so there should not be any conflict.

It should be possible to divide your stream at the newline characters, as you would in a simple ASCII stream. Then you can convert each chunk ("line") into an NSString using UTF-8.
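
Splitting on raw bytes is safe because every byte of a multi-byte UTF-8 sequence has its high bit set, so the byte 0x0A can only ever be a real newline. A minimal sketch of this idea in plain C follows; the names emit_complete_lines, line_handler, line_stats and record_line are all invented for illustration, and the example handler only counts lines where real code would decode each line.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callback type: receives one line without its trailing '\n'. */
typedef void (*line_handler)(const uint8_t *line, size_t len, void *context);

/* Calls `handler` for every complete line in buf[0..len) and returns the
   number of bytes consumed. Bytes after the last '\n' are not consumed;
   the caller keeps them and prepends them to the next chunk. */
size_t emit_complete_lines(const uint8_t *buf, size_t len,
                           line_handler handler, void *context) {
    size_t start = 0;
    while (start < len) {
        const uint8_t *nl = memchr(buf + start, '\n', len - start);
        if (nl == NULL) break;                 /* partial line: wait for more data */
        size_t line_len = (size_t)(nl - (buf + start));
        handler(buf + start, line_len, context);
        start += line_len + 1;                 /* skip the line and its '\n' */
    }
    return start;
}

/* Example handler: counts lines and bytes instead of decoding them. */
struct line_stats { size_t lines; size_t bytes; };

void record_line(const uint8_t *line, size_t len, void *context) {
    struct line_stats *stats = context;
    (void)line;                                /* contents unused in this example */
    stats->lines += 1;
    stats->bytes += len;
}
```

In an Objective-C caller, the handler would typically build an NSString from each line with initWithBytes:length:encoding: and NSUTF8StringEncoding. Since a line always ends at a newline, it never ends in the middle of a character.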

Are you sure the encoding errors are not real, i.e., that your stream does not actually contain bytes that are invalid in UTF-8?

Edited to add from the comments:

This presumes that each line is short enough to be held in memory as a whole before it is converted from UTF-8.
