簡體   English   中英

如何使用UTF-8讀取NSInputStream?

[英]How to read a NSInputStream with UTF-8?

我嘗試使用NSInputStream在iOS中讀取大文件,以換行符分隔文件行(我不想使用componentsSeparatedByCharactersInSet因為它使用了太多內存)。

但是由於並非所有的行似乎都是UTF-8編碼的(因為它們可以像ASCII一樣出現,相同的字節),所以我經常會Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future. Incorrect NSStringEncoding value 0x0000 detected. Assuming NSASCIIStringEncoding. Will stop this compatiblity mapping behavior in the near future. 警告。

我的問題是:是否可以通過設置編譯器標志來消除此警告?

此外:將兩個緩沖區讀取追加/連接起來是否很容易,例如從字節流中讀取,然后將緩沖區轉換為字符串,然后追加字符串可能會使字符串損壞?

下面的示例方法演示了字節到字符串的轉換將把UTF-8字符的前一半和后一半視為無效。

- (void)NSInputStreamTest {
  uint8_t testString[] = {0xd0, 0x91}; // @"Б"

  // Test 1: Read max 1 byte at a time of UTF-8 string
  uint8_t buf1[1], buf2[1];
  NSString *s1, *s2, *s3;
  NSInteger c1, c2;
  NSInputStream *inStream = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream open];
  c1 = [inStream read:buf1 maxLength:1];
  s1 = [[NSString alloc] initWithBytes:buf1 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %d byte(s): %@", c1, s1);
  c2 = [inStream read:buf2 maxLength:1];
  s2 = [[NSString alloc] initWithBytes:buf2 length:1 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 1: Read %d byte(s): %@", c2, s2);
  s3 = [s1 stringByAppendingString:s2];
  NSLog(@"Test 1: Concatenated: %@", s3);
  [inStream close];

  // Test 2: Read max 2 bytes at a time of UTF-8 string
  uint8_t buf4[2];
  NSString *s4;
  NSInteger c4;
  NSInputStream *inStream2 = [[NSInputStream alloc] initWithData:[[NSData alloc] initWithBytes:testString length:2]];

  [inStream2 open];
  c4 = [inStream2 read:buf4 maxLength:2];
  s4 = [[NSString alloc] initWithBytes:buf4 length:2 encoding:NSUTF8StringEncoding];
  NSLog(@"Test 2: Read %d byte(s): %@", c4, s4);
  [inStream2 close];
}

輸出:

2013-02-10 21:16:23.412 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Read 1 byte(s): (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 1: Concatenated: (null)
2013-02-10 21:16:23.413 Test[11144:c07] Test 2: Read 2 byte(s): Б

首先,在一行中: s3 = [s1 stringByAppendingString:s2]; 您正在嘗試將值連接為“ nil”。 結果也將為“ nil”。 因此,您可能需要串聯字節而不是字符串:

uint8_t buf3[2];
buf3[0] = buf1[0];
buf3[1] = buf2[0];
s3 = [[NSString alloc] initWithBytes:buf3 length:2 encoding:NSUTF8StringEncoding];

輸出:

2015-11-06 12:57:40.304 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Read 1 byte(s): (null)
2015-11-06 12:57:40.305 Test[10803:883182] Test 1: Concatenated: Б

其次,UTF-8字符的長度可以位於[1..6]字節中。

(1 byte)   0aaa aaaa         //if symbol lays in 0x00 .. 0x7F (ASCII)
(2 bytes)  110x xxxx 10xx xxxx
(3 bytes)  1110 xxxx 10xx xxxx 10xx xxxx
(4 bytes)  1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
(5 bytes)  1111 10xx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx
(6 bytes)  1111 110x 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx 10xx xxxx

因此,如果打算從NSInputStream讀取原始字節,然后將其轉換為UTF-8 NSString,則可能要從NSInputStream逐字節讀取,直到獲得有效的字符串為止:

#define MAX_UTF8_BYTES 6
NSString *utf8String;
NSMutableData *_data = [[NSMutableData alloc] init]; //for easy 'appending' bytes

int bytes_read = 0;
while (!utf8String) {
    if (bytes_read > MAX_UTF8_BYTES) {
        NSLog(@"Can't decode input byte array into UTF8.");
        return;
    }
    else {
        uint8_t byte[1];
        [_inputStream read:byte maxLength:1];
        [_data appendBytes:byte length:1];
        utf8String = [NSString stringWithUTF8String:[_data bytes]];
        bytes_read++;
    }
}

ASCII(以及換行符)是UTF-8的子集,因此不應有任何沖突。

就像在簡單的ASCII流中一樣,應該可以按換行符分隔流。 然后,您可以使用UTF-8將每個塊(“行”)轉換為NSString

您確定編碼錯誤不是真實的,即,相對於UTF-8編碼,您的流實際上可能包含錯誤的字符嗎?

編輯后從評論中添加:

這假定行由足夠少的字符組成,以便在從UTF-8轉換之前將整行保留在內存中。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM