
How can I read a large UTF-8 file on an iPhone?

My app downloads a file in UTF-8 format, which is too large to read using the NSString initWithContentsOfFile method. The problem I have is that the NSFileHandle readDataOfLength method reads a specified number of bytes, so I may end up reading only part of a UTF-8 character. What is the best solution here?

LATER:

Let it be recorded in the ship's log that the following code works:

    NSData *buf = [NSData dataWithContentsOfFile:path
                                      options:NSDataReadingMappedIfSafe
                                        error:nil];

    NSString *data = [[[NSString alloc]
                       initWithBytesNoCopy:(void *)buf.bytes
                       length:buf.length
                       encoding:NSUTF8StringEncoding
                       freeWhenDone:NO] autorelease];

My main problem was actually to do with the encoding, not the task of reading the file.

You can use NSData +dataWithContentsOfFile:options:error: with the NSDataReadingMappedIfSafe option to map your file into memory rather than loading it. That uses the virtual memory manager in iOS to page parts of the file in and out of RAM on demand, in the same way that a desktop OS handles its on-disk virtual memory file. So you don't need enough RAM to hold the entire file at once; the file just needs to be small enough to fit in the processor's address space (so, gigabytes). You'll get an object that acts exactly like a normal NSData, which should save you most of the hassle of using an NSFileHandle and streaming manually.

You'll probably then need to convert portions to NSString, since you can realistically expect NSString to convert from UTF-8 to another internal format. (It might not; it's worth having a go with -initWithData:encoding: and seeing whether NSString is smart enough just to keep a reference to the original data and expand from UTF-8 on demand.) I think that conversion is what your question is really getting at.

I'd suggest you use -initWithBytes:length:encoding: to convert a reasonable number of bytes to a string. You can then use -lengthOfBytesUsingEncoding: to find out how many bytes it actually made sense of and advance your read pointer accordingly. This relies on the assumption that NSString discards any partial character at the end of the bytes you provide. That is worth verifying: if the initializer instead returns nil for a truncated sequence, you will need to trim each chunk to a character boundary yourself.

EDIT: so, something like:

// map the file, rather than loading it
NSData *data = [NSData dataWithContentsOfFile:...whatever...
                         options:NSDataReadingMappedIfSafe
                         error:&youdDoSomethingSafeHere];

// we'll maintain a read pointer to our current location in the data
NSUInteger readPointer = 0;

// continue while data remains
while(readPointer < [data length])
{
    // work out how many bytes are remaining
    NSUInteger distanceToEndOfData = [data length] - readPointer;

    // grab at most 16kb of them, being careful not to read too many
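    NSString *newPortion = 
         [[NSString alloc] initWithBytes:(const uint8_t *)[data bytes] + readPointer
                 length:distanceToEndOfData > 16384 ? 16384 : distanceToEndOfData
                 encoding:NSUTF8StringEncoding];

    // stop if nothing could be decoded (for example, if NSString returned nil
    // for this chunk); otherwise readPointer would never advance
    if([newPortion length] == 0) { [newPortion release]; break; }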

    // do whatever we want with the string
    [self doSomethingWithFragment:newPortion];

    // advance our read pointer by the number of bytes actually read, and
    // clean up
    readPointer += [newPortion lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
    [newPortion release];
}

Of course, an implicit assumption is that all UTF-8 encodings are unique, which I have to admit I'm not knowledgeable enough to say for certain.

It's actually really easy to tell whether you have split a multibyte character in UTF-8. Continuation octets all have their two most significant bits set like this: 10xxxxxx. So if the last octet of the buffer has that pattern, scan backwards to find an octet that does not have that form. This is the first octet of the character. (If the last octet of the buffer is itself a first octet of the form 11xxxxxx, the character is split as well.) The position of the most significant 0 in the first octet tells you how many octets are in the character:

0xxxxxxx => 1 octet (ASCII)
110xxxxx => 2 octets
1110xxxx => 3 octets
11110xxx => 4 octets

The original UTF-8 design continued this pattern up to 6 octets, but RFC 3629 limits valid UTF-8 to at most 4.

So it's fairly trivial to figure out how many extra octets to read to get to a character boundary.
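The backwards scan described above can be sketched in plain C, which also compiles inside an Objective-C file. This is only a minimal sketch: utf8_complete_prefix is a hypothetical helper name, not an Apple API, and it assumes the data is otherwise well-formed UTF-8 of at most 4 octets per character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of any Apple API).
 * Returns how many leading bytes of buf[0..len) end on a character
 * boundary; the 0 to 3 bytes after that begin a split character.
 * Assumes otherwise well-formed UTF-8; it does not validate the input. */
size_t utf8_complete_prefix(const uint8_t *buf, size_t len)
{
    size_t start = len;
    size_t need;
    uint8_t lead;

    assert(buf != NULL || len == 0);

    /* Scan backwards over continuation octets (10xxxxxx), at most 3. */
    while (start > 0 && len - start < 3 && (buf[start - 1] & 0xC0) == 0x80)
        start--;
    if (start == 0)
        return len;            /* empty input, or no lead octet to inspect */

    start--;                   /* buf[start] is the candidate lead octet */
    lead = buf[start];
    if ((lead & 0x80) == 0x00)      need = 1;   /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) need = 2;   /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 3;   /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 4;   /* 11110xxx */
    else return len;           /* malformed: leave it to the decoder */

    /* If the last character has all its octets, the whole buffer is usable;
     * otherwise stop just before its lead octet. */
    return (len - start >= need) ? len : start;
}
```

The returned count could then be used as the length handed to -initWithBytes:length:encoding: instead of a fixed chunk size, so that each chunk ends on a character boundary.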

One approach would be to

  1. read up to a certain point
  2. then examine the last byte(s) to determine whether the read split a UTF-8 character
  3. if not, read the next chunk
  4. if so, read the remaining bytes of that character to complete it, then read the next chunk

UTF-8 is self-synchronizing: just read a little more or less as needed, then inspect the byte values to determine the boundaries of any code point.

Also, you could use fopen with a small, manageable buffer on the stack for this, and memory will not be an issue.
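A rough sketch of that fopen-style approach, under some assumptions: the names read_utf8_chunks, whole_chars, utf8_chunk_fn, struct tally and tally_chunk are all made up for illustration, the 4096-byte buffer size is arbitrary, and the input is assumed to be otherwise well-formed UTF-8. Instead of reading extra bytes to finish a split character, this variant holds the incomplete tail back and prepends it to the next read, which amounts to the same thing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { MAX_CHUNK = 4096 };     /* size of the stack buffer (arbitrary) */

/* Hypothetical callback type: receives a run of whole UTF-8 characters. */
typedef void (*utf8_chunk_fn)(const unsigned char *bytes, size_t len, void *ctx);

/* Count of leading bytes of buf that end on a character boundary. */
size_t whole_chars(const unsigned char *buf, size_t len)
{
    size_t start = len, need;
    while (start > 0 && len - start < 3 && (buf[start - 1] & 0xC0) == 0x80)
        start--;                              /* skip continuation bytes */
    if (start == 0) return len;               /* nothing to inspect */
    start--;                                  /* candidate lead byte */
    if ((buf[start] & 0x80) == 0x00)      need = 1;
    else if ((buf[start] & 0xE0) == 0xC0) need = 2;
    else if ((buf[start] & 0xF0) == 0xE0) need = 3;
    else if ((buf[start] & 0xF8) == 0xF0) need = 4;
    else return len;                          /* malformed: pass through */
    return (len - start >= need) ? len : start;
}

/* Reads fp in pieces of at most `chunk` bytes and calls fn with pieces that
 * never end mid-character. Returns 0 on success, -1 on a read error. */
int read_utf8_chunks(FILE *fp, size_t chunk, utf8_chunk_fn fn, void *ctx)
{
    unsigned char buf[MAX_CHUNK];
    size_t held = 0;           /* bytes of a split character carried over */

    assert(fp != NULL && fn != NULL && chunk >= 4 && chunk <= MAX_CHUNK);
    for (;;) {
        size_t got = fread(buf + held, 1, chunk - held, fp);
        size_t total = held + got;
        size_t whole;

        if (total == 0)
            break;
        /* At end of file, flush whatever is left, complete or not. */
        whole = (got == 0) ? total : whole_chars(buf, total);
        if (whole > 0)
            fn(buf, whole, ctx);
        held = total - whole;
        memmove(buf, buf + whole, held);      /* keep the incomplete tail */
        if (got == 0)
            break;
    }
    return ferror(fp) ? -1 : 0;
}

/* Example callback: tallies chunks and bytes, and counts any chunk that
 * ended in the middle of a character. */
struct tally { size_t chunks, bytes, split; };

void tally_chunk(const unsigned char *bytes, size_t len, void *ctx)
{
    struct tally *t = ctx;
    t->chunks++;
    t->bytes += len;
    if (whole_chars(bytes, len) != len)
        t->split++;
}
```

A caller would open the downloaded file with fopen(path, "rb"), pass the FILE pointer in along with a callback that turns each piece into a string, and fclose it afterwards. Memory use stays at one small buffer no matter how large the file is.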
