
Parsing a binary MOBI file: best approach?

The file contains metadata in between binary data. I'm able to parse the first line with the title Agent_of_Chang2e, but I also need the metadata at the bottom of the header. I know there is no official specification for the format.


The code below isn't able to decode the bottom lines. For example, I get the following garbled text:

 FÃHANGE</b1èrX)¯ÌiadenÕniverse<sup><smalÀ|®¿8</¡Îovelÿ·?=SharonÌeeándÓteveÍiller8PblockquoteßßÚ>TIa÷orkyfiction.Áll@eãacÐ0hðortrayedén{n)áreïrzus0¢°usly.Ôhatíean0authhmxétlõp.7N_\\ 

©ß© 1988âyÓOOKãsòeserved.0ðart)publicaZmayâehproduc

    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    char buffer[1024];
    FILE* file = fopen([path UTF8String], "r");
    if (file != 0)
    {
        while(fgets(buffer, 1024, file) != NULL)
        {
            NSString* string = [[NSString alloc] initWithCString: buffer encoding:NSASCIIStringEncoding];
            NSLog(@"%@",string);
            [string release];
        }
        fclose(file);
    }
    [pool drain];

Use NSTask or system() to pass the file through the strings utility and parse the output of that:

strings /bin/bash | more
...
...
677778899999999999999999999999999999:::;;<<====>>>>>>>>>>>????????   
@(#)PROGRAM:bash  PROJECT:bash-92
...
...
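If spawning a separate process is not convenient, the same idea can be approximated in-process. The following is a minimal C sketch, not the `strings` utility itself: it scans a byte buffer for runs of printable ASCII characters. The function name `next_printable_run` and the minimum run length are hypothetical choices made for illustration.

```c
#include <stddef.h>

/* Find the next run of at least min_len printable ASCII bytes at or after *pos.
 * On success, sets *pos to the start of the run and returns its length.
 * Returns 0 (and sets *pos to size) when no further run exists.
 * Hypothetical helper, loosely modelled on what `strings` reports. */
size_t next_printable_run(const unsigned char *data, size_t size,
                          size_t *pos, size_t min_len)
{
    size_t i = *pos;
    if (min_len == 0)
        min_len = 1;                 /* an empty run is never a match */
    while (i < size) {
        size_t start = i;
        while (i < size && data[i] >= 0x20 && data[i] <= 0x7E)
            i++;                     /* extend the printable run */
        if (i - start >= min_len) {
            *pos = start;
            return i - start;
        }
        i++;                         /* skip the byte that ended the run */
    }
    *pos = size;
    return 0;
}
```

To walk all runs, call it in a loop and advance `*pos` by the returned length after each hit. Like `strings`, this only surfaces text that is stored uncompressed, so it may find titles and header fields but not compressed book text.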

nielsbot already posted a link to the format specification.

As you can read there, the file is not a text file but a binary format. Parsing it with NSString instances is not a good idea.

You have to read the file as binary data, i.e. using NSData:

NSData *content = [NSData dataWithContentsOfFile:path];

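Then you have to extract the relevant information yourself. For example, if you want to read the uncompressed text length, you will find in the linked document that this field starts at offset 4 of the PalmDOC header and is 4 bytes long. That header sits at the start of record 0, after the Palm Database header, so the offset is relative to record 0 rather than to the start of the file.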

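uint32_t uncompressedTextLength; // 4 bytes are 32 bits.
// headerStart: file offset of record 0 (the PalmDOC header), assumed to be known already.
[content getBytes:&uncompressedTextLength range:NSMakeRange(headerStart + 4, 4)];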
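To make the offsets concrete, here is a minimal C sketch of the lookup, assuming the layout described in the linked specification: a 78-byte Palm Database header with a 2-byte record count at offset 76, followed by 8-byte record info entries whose first 4 bytes give each record's file offset. The function name `mobi_text_length` and the macro names are hypothetical. The bytes could come from `[content bytes]` and `[content length]`.

```c
#include <stddef.h>
#include <stdint.h>

#define PDB_HEADER_SIZE   78u  /* fixed-size Palm Database header */
#define PDB_NUM_RECORDS   76u  /* offset of the 2-byte record count */
#define PALMDOC_TEXT_LEN   4u  /* offset of the text length inside record 0 */

/* Big-endian readers; the format stores integers most significant byte first. */
static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Stores the uncompressed text length in *out and returns 0,
 * or returns -1 if the buffer is too short or contains no records. */
int mobi_text_length(const unsigned char *data, size_t size, uint32_t *out)
{
    if (size < PDB_HEADER_SIZE + 8u)
        return -1;                               /* header + first record entry */
    if (be16(data + PDB_NUM_RECORDS) == 0)
        return -1;                               /* no record 0 to read */
    uint32_t rec0 = be32(data + PDB_HEADER_SIZE); /* file offset of record 0 */
    if (rec0 > size || size - rec0 < 8u)
        return -1;                               /* PalmDOC field out of range */
    *out = be32(data + rec0 + PALMDOC_TEXT_LEN);
    return 0;
}
```

Reading the bytes one at a time, rather than copying them into an integer, avoids both alignment and byte-order problems on the host.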

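You also have to deal with endianness: the Palm database format stores multi-byte integers in big-endian order, so on a little-endian host the value must be byte-swapped, for example with CFSwapInt32BigToHost.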
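As an illustration of the byte-order handling, a portable way to decode big-endian fields is to assemble them from individual bytes. This is a minimal C sketch; the names `read_be16` and `read_be32` are hypothetical helpers, not part of any Apple API.

```c
#include <stdint.h>

/* Decode a 16-bit big-endian unsigned integer from the 2 bytes at p. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | (unsigned)p[1]);
}

/* Decode a 32-bit big-endian unsigned integer from the 4 bytes at p. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

Because the result is built with shifts, the same code gives the right value on both little-endian and big-endian hosts, and no separate swap step is needed.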

First, I am pretty sure the text will be UTF-8 or UTF-16 encoded.

Second, you cannot just take an arbitrary 1024 bytes and expect them to work as text. What about byte order (big endian vs little endian)?
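For comparison with the `fgets` loop in the question, here is a minimal C sketch of reading the file as raw bytes instead of as lines of text. The function name `read_all` is hypothetical, and it assumes the stream was opened in binary mode, e.g. `fopen(path, "rb")`. Unlike `fgets`, `fread` does not stop at newlines, and the explicit length means embedded zero bytes do not cut the data short.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read the rest of stream f into a heap buffer.
 * On success returns the buffer (caller frees it) and stores its length
 * in *out_size; returns NULL on allocation or read failure. */
unsigned char *read_all(FILE *f, size_t *out_size)
{
    size_t cap = 4096, len = 0;
    unsigned char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        size_t n = fread(buf + len, 1, cap - len, f);
        len += n;
        if (n == 0)
            break;                        /* end of file or read error */
        if (len == cap) {                 /* buffer full: double it */
            unsigned char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(f)) {
        free(buf);
        return NULL;
    }
    *out_size = len;
    return buf;
}
```

The resulting buffer can then be decoded field by field at the offsets given in the format specification, rather than being handed to NSString as a whole.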
