How to get the number of characters in a file (not bytes) in C on Linux

I would like to get the number of characters in a file. By characters I mean "real" characters, not bytes, assuming I know the file encoding.

I tried to use mbstowcs(), but it doesn't work for this because it uses the system locale (or the one set with setlocale()). Because setlocale() is not thread-safe, I don't think it's a good idea to call it before mbstowcs(). Even if it were thread-safe, I would have to be sure that my program won't "jump" (signal, etc.) between the two calls to setlocale() (one call to set it to the encoding of the file, and one call to revert to the previous one).

So, to take an example, imagine we have a file ru.txt encoded using a Russian encoding (KOI8, for example). I would like to open the file and get the number of characters, assuming the encoding of the file is KOI8.

It would be so easy if mbstowcs() could take a source_encoding argument...

EDIT: Another problem with mbstowcs() is that the locale corresponding to the encoding of the file has to be installed on the system...

I'd suggest using iconv(3):

NAME
   iconv - perform character set conversion

SYNOPSIS
   #include <iconv.h>

   size_t iconv(iconv_t cd,
                char **inbuf, size_t *inbytesleft,
                char **outbuf, size_t *outbytesleft);

and convert to UTF-32. You get 4 bytes of output for every character converted (plus 4 for the BOM if the target is plain "UTF-32"; naming the byte order explicitly, as in "UTF-32LE", normally suppresses the BOM). It should be possible to convert the input piece by piece using a fixed-size outbuf, if one chooses outbytesleft carefully (i.e. 4 * inbytesleft + 4 :-).
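A minimal sketch of that approach, assuming a glibc-style iconv where "UTF-32LE" is a valid target name. The helper name count_chars_iconv and its in-memory interface are hypothetical, chosen only for illustration. It converts through a small fixed-size output buffer and divides the bytes produced by 4, so it counts code points rather than user-perceived characters.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: counts the code points in buf[0..len), which is
 * assumed to be encoded in from_enc (an iconv name such as "KOI8-R").
 * Returns the count, or -1 if the encoding is unknown or the input
 * contains an invalid or truncated sequence. */
long count_chars_iconv(const char *buf, size_t len, const char *from_enc)
{
    /* Naming the byte order explicitly avoids a BOM in the output. */
    iconv_t cd = iconv_open("UTF-32LE", from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)buf;  /* iconv() wants char **, it only reads the input */
    size_t inleft = len;
    char out[4096];          /* scratch space; size is a multiple of 4 */
    long count = 0;

    while (inleft > 0) {
        char *outp = out;
        size_t outleft = sizeof out;
        size_t rc = iconv(cd, &in, &inleft, &outp, &outleft);

        /* Every converted code point occupies exactly 4 output bytes. */
        count += (long)((sizeof out - outleft) / 4);

        /* E2BIG only means the scratch buffer filled up: loop again.
         * EILSEQ and EINVAL mean bad or truncated input. */
        if (rc == (size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            return -1;
        }
    }

    iconv_close(cd);
    return count;
}
```

For the ru.txt example, one would read the file into memory and call count_chars_iconv(data, size, "KOI8-R"). When feeding a file chunk by chunk instead, EINVAL at the end of a chunk may simply mean a multibyte sequence was split, so the unread tail would have to be carried over to the next call; this sketch does not do that.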

To count the UTF-8 characters in a file, just pass its content to this function:

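#include <string>

// Counts UTF-8 code points (C++); assumes S holds valid UTF-8.
int CalcUTF8Chars( const std::string& S )
{
    int Count = 0;

    for ( size_t i = 0; i != S.length(); i++ )
    {
        // Continuation bytes have the form 10xxxxxx; any other byte starts a character.
        if ( ( S[i] & 0xC0 ) != 0x80 ) { Count++; }
    }

    return Count;
}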

No external dependencies.
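Since the question asks for C, the same idea can be sketched in plain C, reading the file in chunks instead of loading it into a string. The function name count_utf8_chars is hypothetical, and the sketch assumes the stream holds valid UTF-8; it does not validate the input.

```c
#include <stdio.h>

/* Hypothetical helper: counts UTF-8 code points in a stream that is
 * assumed to contain valid UTF-8. Returns -1 on a read error. */
long count_utf8_chars(FILE *fp)
{
    unsigned char buf[4096];
    size_t n;
    long count = 0;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < n; i++) {
            /* Continuation bytes look like 10xxxxxx; every other byte
             * starts a new code point. */
            if ((buf[i] & 0xC0) != 0x80)
                count++;
        }
    }

    return ferror(fp) ? -1 : count;
}
```

Because each byte is classified on its own, a multibyte sequence split across two chunks is still counted once.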

Update:

In case you want to handle other encodings, you have two choices:

  1. Use a third-party library that can handle it, for example ICU: http://site.icu-project.org/

  2. Write the counting functions yourself for every encoding you want to use.
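As a rough illustration of the second choice, here is a sketch of a hand-written counter that dispatches on the encoding. The enum, its members, and the function name count_chars are all hypothetical; the sketch assumes well-formed input and covers only three cases: single-byte encodings such as KOI8-R, UTF-8, and little-endian UTF-16.

```c
#include <stddef.h>

/* Hypothetical set of encodings this sketch knows about. */
enum text_encoding { ENC_SINGLE_BYTE, ENC_UTF8, ENC_UTF16LE };

/* Counts code points in buf[0..len), assuming the input is well-formed
 * in the given encoding. Returns -1 for an unsupported encoding. */
long count_chars(const unsigned char *buf, size_t len, enum text_encoding enc)
{
    long count = 0;

    switch (enc) {
    case ENC_SINGLE_BYTE:
        /* KOI8-R, ISO-8859-x, ...: one byte per character. */
        return (long)len;

    case ENC_UTF8:
        /* Count every byte that is not a 10xxxxxx continuation byte. */
        for (size_t i = 0; i < len; i++)
            if ((buf[i] & 0xC0) != 0x80)
                count++;
        return count;

    case ENC_UTF16LE:
        /* Count every 16-bit unit except low surrogates (0xDC00-0xDFFF),
         * which only complete a pair opened by a high surrogate. */
        for (size_t i = 0; i + 1 < len; i += 2) {
            unsigned unit = buf[i] | ((unsigned)buf[i + 1] << 8);
            if (unit < 0xDC00 || unit > 0xDFFF)
                count++;
        }
        return count;
    }

    return -1;
}
```

Each new encoding needs its own branch and its own rules, which is the main cost of this route compared with a conversion library.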
