Unable to read a UNICODE text file in C

Question

(I looked at previous posts and tried what they suggested, but to no avail.)

I'm attempting to read in a file containing only Japanese characters. Here is what that file looks like:

わたしわエドワドオ'ハゲンです。これはなんですか？ ゲ。ゲ？ゲ？ゲ？

When I attempt to read it, nothing is displayed in the console, and when debugging, the read buffer contains only garbage. Here is the function I am using to read the file:

wchar_t* ReadTextFileW(wchar_t* filePath, size_t numBytesToRead, size_t maxBufferSize, const wchar_t* mode, int seekOffset, int seekOrigin)
{
    size_t numItems = 0;
    size_t bufferSize = 0;
    wchar_t* buffer = NULL;
    FILE* file = NULL;

    //Ensure the filePath does NOT lead to a device.
    if (IsPathADevice(filePath) == false)
    {
        //0 indicates to read as much as possible (the max specified).
        if (numBytesToRead == 0)
        {
            numBytesToRead = maxBufferSize;
        }

        if (filePath != NULL && mode != NULL)
        {
            //Ensure there are no errors in opening the file.
            if (_wfopen_s(&file, filePath, mode) == 0)
            {
                //Set the cursor location (back to the beginning of the file by default).
                if (fseek(file, seekOffset, seekOrigin) != 0)
                {
                    //Error: Could not change file cursor position.
                    fclose(file);
                    return NULL;
                }

                //Calculate the size of the buffer in bytes.
                bufferSize = numBytesToRead * sizeof(wchar_t);

                //Create the buffer to store file data in.
                buffer = (wchar_t*)_aligned_malloc(bufferSize, BYTE_ALIGNMENT);

                //Ensure the buffer was allocated.
                if (buffer == NULL)
                {
                    //Error: Buffer could not be allocated.
                    fclose(file);
                    return NULL;
                }

                //Clear any garbage data in the buffer.
                memset(buffer, 0, bufferSize);

                //Read the data from the file.
                numItems = fread_s(buffer, bufferSize, sizeof(wchar_t), numBytesToRead, file);

                //Check for read errors.
                if (numItems <= 0)
                {
                    //Error: File could not be read.
                    fclose(file);
                    _aligned_free(buffer);
                    return NULL;
                }

                //Ensure the file is closed without errors.
                if (fclose(file) != 0)
                {
                    //Error: File did not close properly.
                    _aligned_free(buffer);
                    return NULL;
                }

            }
        }
    }

    return buffer;
}

To call this function, I am doing the following. Perhaps I'm not using setlocale() correctly, but from what I have read it seems that I am. To reiterate, the problem is that garbage seems to be read in and nothing is displayed in the console:

    setlocale(LC_ALL, "jp");
    wchar_t* retVal = ReadTextFileW(L"C:\\jap.txt");
    printf("%S\n", retVal);
    _aligned_free(retVal);

I also have the following defined at the top of my .cpp file:

#define UNICODE
#define _UNICODE

SOLVED:

To fix this, as ryyker mentioned, you need to know the encoding that was used to create the original file. In Notepad and Notepad++ there is a drop-down menu for the encoding. The default (and the most commonly used) is UTF-8.

Once you know the encoding, you can change the read mode passed to _wfopen_s() to the following:

wchar_t* retVal = ReadWide::ReadTextFileW(L"C:\\jap.txt", 0, 1024, L"r, ccs=UTF-8");
MessageBoxW(NULL, retVal, NULL, 0);
_aligned_free(retVal);

To display the text I used a message box, since the console did not print the Japanese characters for me.

Answer 1

This is an excerpt on encodings for the Japanese language, from the documentation of Notepad++ (which the OP stated in the comments was used to create the file):

Double Byte encodings, also called, by usage, Double Byte Character Set (DBCS)

Some of them predate Unicode, and were designed to encode character sets with a large number of characters, mainly found in Far East languages with ideographic or syllabic scripts:
 The 2 Bytes Universal Character Set: UCS-2 Big Endian and UCS-2 Little Endian
 The Japanese Code Page: Shift-JIS (Windows-932)
 The Chinese Code Pages: Simplified Chinese GB2312 (Windows-936), Traditional Chinese Big5 (Windows-950)
 The Korean Code Pages: Windows 949, EUC-KR

It would appear that Shift-JIS might be the encoding you are trying to read. From here:

Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft...

In general, you need to determine the encoding used to create the multi-byte characters in a file before they can be read back correctly by a function in C, or in any other language. This link may help.
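One cheap first step in determining the encoding is to check whether the file starts with a byte-order mark (BOM). The following is a minimal sketch, not a complete detector: the names `TextEncoding` and `DetectBom` are made up for illustration, and a file without a BOM (for example Shift-JIS, or UTF-8 saved without a BOM) is simply reported as unknown. It also does not distinguish UTF-32 LE, whose BOM begins with the same two bytes as UTF-16 LE.

```c
#include <stddef.h>

/* Encodings this sketch can recognise from a byte-order mark (BOM).
 * The type and function names are hypothetical, chosen for illustration. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM found: could be UTF-8 without BOM, Shift-JIS, ... */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE
} TextEncoding;

/* Inspect the first bytes of a file and report the encoding its BOM implies. */
TextEncoding DetectBom(const unsigned char *data, size_t size)
{
    /* UTF-8 BOM: EF BB BF */
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF8;

    /* UTF-16 little-endian BOM: FF FE */
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UTF16_LE;

    /* UTF-16 big-endian BOM: FE FF */
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UTF16_BE;

    return ENC_UNKNOWN;
}
```

To use it, read the first few bytes of the file in binary mode and pass them to `DetectBom`. If the result is `ENC_UNKNOWN`, the encoding still has to come from elsewhere, such as the editor's encoding menu.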

Answer 2

You read the file content and essentially copied it into an allocated memory buffer.

But the key point is: what encoding is used to store the Japanese text in the file?

For example, if the text is encoded in UTF-8, you should convert it from UTF-8 to UTF-16 (using, for example, the MultiByteToWideChar Win32 API), since you seem to have a wchar_t buffer in memory.
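To illustrate what such a conversion does, here is a small portable sketch of a UTF-8 to UTF-16 decoder. It is a hand-written stand-in, not the MultiByteToWideChar API: the name `Utf8ToUtf16` is hypothetical, `uint16_t` stands in for the 16-bit wchar_t used on Windows, and the validation is deliberately simple (overlong encodings, for instance, are not rejected).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert UTF-8 bytes to NUL-terminated UTF-16 code units.
 * Returns the number of code units written (not counting the terminator),
 * or (size_t)-1 if the input is malformed or 'out' is too small.
 * Hypothetical helper for illustration only.
 */
size_t Utf8ToUtf16(const unsigned char *in, size_t inLen,
                   uint16_t *out, size_t outCap)
{
    size_t i = 0, o = 0;

    while (i < inLen) {
        uint32_t cp;      /* decoded code point */
        size_t extra;     /* number of continuation bytes expected */
        unsigned char b = in[i++];

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* invalid lead byte */

        if (extra > inLen - i) return (size_t)-1;   /* truncated sequence */
        while (extra-- > 0) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80) return (size_t)-1;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject values outside Unicode and encoded surrogates. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return (size_t)-1;

        if (cp < 0x10000) {
            if (o + 1 >= outCap) return (size_t)-1;  /* need room for unit + NUL */
            out[o++] = (uint16_t)cp;
        } else {
            if (o + 2 >= outCap) return (size_t)-1;  /* need room for pair + NUL */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
            out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        }
    }

    if (o >= outCap) return (size_t)-1;
    out[o] = 0;   /* terminate so the result can be used as a string */
    return o;
}
```

For instance, the three UTF-8 bytes E3 82 8F (the character わ) decode to the single UTF-16 code unit 0x308F. On Windows the API call does the same job and also handles other code pages such as Shift-JIS, so it is the more practical choice in real code.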

If you are using a recent version of Visual Studio, you can also specify encoding information in the mode string passed to _wfopen_s (using the ccs flag).

EDIT: Since you are printing the content of the read buffer using printf, make sure that the buffer is NUL-terminated.
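A simple way to guarantee termination is to allocate one element more than you read and write the terminator yourself after the read. The sketch below shows the idea for raw bytes using only standard C; `ReadBytesTerminated` is a hypothetical name, not the question's function. For a wchar_t buffer the same pattern applies, with `(count + 1) * sizeof(wchar_t)` bytes allocated and `L'\0'` stored at index `count`.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read up to maxBytes bytes from 'path' into a newly allocated buffer and
 * always append a terminating 0 byte. Returns NULL on failure.
 * The caller releases the buffer with free().
 * Hypothetical helper for illustration only.
 */
char *ReadBytesTerminated(const char *path, size_t maxBytes, size_t *bytesRead)
{
    FILE *file;
    char *buffer;
    size_t n;

    if (path == NULL || maxBytes == SIZE_MAX)   /* maxBytes + 1 must not overflow */
        return NULL;

    file = fopen(path, "rb");                   /* binary mode: no translation */
    if (file == NULL)
        return NULL;

    buffer = malloc(maxBytes + 1);              /* +1 leaves room for the terminator */
    if (buffer == NULL) {
        fclose(file);
        return NULL;
    }

    n = fread(buffer, 1, maxBytes, file);
    if (ferror(file)) {                         /* a short read alone is not an error */
        fclose(file);
        free(buffer);
        return NULL;
    }
    fclose(file);

    buffer[n] = '\0';                           /* terminate right after the data read */
    if (bytesRead != NULL)
        *bytesRead = n;
    return buffer;
}
```

Terminating at index `n` (the count actually read) rather than relying on a prior memset means the result is a valid string even when the file fills the whole buffer.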

2 answers

Answer 1: score 2, accepted, 2016-10-13 14:41:15

Answer 2: score 0, 2016-10-13 14:20:17
