
Unable to read UNICODE text file in C

(I looked at previous posts and tried what they suggested but to no avail.)

I'm attempting to read in a file containing only Japanese characters. Here is what that file looks like:

わたし わ エドワド オ'ハゲン です。 これ は なん です か?

When I attempt to read it, nothing is displayed as output in the console and when debugging, the read buffer is just garbage. Here is the function I am using to read in the file:

wchar_t* ReadTextFileW(wchar_t* filePath, size_t numBytesToRead, size_t maxBufferSize, const wchar_t* mode, int seekOffset, int seekOrigin)
{
    size_t numItems = 0;
    size_t bufferSize = 0;
    wchar_t* buffer = NULL;
    FILE* file = NULL;

    //Ensure the filePath does NOT lead to a device.
    if (IsPathADevice(filePath) == false)
    {
        //0 indicates to read as much as possible (the max specified).
        if (numBytesToRead == 0)
        {
            numBytesToRead = maxBufferSize;
        }

        if (filePath != NULL && mode != NULL)
        {
            //Ensure there are no errors in opening the file.
            if (_wfopen_s(&file, filePath, mode) == 0)
            {
                //Set the cursor location (back to the beginning of the file by default).
                if (fseek(file, seekOffset, seekOrigin) != 0)
                {
                    //Error: Could not change file cursor position.
                    fclose(file);
                    return NULL;
                }

                //Calculate the size of the buffer in bytes.
                bufferSize = numBytesToRead * sizeof(wchar_t);

                //Create the buffer to store file data in.
                buffer = (wchar_t*)_aligned_malloc(bufferSize, BYTE_ALIGNMENT);

                //Ensure the buffer was allocated.
                if (buffer == NULL)
                {
                    //Error: Buffer could not be allocated.
                    fclose(file);
                    return NULL;
                }

                //Clear any garbage data in the buffer.
                memset(buffer, 0, bufferSize);

                //Read the data from the file.
                numItems = fread_s(buffer, bufferSize, sizeof(wchar_t), numBytesToRead, file);

                //Check whether anything was read (fread_s returns the number
                //of items read; numItems is unsigned, so it cannot be negative).
                if (numItems == 0)
                {
                    //Error: File could not be read.
                    fclose(file);
                    _aligned_free(buffer);
                    return NULL;
                }

                //Ensure the file is closed without errors.
                if (fclose(file) != 0)
                {
                    //Error: File did not close properly.
                    _aligned_free(buffer);
                    return NULL;
                }

            }
        }
    }

    return buffer;
}

To call this function, I am doing the following. Perhaps I'm not using setlocale() correctly, but from what I have read it seems like I am. To reiterate: the problem is that garbage seems to be read in and nothing is displayed in the console:

    setlocale(LC_ALL, "jp");
    wchar_t* retVal = ReadTextFileW(L"C:\\jap.txt", 0, 1024, L"r", 0, SEEK_SET);
    printf("%S\n", retVal);
    _aligned_free(retVal);

I also have the following defined at the top of my .cpp file:

#define UNICODE
#define _UNICODE

SOLVED:

To fix this, as ryyker mentioned, you need to know the encoding used to create the original file. In Notepad and Notepad++ there is a drop-down menu for the encoding; the default (and by far the most common) is UTF-8.

Once you know the encoding, you can add it to the mode string passed to _wfopen_s() via the ccs flag:

wchar_t* retVal = ReadWide::ReadTextFileW(L"C:\\jap.txt", 0, 1024, L"r, ccs=UTF-8", 0, SEEK_SET);
MessageBoxW(NULL, retVal, NULL, 0);
_aligned_free(retVal);

Note that the console may still fail to display the characters even after a correct read; MessageBoxW takes UTF-16 directly and sidesteps the console's code page entirely, which is why it is used here to display the text.

The following is an excerpt discussing encodings for Japanese-language files created with Notepad++ (which the OP stated in the comments was used):

Double-byte encodings, also called (by usage) Double-Byte Character Sets (DBCS)

Some of them predate Unicode and were designed to encode character sets with a large number of characters, mainly found in Far East languages with ideographic or syllabic scripts:

- The 2-byte Universal Character Set: UCS-2 Big Endian and UCS-2 Little Endian
- The Japanese code page: Shift-JIS (Windows-932)
- The Chinese code pages: Simplified Chinese GB2312 (Windows-936), Traditional Chinese Big5 (Windows-950)
- The Korean code pages: Windows-949, EUC-KR

It would appear that Shift-JIS might be the encoding you are trying to read. From here:

Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft...

In general, you need to determine the encoding that was used to create the multi-byte characters in a file before they can be correctly read back by a function in C, or in any other language. This link may help.

You read the file content and basically copied it into an allocated memory buffer.

But a key point is: what encoding is used to store the Japanese text in the file?

For example, if the text is encoded in UTF-8, you should convert it from UTF-8 to UTF-16 (using, for example, the MultiByteToWideChar Win32 API), since you appear to hold a wchar_t buffer in memory.

If you are using a recent version of Visual Studio, you can also specify some encoding information in the mode string passed to _wfopen_s (using the ccs flag).

EDIT: Since you are printing the contents of the read buffer using printf, make sure that the buffer is NUL-terminated.
