How to read a file with Unicode contents

How can I read a file with Unicode contents using C/C++?

I used the ReadFile function to read a file with Unicode contents, but the output is not what I expect. I want a buffer that contains the entire contents of the file.

I use this code:

#include <Windows.h>

int main()
{
    HANDLE hndlRead;
    OVERLAPPED ol = {0};

    CHAR* szReadBuffer;
    INT fileSize;

    hndlRead = CreateFileW(L"file", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hndlRead != INVALID_HANDLE_VALUE)
    {
        fileSize = GetFileSize(hndlRead, NULL);
        szReadBuffer = (CHAR*) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*2);
        DWORD nb=0;
        int nSize=fileSize;
        if (szReadBuffer != NULL)
        {
            ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
        }
    }

    return 0;
}

Is there any way to read this file correctly?

This is nb and szReadBuffer:

[image]

This is my file content in Notepad++:

[image]

Your code works fine. It reads the rdp file verbatim into memory.

You are troubled by the BOM (byte order mark) at the beginning of the rdp file.

If you look at the rdp file with a text editor (Notepad, for instance) you will see this:

screen mode id:i:2
use multimon:i:0
desktopwidth:i:2560
desktopheight:i:1600
....

If you look at the rdp file with a hexadecimal editor you will see this:

0000 FFFE 7300 6300 7200 6500 6500 6E00 2000 ..s.c.r.e.e.n. .
0008 6D00 6F00 6400 6500 2000 6900 6400 3A00 m.o.d.e. .i.d...
....

The bytes FF FE are the byte order mark (BOM). It indicates that the file is text encoded in little-endian UTF-16 (what Windows calls "Unicode"), so each character here takes 2 bytes instead of 1.
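To make the BOM check concrete, here is a minimal sketch that inspects the first bytes of a buffer. The names TextEncoding and detect_bom are hypothetical and not part of the Windows API, and only the UTF-8 and UTF-16 marks are handled (UTF-32 is deliberately left out).

```cpp
#include <cstddef>

// Hypothetical names for illustration; not part of any Windows API.
enum class TextEncoding { Unknown, Utf8, Utf16LE, Utf16BE };

// Report which byte order mark, if any, the buffer starts with.
// bom_len receives the number of bytes to skip (0 when no BOM is found).
// Assumption: UTF-32 marks are not considered in this sketch.
TextEncoding detect_bom(const unsigned char* data, std::size_t size,
                        std::size_t& bom_len)
{
    bom_len = 0;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        bom_len = 3;                    // EF BB BF: UTF-8
        return TextEncoding::Utf8;
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        bom_len = 2;                    // FF FE: UTF-16, little endian
        return TextEncoding::Utf16LE;
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        bom_len = 2;                    // FE FF: UTF-16, big endian
        return TextEncoding::Utf16BE;
    }
    return TextEncoding::Unknown;       // no recognizable BOM
}
```

Called on the bytes filled in by ReadFile, this would report Utf16LE for the rdp file shown above, with bom_len set to 2.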

Once the file has been read into memory you will get this (0x00318479 being the address szReadBuffer points to):

[image]

  • BTW 1: you should call CloseHandle(hndlRead) once the file has been read.
  • BTW 2: you shouldn't use HeapAlloc but rather malloc or calloc.

Corrected program:

#include <stdlib.h>   // calloc, free
#include <Windows.h>

int main()
{
  HANDLE hndlRead;

  WCHAR* szReadBuffer;   // WCHAR instead of CHAR
  INT fileSize;

  hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

  if (hndlRead != INVALID_HANDLE_VALUE)
  {
    fileSize = GetFileSize(hndlRead, NULL);
    szReadBuffer = (WCHAR*)calloc(fileSize + sizeof(WCHAR), 1);  // + sizeof(WCHAR) for NUL string terminator
    DWORD nb = 0;
    if (szReadBuffer != NULL)
    {
      // make sure the read succeeded and returned the whole file
      if (ReadFile(hndlRead, szReadBuffer, fileSize, &nb, NULL) && nb == (DWORD)fileSize)
      {
        WCHAR *textwithoutbom = szReadBuffer + 1;  // skip BOM (one WCHAR, i.e. 2 bytes)

        // put breakpoint here and inspect textwithoutbom
      }

      free(szReadBuffer);  // free what we have allocated
    }

    CloseHandle(hndlRead);   // close what we have opened
  }

  return 0;
}
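The program above skips the first WCHAR unconditionally, which assumes the file always starts with a BOM. As a hedged alternative, the sketch below decodes a UTF-16 little-endian byte buffer and drops the BOM only when it is actually present. The helper name decode_utf16le is hypothetical. It uses std::u16string so that it does not depend on the size of wchar_t, and it silently ignores a trailing odd byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn raw UTF-16LE bytes into a std::u16string.
// A leading BOM (bytes FF FE) is skipped only if it is present.
std::u16string decode_utf16le(const unsigned char* data, std::size_t size)
{
    std::size_t pos = 0;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        pos = 2;                        // skip the byte order mark

    std::u16string text;
    text.reserve((size - pos) / 2);

    // Combine each pair of bytes, low byte first, into one 16-bit code unit.
    for (; pos + 1 < size; pos += 2) {
        text.push_back(static_cast<char16_t>(data[pos] | (data[pos + 1] << 8)));
    }
    return text;
}
```

The bytes read by ReadFile could be passed in as decode_utf16le((const unsigned char*)szReadBuffer, nb). On Windows, where wchar_t is 16 bits wide, the result holds the same code units as textwithoutbom.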

As @MickaelWalz suggests, the RDP file format is now Unicode.

Here is a way to read and display the content of that file:

  • Use a wchar_t * buffer instead of a CHAR * or BYTE * buffer.
  • Check that ReadFile() succeeded: bRet != FALSE and nb == nSize.
  • Start at the second WCHAR to skip the byte order mark (bytes FF FE, which read as the WCHAR 0xFEFF).
  • Don't forget to close your file with CloseHandle(hndlRead);

#include <stdio.h>
#include <iostream>
#include <Windows.h>

int main()
{
    HANDLE hndlRead;
    OVERLAPPED ol = {0};

    //BYTE* szReadBuffer;
    INT fileSize;
    wchar_t *szReadBuffer;

    hndlRead = CreateFileW(L"rdp.RDP", GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (hndlRead != INVALID_HANDLE_VALUE)
    {
        fileSize = GetFileSize(hndlRead, NULL);
        szReadBuffer = (wchar_t *) HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, (fileSize)*sizeof(wchar_t));
        DWORD nb=0;
        int nSize=fileSize;
        BOOL bRet;
        if (szReadBuffer != NULL)
        {
            bRet = ReadFile(hndlRead, szReadBuffer, nSize, &nb, &ol);
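            if ((bRet) && (nb == (DWORD)nSize)) {
                // nb counts bytes, so divide by sizeof(wchar_t) to index the last character read
                printf("%02X,%02X... %02X\n",szReadBuffer[0],szReadBuffer[1],szReadBuffer[nb/sizeof(wchar_t)-1]);
                std::wcout << L"info " << (szReadBuffer+1) << L" " << nb << std::endl;
            }
            HeapFree(GetProcessHeap(), 0, szReadBuffer);  // release the buffer allocated with HeapAlloc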
        }
        CloseHandle(hndlRead);
    }

    return 0;
}

This question was asked almost 6 years ago.

Jabberwocky gave an excellent answer.

J. Piquard gave an excellent answer.

Using these two answers I can now (years later) both open a file with a Unicode file name and read the Unicode contents of that file.

It has taken a lot of reading for me to find this on this site as it was almost drowned by the clutter of confusion being posted as assumed fact herein.

Thank you, all of you guys who actually know how to do this stuff.
