
Edit text file in place using C++

I have a text file to which I am adding tags in order to make it readable as XML. For our reader to recognize it as valid, each line must at least be wrapped in tags. My issue arises because this is actually a Syriac translation dictionary, so there are many non-standard characters (the actual Syriac words). The most straightforward way I see to accomplish what I need is to prepend and append the needed tags to each line, in place, without necessarily accessing or modifying the rest of the line. Any other options would also be greatly appreciated.

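#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main() {
    ifstream in_file;
    string file_name;

    string line;
    string line2;
    string pre_text;
    string post_text;

    int num = 1;  // entry counter used for the n attribute

    pre_text = "<entry n=\"";
    post_text = "</entry>";

    file_name = "D:/TEI/dictionary1.txt";
    in_file.open(file_name.c_str());

    if (in_file.is_open()) {
        while (getline(in_file, line)) {
            // Wrap the line as <entry n="N">line</entry>
            line2 = pre_text + to_string(num) + "\">" + line + post_text;
            cout << line2 << '\n';  // getline strips the newline, so add it back
            num++;
        }
    }
}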

The file in question may be downloaded here.

You are using std::string, which stores plain char values and has no knowledge of the text's encoding, and you are opening your file in "text translation mode". The first thing you need to do is open the file in binary mode so that it doesn't perform translation on individual char values:

in_file.open(file_name.c_str(), std::ios::binary);

or, in C++11:

in_file.open(file_name, std::ios::binary);
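One side effect of binary mode: if the file uses Windows line endings, std::getline leaves the '\r' at the end of each line, because nothing translates "\r\n" any more. A minimal sketch of a helper that drops it (the name getline_no_cr is hypothetical, not part of the standard library):

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line and removes a trailing '\r', so that
// CRLF files read in binary mode do not leave a stray byte on each line.
bool getline_no_cr(std::istream& in, std::string& line) {
    if (!std::getline(in, line)) {
        return false;  // end of input or read error
    }
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return true;
}
```

It can be used wherever std::getline(in_file, line) would otherwise appear in the read loop.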

The next thing is to stop using std::string for storing the text from the file. You will need to use a string type that recognizes the character encoding you are using, together with the appropriate character type.

As it turns out, std::string is actually an alias for std::basic_string<char>. C++11 introduced several new Unicode character types (char16_t and char32_t); C++03 already had wchar_t, which supports "wide" characters (more than 8 bits). There is a standard alias for a basic_string of wchar_t: std::wstring.
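To make the difference between character types concrete, here is a small sketch that stores the same character in a char-based string and in a char32_t-based string. The character chosen (U+0710, a Syriac letter) and the function names are only illustrative assumptions:

```cpp
#include <cstddef>
#include <string>

// Number of char code units (bytes) when U+0710 is stored as UTF-8.
std::size_t utf8_units() {
    const std::string bytes = "\xDC\x90";  // the two UTF-8 bytes of U+0710
    return bytes.size();
}

// Number of char32_t code units when the same character is stored as UTF-32.
std::size_t utf32_units() {
    const std::u32string text = U"\u0710";
    return text.size();
}
```

The char-based string reports two elements for one character, while the char32_t-based string reports one.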

Start with the following simple test:

#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::string file_name = "D:/TEI/dictionary1.txt";
    // Wide input stream. How bytes become wchar_t values depends on the
    // stream's locale; the default "C" locale typically does not decode UTF-8.
    std::wifstream in_file(file_name, std::ios::binary);

    if (!in_file.is_open()) {
        // "L" prefix indicates a wide string literal
        std::wcerr << L"file open failed\n";
        return 1;
    }

    // Read and echo only the first line
    std::wstring line1;
    std::getline(in_file, line1);
    std::wcout << L"line1 = " << line1 << L"\n";
}

Note how cout and friends also gain a w prefix (wcout, wcerr).

The standard ASCII character set contains 128 characters numbered 0 through 127. In ASCII, \n and \r are represented with the 7-bit values 10 and 13 respectively.

Your text file appears to be UTF-8 encoded. UTF-8 uses 8-bit unsigned code units and lets each character occupy a variable number of bytes: the value 0 requires 1 byte, the value 128 requires 2 bytes, the value 8192 requires 3 bytes, and so on.
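The byte counts follow from fixed code point ranges. A minimal sketch (the function name utf8_length is hypothetical, and the sketch ignores the surrogate range, which a strict validator would reject):

```cpp
#include <cstdint>

// Number of bytes UTF-8 needs to encode a code point.
// Returns 0 for values beyond U+10FFFF, the end of the Unicode range.
int utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;       // 7 bits: plain ASCII
    if (cp < 0x800) return 2;      // up to 11 bits
    if (cp < 0x10000) return 3;    // up to 16 bits
    if (cp <= 0x10FFFF) return 4;  // up to 21 bits
    return 0;
}
```

Syriac letters live in the U+0700 block, so each of them takes two bytes.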

A byte with the highest bit (2^7) clear is a single 7-bit ASCII value. If the highest bit is set, the byte belongs to a multibyte sequence: a lead byte (bit pattern 110xxxxx, 1110xxxx or 11110xxx) announces the length of the sequence and carries the top bits of the value, and each following continuation byte (10xxxxxx) contributes 6 more bits. So the byte sequence { 0xC4, 0x80 } represents the value (4 << 6) | 0, or (wchar_t)256. The byte sequence { 0xC4, 0x8D } represents (4 << 6) | 13, or wchar_t 269.
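The bit manipulation involved in UTF-8 decoding can be sketched as a small function. This is a simplified illustration, not a production decoder: the name decode_utf8_at is hypothetical, and it skips the overlong-encoding and surrogate checks a strict decoder needs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence that starts at bytes[pos].
// Returns U+FFFD (the replacement character) on malformed or truncated input.
std::uint32_t decode_utf8_at(const std::string& bytes, std::size_t pos) {
    if (pos >= bytes.size()) return 0xFFFD;
    const unsigned char lead = static_cast<unsigned char>(bytes[pos]);

    std::uint32_t cp = 0;
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80) {                  // 0xxxxxxx: single ASCII byte
        cp = lead;
    } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
        cp = lead & 0x1F; extra = 1;
    } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
        cp = lead & 0x0F; extra = 2;
    } else if ((lead & 0xF8) == 0xF0) { // 11110xxx: 4-byte sequence
        cp = lead & 0x07; extra = 3;
    } else {
        return 0xFFFD;                  // stray continuation or invalid byte
    }

    if (pos + extra >= bytes.size()) return 0xFFFD;  // sequence is cut off
    for (std::size_t i = 1; i <= extra; ++i) {
        const unsigned char c = static_cast<unsigned char>(bytes[pos + i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;  // must look like 10xxxxxx
        cp = (cp << 6) | (c & 0x3F);            // append 6 payload bits
    }
    return cp;
}
```

For instance, the bytes 0xC4 0x80 decode to 256 and the bytes 0xC4 0x8D decode to 269.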

You can read and write UTF-8 through char streams and storage, but only as opaque byte streams. The moment you need to understand the values, you generally need to resort to wchar_t, uint16_t, uint32_t, etc.
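For the task in the question, the opaque-byte approach may be enough, since the tags are plain ASCII and the Syriac text only needs to pass through unchanged. A minimal sketch under that assumption (the function name wrap_lines is hypothetical; it writes to a separate output stream rather than editing in place, and it does not escape XML special characters such as & or < inside the line):

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Wraps every input line as <entry n="N">line</entry>, copying the bytes of
// the line through untouched. Returns the number of lines written.
int wrap_lines(std::istream& in, std::ostream& out) {
    int num = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Drop the '\r' left behind when a CRLF file is read in binary mode.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        ++num;
        out << "<entry n=\"" << num << "\">" << line << "</entry>\n";
    }
    return num;
}
```

With real files, the two streams would be a std::ifstream and a std::ofstream, both opened with std::ios::binary, with the output going to a different path than the input; the original file can be replaced afterwards once the result has been checked.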

If you are working with Microsoft's toolset (noting the "D:/" path), you may need to look into TCHAR (https://msdn.microsoft.com/en-us/library/c426s321.aspx).
