
Convert non-ASCII characters to their English counterparts in C++

I need to compare data that has been collected from various sources, some of which contain non-ASCII characters, specifically English letters with accents on them. An example is "Frédérik Gauthier", whose non-ASCII characters print as -61, -87, -61, -87. When I looked at the int values of the characters, I noticed that each accented letter is always a combination of two "characters": a value of -61, indicating that the letter is accented, followed by a value for the letter itself, in this case -87 for the accented 'e'. My goal is to just "drop" the accent and use the plain English character. Obviously, I can't rely on this behavior from system to system, so how do you handle this situation? std::string handles the characters without issue, but as soon as I get down to the char level, that's where the issues come up. Any guidance?

#include <iostream>
#include <fstream>
#include <string>
#include <algorithm>
#include <ctype.h>   // isascii (POSIX, not standard C++)

int main(int argc, char** argv){
    std::fstream fin;
    std::string line;
    std::string::iterator it;
    bool leave = false;
    fin.open(argv[1], std::ios::in);

    while(getline(fin, line)){
        std::for_each(line.begin(), line.end(), [](char &a){
            if(!isascii(a)) {
                if(int(a) == -68) a = 'u';
                else if(int(a) == -74) a = 'o';
                else if(int(a) == -83) a = 'i';
                else if(int(a) == -85) a = 'e';
                else if(int(a) == -87) a = 'e';
                else if(int(a) == -91) a = 'a';
                else if(int(a) == -92) a = 'a';
                else if(int(a) == -95) a = 'a';
                else if(int(a) == -120) a = 'n';
            }
        });
        it = line.begin();
        while(it != line.end()){
            it = std::find_if(line.begin(), line.end(), [](char &a){ return !isascii(a); });
            if(it != line.end()){
                line.erase(it);
                it = line.begin();
            }
        }
        std::cout << line << std::endl;
        std::for_each(line.begin(), line.end(), [&leave](char &a){
            if(!isascii(a)) {
                std::cout << a << " : " << int(a);
            }
        });
        if(leave){
            fin.close();
            return 1;
        }
    }
    fin.close();
    return 0;
}

This is a tricky task in general, and you'll probably need to adapt your solution to your particular task. To transliterate your string from whatever encoding it's in to ASCII, it's best to rely on a library instead of trying to implement this yourself. Here's an example using iconv:

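#include <iconv.h>
#include <memory>
#include <type_traits>
#include <string>
#include <iostream>
#include <algorithm>
#include <string_view>
#include <system_error>  // std::system_error, std::system_category
#include <cerrno>        // errno
#include <cctype>        // isalpha
#include <cassert>
using namespace std;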

string from_u8string(const u8string &s) {
  return string(s.begin(), s.end());
}

using iconv_handle = unique_ptr<remove_pointer<iconv_t>::type, decltype(&iconv_close)>;
iconv_handle make_converter(string_view to, string_view from) {
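    // string_view is not guaranteed to be null-terminated, so copy before calling the C API.
    auto raw_converter = iconv_open(string(to).c_str(), string(from).c_str());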
    if (raw_converter != (iconv_t)-1) {
        return { raw_converter, iconv_close };
    } else {
        throw std::system_error(errno, std::system_category());
    }
}

string convert_to_ascii(string input, string_view encoding) {
    iconv_handle converter = make_converter("ASCII//TRANSLIT", encoding);

    char* input_data = input.data();
    size_t input_size = input.size();

    string output;
    output.resize(input_size * 2);
    char* converted = output.data();
    size_t converted_size = output.size();

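    auto chars_converted = iconv(converter.get(), &input_data, &input_size, &converted, &converted_size);
    if (chars_converted != (size_t)(-1)) {
        // converted_size now holds the unused space; trim the buffer to what was actually written.
        output.resize(output.size() - converted_size);
        return output;
    } else {
        throw std::system_error(errno, std::system_category());
    }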
}

string convert_to_plain_ascii(string_view input, string_view encoding) {
    auto converted = convert_to_ascii(string{ input }, encoding);
    converted.erase(
        std::remove_if(converted.begin(), converted.end(), [](unsigned char c) { return !isalpha(c); }),
        converted.end()
    );
    return converted;
}

int main() {
    try {
        auto converted_utf8 = convert_to_plain_ascii(from_u8string(u8"Frédérik"), "UTF-8");
        assert(converted_utf8 == "Frederik");
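        // Spell out the windows-1252 bytes (0xE9 is é) so the test does not depend on this file's encoding.
        auto converted_1252 = convert_to_plain_ascii("Fr\xE9" "d\xE9rik", "windows-1252");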
        assert(converted_1252 == "Frederik");
    } catch (std::system_error& e) {
        cout << "Error " << e.code() << ": " << e.what() << endl;
    }
}
