Replacing characters in C# (ascii)

Question

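I have a file containing the following characters: à, è, ì, ò, ù - À. What I need to do is replace those characters with normal characters, e.g. à = a, è = e, and so on. This is my code so far: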

StreamWriter sw = new StreamWriter(@"C:/JoinerOutput.csv");
string path = @"C:/Joiner.csv";
string line = File.ReadAllText(path);

if (line.Contains("à"))
{
    string asAscii = Encoding.ASCII.GetString(Encoding.Convert(Encoding.UTF8, Encoding.GetEncoding(Encoding.ASCII.EncodingName, new EncoderReplacementFallback("a"), new DecoderExceptionFallback()), Encoding.UTF8.GetBytes(line)));
    Console.WriteLine(asAscii);
    Console.ReadLine();

    sw.WriteLine(asAscii);
    sw.Flush();
}

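Basically, this searches the file for a specific character and replaces it with another. The problem I'm having is that my if statement isn't working. How do I fix this?

Here is a sample of the input file: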

Dimàkàtso Mokgàlo
Màmà Ràtlàdi
Koos Nèl
Pàsèkà Modisè
Jèrèmiàh Morèmi
Khèthiwè Buthèlèzi
Tiànà Pillày
Viviàn Màswàngànyè
Thirèshàn Rèddy
Wàdè Cornèlius
ènos Nètshimbupfè

This is the output when using line = line.Replace('à', 'a'); instead:

Chï¿½rlï¿½nï¿½ Kirstï¿½n
Mï¿½mï¿½ Rï¿½tlï¿½di
Koos Nï¿½l
Pï¿½sï¿½kï¿½ Modisï¿½
Jï¿½rï¿½miï¿½h Morï¿½mi
Khï¿½thiwï¿½ Buthï¿½lï¿½zi
Tiï¿½nï¿½ Pillï¿½y
Viviï¿½n Mï¿½swï¿½ngï¿½nyï¿½
Thirï¿½shï¿½n Rï¿½ddy
Wï¿½dï¿½ Cornï¿½lius
ï¿½nos Nï¿½tshimbupfï¿½

With my code, the symbols are removed completely.

Answer 1

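Others have commented on using a Unicode lookup table to remove diacritics. I did a quick Google search and found this example. Code shamelessly copied (and re-formatted), and posted below: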

using System;
using System.Text;
using System.Globalization;

public static class Remove
{
    public static string RemoveDiacritics(string stIn)
    {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for(int ich = 0; ich < stFormD.Length; ich++) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if(uc != UnicodeCategory.NonSpacingMark) {
                sb.Append(stFormD[ich]);
            }
        }

        return(sb.ToString().Normalize(NormalizationForm.FormC));
    }
}

So your code could clean the input by calling:

line = Remove.RemoveDiacritics(line);
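
To tie this back to the file in the question, here is a minimal sketch of cleaning a whole file line by line. The names CsvCleaner, StripMarks and CleanFile are made up for illustration, and the input encoding is passed in by the caller because the question never states what encoding Joiner.csv actually uses.

```csharp
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;

public static class CsvCleaner
{
    // Same technique as RemoveDiacritics above: decompose to FormD,
    // drop the non-spacing marks, then recompose to FormC.
    public static string StripMarks(string text)
    {
        var kept = text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);
        return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: reads inputPath with the given encoding, cleans each line,
    // and writes the result to outputPath (assumed to be a different file).
    public static void CleanFile(string inputPath, string outputPath, Encoding inputEncoding)
    {
        var cleaned = File.ReadLines(inputPath, inputEncoding).Select(line => StripMarks(line));
        File.WriteAllLines(outputPath, cleaned);
    }
}
```

A call might look like CsvCleaner.CleanFile(@"C:/Joiner.csv", @"C:/JoinerOutput.csv", Encoding.UTF8), assuming the file really is UTF-8; if it is not, the accented letters are already lost before the diacritics are stripped.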

Answer 2

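I don't know if it's useful, but in an internal tool that writes messages on an LED screen we have the following replacements (I'm sure there are smarter ways to make this work with the Unicode tables, but this one is good enough for this small internal tool):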

        strMessage = Regex.Replace(strMessage, "[éèëêð]", "e");
        strMessage = Regex.Replace(strMessage, "[ÉÈËÊ]", "E");
        strMessage = Regex.Replace(strMessage, "[àâä]", "a");
        strMessage = Regex.Replace(strMessage, "[ÀÁÂÃÄÅ]", "A");
        strMessage = Regex.Replace(strMessage, "[àáâãäå]", "a");
        strMessage = Regex.Replace(strMessage, "[ÙÚÛÜ]", "U");
        strMessage = Regex.Replace(strMessage, "[ùúûüµ]", "u");
        strMessage = Regex.Replace(strMessage, "[òóôõöø]", "o");
        strMessage = Regex.Replace(strMessage, "[ÒÓÔÕÖØ]", "O");
        strMessage = Regex.Replace(strMessage, "[ìíîï]", "i");
        strMessage = Regex.Replace(strMessage, "[ÌÍÎÏ]", "I");
        strMessage = Regex.Replace(strMessage, "[š]", "s");
        strMessage = Regex.Replace(strMessage, "[Š]", "S");
        strMessage = Regex.Replace(strMessage, "[ñ]", "n");
        strMessage = Regex.Replace(strMessage, "[Ñ]", "N");
        strMessage = Regex.Replace(strMessage, "[ç]", "c");
        strMessage = Regex.Replace(strMessage, "[Ç]", "C");
        strMessage = Regex.Replace(strMessage, "[ÿ]", "y");
        strMessage = Regex.Replace(strMessage, "[Ÿ]", "Y");
        strMessage = Regex.Replace(strMessage, "[ž]", "z");
        strMessage = Regex.Replace(strMessage, "[Ž]", "Z");
        strMessage = Regex.Replace(strMessage, "[Ð]", "D");
        strMessage = Regex.Replace(strMessage, "[œ]", "oe");
        strMessage = Regex.Replace(strMessage, "[Œ]", "Oe");
        strMessage = Regex.Replace(strMessage, "[«»\u201C\u201D\u201E\u201F\u2033\u2036]", "\"");
        strMessage = Regex.Replace(strMessage, "[\u2026]", "...");

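One thing to note is that, while in most languages the text is still understandable after this kind of treatment, that is not always the case, and it often forces the reader to rely on the context of the sentence to understand it. Not something you want if you have the choice.

Note that the correct solution would be to use the Unicode tables: replace characters that have integrated diacritics with their "combining diacritical mark(s)" + character form, and then remove the diacritics...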
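
To make that decomposition step concrete, here is a small sketch that prints the code points a string turns into under canonical decomposition (FormD). DecompositionDemo and DescribeFormD are hypothetical names used only for this illustration.

```csharp
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Returns the UTF-16 code units of the FormD (canonically decomposed) input,
    // formatted as "U+XXXX" and separated by spaces.
    public static string DescribeFormD(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        return string.Join(" ", decomposed.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

For è (U+00E8) this gives the base letter e (U+0065) followed by the combining grave accent (U+0300), which is what makes the "remove the marks" step possible. Characters such as ø, ð and œ have no canonical decomposition, so they come back unchanged; for those, a replacement table like the one above is still needed.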

Answer 3

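I often use an extension method based on the version Dana supplied. A quick explanation:

Normalizing to form D splits characters like è into an e and a non-spacing ` (grave accent)
From this, the non-spacing characters are removed
The result is normalized back to form C (I'm not sure if this is necessary)

Code: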

using System.Linq;
using System.Text;
using System.Globalization;

// namespace here
public static class Utility
{
    public static string RemoveDiacritics(this string str)
    {
        if (str == null) return null;
        var chars =
            from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
            let uc = CharUnicodeInfo.GetUnicodeCategory(c)
            where uc != UnicodeCategory.NonSpacingMark
            select c;

        var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

        return cleanStr;
    }
}

Answer 4

Why are you making things complicated?

line = line.Replace('à', 'a');

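Update:

The documentation for File.ReadAllText says:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

Use the ReadAllText(String, Encoding) method overload when reading files that might contain imported text, because unrecognized characters may not be read correctly.

What encoding is C:/Joiner.csv in? Maybe you should use the other overload of File.ReadAllText, where you specify the input encoding yourself?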
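
As a rough sketch of that suggestion: the helper below reads the file with an explicit encoding and then does the simple replacement. ISO-8859-1 is purely a guess here, since the question does not say how the file is encoded; JoinerReader and ReadAndReplace are made-up names.

```csharp
using System.IO;
using System.Text;

public static class JoinerReader
{
    // Hypothetical helper: reads the file with a caller-supplied encoding
    // instead of relying on byte-order-mark detection, then replaces two accents.
    public static string ReadAndReplace(string path, Encoding encoding)
    {
        string text = File.ReadAllText(path, encoding);
        return text
            .Replace('\u00E0', 'a')   // à -> a
            .Replace('\u00E8', 'e');  // è -> e
    }

    // Assumption: the file is ISO-8859-1 (Latin-1). Swap in the real encoding if known.
    public static string ReadLatin1AndReplace(string path)
    {
        return ReadAndReplace(path, Encoding.GetEncoding("iso-8859-1"));
    }
}
```

If the file holds single-byte accented letters and is decoded as UTF-8, each of them becomes the replacement character U+FFFD, so Replace('à', 'a') never finds anything to replace. That would be consistent with the garbled output shown in the question.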

Answer 5

Do it the simple way. The code below replaces all special characters with ASCII characters in just 2 lines of code. It gives you the same result as Julien Roncaglia's solution.

byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(inputText);
string outputText = System.Text.Encoding.ASCII.GetString(bytes);

Answer 6

Use this:

     if (line.Contains("OldChar"))
     {
        line = line.Replace("OldChar", "NewChar");
     }

Answer 7

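It sounds like what you want to do is convert extended ASCII (eight-bit) to ASCII (seven-bit), so searching for that may help.

I've seen libraries that handle this in other languages but have never had to do it in C#. This looks like it may be somewhat enlightening, though:

Convert two ascii characters to their 'corresponding' one character extended ascii representation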


Answer 1
25 2011-03-28 13:31:51

Answer 2
11 accepted 2011-03-28 13:32:51

Answer 3
6 2012-10-31 09:28:22

Answer 4
3 2011-03-28 13:27:30

Answer 5
2 2016-10-11 08:16:17

Answer 6
0 2011-03-28 13:30:11

Answer 7
0 2011-03-28 13:40:35
