C# removing white spaces in an HTML string

Is it possible to remove all white spaces in the following HTML string in C#:

"
<html>

<body>

</body>

</html>
"

Thanks

When dealing with HTML, or any markup for that matter, it's usually best to run it through a parser that truly understands the rules of that markup.

The first benefit is that it can tell you whether your input data is garbage to start with.

If the parser is smart enough, it might even be able to correct badly formed markup automatically, or accept it with relaxed rules.

You can then modify the parsed content and have the parser write out the changes. That way you can be sure the markup rules are followed and the output is correct.

For simple HTML scenarios, or for markup so badly formed that a parser balks at it straight away, you can fall back to hacking the input string with string replacements and the like. Which approach you take depends on your needs.

Here are a couple of tools that can help you out:

HTML Tidy

You can use HTML Tidy and specify options/rules for how you want your HTML tidied up (e.g. remove superfluous whitespace).

It's a WIN32 DLL, but there are C# wrappers for it.

HtmlAgilityPack

You can use HtmlAgilityPack to parse the HTML if you need to understand its structure better and perhaps do your own tidying up/restructuring.
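
As a rough illustration of the parser route, here is a minimal sketch that loads the string with HtmlAgilityPack, drops every text node that contains only whitespace, and writes the document back out. The class and method names (`HtmlWhitespaceCleaner`, `RemoveBlankTextNodes`) are made up for this example, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlWhitespaceCleaner
{
    // Hypothetical helper: removes text nodes that consist only of whitespace.
    public static string RemoveBlankTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the nodes first; removing while enumerating would modify
        // the collection being walked.
        var blankTextNodes = doc.DocumentNode
            .Descendants()
            .OfType<HtmlTextNode>()
            .Where(t => string.IsNullOrWhiteSpace(t.Text))
            .ToList();

        foreach (var node in blankTextNodes)
        {
            node.Remove();
        }

        // Serialize the modified tree back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Text nodes with real content are left untouched, so spaces inside sentences and attribute values survive. Whitespace-only nodes between inline elements or inside `<pre>` can affect rendering, so you may want to skip those cases depending on your markup.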

myString = myString.Replace(System.Environment.NewLine, "");
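
One caveat with the line above: `Environment.NewLine` is platform dependent (`"\r\n"` on Windows, `"\n"` on Unix-like systems), so a string whose line endings differ from the platform's will not be fully cleaned. A small sketch that removes both characters regardless of platform; the class and method names are made up for this example:

```csharp
public static class LineBreakStripper
{
    // Hypothetical helper: removes every carriage return and line feed,
    // whichever line-ending convention the input uses.
    public static string RemoveLineBreaks(string input)
    {
        return input.Replace("\r", "").Replace("\n", "");
    }
}
```

Applied to the string from the question this gives `<html><body></body></html>`. Spaces and tabs are left in place, so this only helps when line breaks are the only whitespace you need to remove.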

You can use a regular expression to match and replace whitespace characters:

// Requires: using System.Text.RegularExpressions;
s = Regex.Replace(s, @"\s+", String.Empty);
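
Note that the pattern `\s+` removes every whitespace character, including the spaces inside tags and text. The sketch below contrasts it with a narrower pattern that only removes whitespace sitting between a closing `>` and the next `<`. The class and method names are invented for this example, and the narrower pattern is an alternative I am suggesting, not part of the answer above.

```csharp
using System.Text.RegularExpressions;

public static class HtmlRegexWhitespace
{
    // Removes all whitespace everywhere, as in the one-liner above.
    public static string RemoveAllWhitespace(string html)
    {
        return Regex.Replace(html, @"\s+", string.Empty);
    }

    // Removes only whitespace between tags, then trims both ends.
    public static string RemoveWhitespaceBetweenTags(string html)
    {
        return Regex.Replace(html, @">\s+<", "><").Trim();
    }
}
```

For the string in the question both methods return `<html><body></body></html>`. For `<td class="a"> <p>Hi there</p> </td>` the first returns `<tdclass="a"><p>Hithere</p></td>`, which is broken markup, while the second returns `<td class="a"><p>Hi there</p></td>`.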

I used this solution (in my opinion it works well; see also the test code):

  1. Add an extension method to trim the HTML string:
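// Requires "using System;" and "using System.Text;".
// As an extension method, it must be declared inside a static class.
public static string RemoveSuperfluousWhitespaces(this string input)
{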
    if (input.Length < 3) return input;
    var resultString = new StringBuilder(); // Using StringBuilder is much faster than using regular expressions here!
    var inputChars = input.ToCharArray();
    var index1 = 0;
    var index2 = 1;
    var index3 = 2;
    // Remove superfluous white spaces from the html stream by the following replacements:
    //  '<no whitespace>' '>' '<whitespace>' ==> '<no whitespace>' '>'
    //  '<whitespace>' '<' '<no whitespace>' ==> '<' '<no whitespace>'
    while (index3 < inputChars.Length)
    {
        var char1 = inputChars[index1];
        var char2 = inputChars[index2];
        var char3 = inputChars[index3];
        if (!Char.IsWhiteSpace(char1) && char2 == '>' && Char.IsWhiteSpace(char3))
        {
            // drop whitespace character in char3
            index3++;
        }
        else if (Char.IsWhiteSpace(char1) && char2 == '<' && !Char.IsWhiteSpace(char3))
        {
            // drop whitespace character in char1
            index1 = index2;
            index2 = index3;
            index3++;
        }
        else
        {
            resultString.Append(char1);
            index1 = index2;
            index2 = index3;
            index3++;
        }
    }

    // (index3 >= inputChars.Length)
    resultString.Append(inputChars[index1]);
    resultString.Append(inputChars[index2]);
    var str = resultString.ToString();
    return str;
}

  2. Add test code:

[Test]
public void TestRemoveSuperfluousWhitespaces()
{
    var html1 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
    var html2 = $"<td class=\"keycolumn\">{Environment.NewLine}<p class=\"mandatory\">Some recipe parameter name</p>{Environment.NewLine}</td>";
    var html3 = $"<td class=\"keycolumn\">{Environment.NewLine} <p class=\"mandatory\">Some recipe parameter name</p> {Environment.NewLine}</td>";
    var html4 = " <td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td>";
    var html5 = "<td class=\"keycolumn\"><p class=\"mandatory\">Some recipe parameter name</p></td> ";
    // Be() compares exactly; BeEquivalentTo() on strings ignores casing.
    var compactedHtml1 = html1.RemoveSuperfluousWhitespaces();
    compactedHtml1.Should().Be(html1);
    var compactedHtml2 = html2.RemoveSuperfluousWhitespaces();
    compactedHtml2.Should().Be(html1);
    var compactedHtml3 = html3.RemoveSuperfluousWhitespaces();
    compactedHtml3.Should().Be(html1);
    var compactedHtml4 = html4.RemoveSuperfluousWhitespaces();
    compactedHtml4.Should().Be(html1);
    var compactedHtml5 = html5.RemoveSuperfluousWhitespaces();
    compactedHtml5.Should().Be(html1);
}
