
How to determine which HTML is “code” and which is “display/content”?

I want to use C# to parse HTML data.

Think of every character of the HTML as carrying a bit: true means "html/code" and false means "display/content". With such a mask, you would know which parts of the HTML are "code".

Let's use the following HTML example:

<a id="a1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

I want to use C# String.Replace to replace every instance of "a1" with "new1" and every instance of "attr1" with "new2". However, only the HTML "code" should be affected; the "content" must NOT be changed. The desired result is:

<a id="new1" class="c1" new2="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

Note: the desired result still contains 2 instances of "a1" and 2 instances of "attr1" in the content that were not renamed.

I can't find any existing library or software that would help in this effort.

EDIT1: HtmlAgilityPack might be an option. However, I'm still no closer to understanding how I could use it to differentiate between code and not-code.

EDIT2: Please keep in mind that this question simplifies my real problem as much as possible. Renaming things based on whether they appear in quotes won't be the answer. I specifically need to figure out how to differentiate between code and not-code.

EDIT3: I have included "attr1" as a second String.Replace. I need to replace both attribute names AND attribute values, and I need to be able to distinguish between code and not-code.

Any suggestions?

Following the comments made on this post, I came up with the following:

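// LINQPad-style entry point; requires the HtmlAgilityPack NuGet package.
void Main()
{
    var html = "<a id=\"attr1\" class=\"c1\" attr1=\"x\" attr2=\"y\">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>";

    var res = Replace(html, "attr1", "attrA");
}

// Renames every attribute name or attribute value that equals oldval.
// Text content is never touched because only attributes are visited.
public string Replace(string html, string oldval, string newval)
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // Descendants() walks the whole tree, so nested elements are covered too
    // (ChildNodes would only visit the top-level nodes).
    foreach (var n in doc.DocumentNode.Descendants())
    {
        foreach (var a in n.Attributes)
        {
            // Exact match on the attribute value
            if (a.Value.Equals(oldval))
            {
                a.Value = newval;
            }

            // Exact match on the attribute name
            if (a.Name.Equals(oldval))
            {
                a.Name = newval;
            }
        }
    }

    return doc.DocumentNode.OuterHtml;
}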

Given the input:

<a id="attr1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

The output is:

<a id="attrA" class="c1" attra="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>

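This should meet the current requirements. Note that the renamed attribute name is written out in lowercase (attra), and that only an exact match of a whole attribute name or value is replaced, not a substring.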
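If a plain String.Replace over the "code" part is really what is needed, the true/false-per-character idea from the question can be sketched without any library. The following is only a minimal sketch under simplifying assumptions: `HtmlCodeReplacer`, `CodeMask` and `ReplaceInCode` are hypothetical names made up for this illustration, everything between `<` and the matching `>` is treated as code (quoted attribute values may contain `>`), and comments, `<script>`/`<style>` bodies, CDATA and stray `<` characters in text are not handled. For real-world HTML a parser such as HtmlAgilityPack is likely the safer choice.

```csharp
using System;
using System.Text;

public static class HtmlCodeReplacer
{
    // Returns one flag per character: true = markup ("code"), false = text content.
    // Assumption: every '<' outside a tag starts a tag, and the tag ends at the
    // first '>' that is not inside a quoted attribute value.
    public static bool[] CodeMask(string html)
    {
        var mask = new bool[html.Length];
        bool inTag = false;   // currently between '<' and its closing '>'
        char quote = '\0';    // active quote character inside a tag, or '\0'

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (!inTag)
            {
                if (c == '<') { inTag = true; mask[i] = true; }
                continue;      // text content keeps the default 'false'
            }

            mask[i] = true;    // everything inside a tag counts as code
            if (quote != '\0')
            {
                if (c == quote) quote = '\0';   // closing quote
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;                       // opening quote
            }
            else if (c == '>')
            {
                inTag = false;                   // end of the tag
            }
        }
        return mask;
    }

    // Applies string.Replace to each contiguous run of code characters only;
    // runs of text content are copied through unchanged.
    public static string ReplaceInCode(string html, string oldValue, string newValue)
    {
        bool[] mask = CodeMask(html);
        var result = new StringBuilder();
        int start = 0;

        while (start < html.Length)
        {
            int end = start;
            while (end < html.Length && mask[end] == mask[start]) end++;

            string run = html.Substring(start, end - start);
            result.Append(mask[start] ? run.Replace(oldValue, newValue) : run);
            start = end;
        }
        return result.ToString();
    }
}
```

Calling `ReplaceInCode(html, "a1", "new1")` and then `ReplaceInCode(result, "attr1", "new2")` on the example from the question yields the desired result shown there. Because this is a substring replace, it will also rewrite tag names or parts of longer attribute names that happen to contain the search text, so the search strings need to be chosen with care.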
