How to determine which HTML is “code” and which is “display/content”?
I want to use C# to parse HTML data.
If you think of every character of the HTML as carrying a bit, where true = "html/code" and false = "display/content", then you would know which part of the HTML is the "code".
Let's use the following HTML example:
<a id="a1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>
I want to do a C# String.Replace to find all instances of "a1" and replace them with "new1", and a second String.Replace to find all instances of "attr1" and replace them with "new2".
But I only want the HTML "code" to be affected; all "content" must NOT be changed. The desired result is:
<a id="new1" class="c1" new2="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>
Note: the desired result has 2 other instances of "a1" that were not renamed, and 2 other instances of "attr1" that were not renamed.
I can't find any existing library or software that would help in this effort.
EDIT1: HtmlAgilityPack might be an option. However, I'm still no closer to understanding how I could use it to differentiate between code and not-code.
EDIT2: Please keep in mind that this question is simplified from my real problem as much as possible. Renaming things with and without quotes won't be the answer. I specifically need to figure out how to differentiate between code and not-code.
EDIT3: I have included "attr1" as a secondary String.Replace. I need to find both attribute names AND attribute values to replace, and I need to be able to distinguish between code and not-code.
Any suggestions?
Following the comments made on this post, I came up with the following:
    void Main()
    {
        var html = "<a id=\"attr1\" class=\"c1\" attr1=\"x\" attr2=\"y\">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>";
        var res = Replace(html, "attr1", "attrA");
    }

    // Renames attribute names and attribute values equal to oldval.
    // Text content is never touched, because only attributes are inspected.
    public string Replace(string html, string oldval, string newval)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(html);

        // Descendants() also visits nested elements, not only the top-level nodes.
        foreach (var n in doc.DocumentNode.Descendants())
        {
            foreach (var a in n.Attributes)
            {
                if (a.Value.Equals(oldval))
                {
                    a.Value = newval;
                }
                if (a.Name.Equals(oldval))
                {
                    a.Name = newval;
                }
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
Given the input:
<a id="attr1" class="c1" attr1="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>
The output is:
<a id="attrA" class="c1" attra="x" attr2="y">a1 c1 attr1</a> <p>a1 c1 attr1 attr2</p>
This should meet the current requirements. Note that the renamed attribute name appears in lowercase (attra) in this output.
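For comparison, the per-character "bit" idea from the question can be sketched without any library. This is only a rough illustration: the names HtmlCodeMask, Build and ReplaceInCode are made up here, and the scanner is deliberately naive. It assumes everything between < and > is "code", tracks quotes so that a > inside an attribute value does not end the tag, and does not handle comments, script/style blocks, or a stray < in text. Matching is also plain substring matching, so "a1" would match inside a longer name such as "data1". For real-world HTML, a parser such as HtmlAgilityPack is likely the safer route.

```csharp
using System;
using System.Text;

public static class HtmlCodeMask
{
    // One flag per character: true = inside a tag ("code"), false = text ("content").
    public static bool[] Build(string html)
    {
        var mask = new bool[html.Length];
        bool inTag = false;
        char quote = '\0'; // active quote character while inside an attribute value

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (!inTag)
            {
                if (c == '<') { inTag = true; mask[i] = true; }
                continue;
            }

            mask[i] = true;
            if (quote != '\0')
            {
                if (c == quote) quote = '\0'; // closing quote
            }
            else if (c == '"' || c == '\'')
            {
                quote = c;                    // opening quote
            }
            else if (c == '>')
            {
                inTag = false;                // end of tag
            }
        }
        return mask;
    }

    // Replaces oldValue with newValue only where every matched character is flagged as code.
    public static string ReplaceInCode(string html, string oldValue, string newValue)
    {
        if (string.IsNullOrEmpty(oldValue)) return html;

        bool[] mask = Build(html);
        var result = new StringBuilder(html.Length);
        int i = 0;
        while (i < html.Length)
        {
            if (IsCodeMatch(html, mask, i, oldValue))
            {
                result.Append(newValue);
                i += oldValue.Length;
            }
            else
            {
                result.Append(html[i]);
                i++;
            }
        }
        return result.ToString();
    }

    // True when value occurs at start and all of its characters lie inside tags.
    private static bool IsCodeMatch(string html, bool[] mask, int start, string value)
    {
        if (start + value.Length > html.Length) return false;
        if (string.CompareOrdinal(html, start, value, 0, value.Length) != 0) return false;
        for (int k = start; k < start + value.Length; k++)
        {
            if (!mask[k]) return false;
        }
        return true;
    }
}
```

Calling HtmlCodeMask.ReplaceInCode(html, "a1", "new1") on the example from the question and then HtmlCodeMask.ReplaceInCode(result, "attr1", "new2") on that result should leave the text between the tags unchanged while renaming the id value and the attr1 attribute.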