C#: Parse the XML body out of HTML and save it to a file

Question

In C#, after I perform a GET against the API, it returns XML code embedded in an HTML file similar to this:

<!DOCTYPE html>

<html lang="en">
    <head>
        <meta name="viewport" content="initial-scale=1, width=device-width">
        <title>config</title>
    </head>
    <body>
        
<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

    </body>
</html>

I am trying to save the XML content from the body (<CONFIG... CONFIGEND="0"/>) out to a file. My attempts using HtmlAgilityPack result in the XML data being modified as follows:

<CONFIG="2"></CONFIG>
...
<CONFIGEND="0"></CONFIGEND>

I am new to C# (and programming in general), so please be kind. My search attempts have left me more confused than when I started :/

Answer 1

Yes, as you have figured out, HtmlAgilityPack rewrites the markup when it parses it. HTML looks like XML, but it is not necessarily well-formed XML (here, for example, the <meta> tag is never closed), so System.Xml.XmlDocument cannot load this HTML file. You therefore need to extract the body manually.
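As a rough illustration of why a strict XML parser is no help here, the sketch below checks whether a string loads as well-formed XML. The class and method names (XmlParseCheck, IsWellFormedXml) and the <root> wrapper are made up for this example; only XmlDocument.LoadXml and XmlException come from System.Xml.

```csharp
using System;
using System.Xml;

public static class XmlParseCheck
{
    // Hypothetical helper: returns true when the text loads as well-formed XML.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            new XmlDocument().LoadXml(text);
            return true;
        }
        catch (XmlException)
        {
            // LoadXml throws XmlException on any well-formedness error.
            return false;
        }
    }

    public static void Main()
    {
        // A tag of the form <CONFIG="2"/> has '=' directly after the name, which XML does not allow.
        Console.WriteLine(IsWellFormedXml("<root><CONFIG=\"2\"/></root>"));       // False
        // The same data written as a named attribute is well-formed.
        Console.WriteLine(IsWellFormedXml("<root><CONFIG value=\"2\"/></root>")); // True
    }
}
```

So even the body content on its own is not something XmlDocument would accept, which is another reason to treat it as plain text.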

As Anis R. says, a regular expression is the easiest way to do this. To use the Regex class, add using System.Text.RegularExpressions; at the top of your file.

Let's say your HTML content is in a variable named htmlstring.

First, define a pattern for your case:

string regexPattern = @"\<body\>(.*?)\<\/body\>";
Regex regex = new Regex(regexPattern, RegexOptions.Singleline);

You need the RegexOptions.Singleline option because your HTML content contains newline characters, and without it the dot (.) does not match them.

string body = regex.Match(htmlstring).Value;

With this, body will contain:

<body>
        
<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

    </body>

To remove the body tags:

string result = body.Replace("<body>", "").Replace("</body>", "");

To trim leading and trailing whitespace:

string prettierResult = result.Trim();

Now you have:

<CONFIG="2"/>
<VALUE1="1"/>
<VALUE2="2"/>
<CONFIGEND="0"/>

To save the content to a file (File lives in System.IO, so add using System.IO; as well):

File.WriteAllText("c:\\path-to-save", prettierResult);
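Putting the steps above together, a minimal sketch might look like the following. It reads the capture group instead of stripping the tags with Replace, and it checks Match.Success so a response without a body does not silently write an empty file. The names BodyExtractor and TryExtractBody, the shortened sample HTML, and the output path under the temp folder are all assumptions made for this example.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class BodyExtractor
{
    // Hypothetical helper: puts the trimmed text between <body> and </body> into 'body'.
    // Returns false (and an empty string) when no body element is found.
    public static bool TryExtractBody(string html, out string body)
    {
        // Singleline lets '.' match newlines; '?' makes the match stop at the first </body>.
        Match match = Regex.Match(html, @"<body>(.*?)</body>", RegexOptions.Singleline);
        body = match.Success ? match.Groups[1].Value.Trim() : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        // Shortened stand-in for the API response shown in the question.
        string htmlstring = "<html>\n<body>\n<CONFIG=\"2\"/>\n<CONFIGEND=\"0\"/>\n</body>\n</html>";

        if (!TryExtractBody(htmlstring, out string body))
        {
            Console.Error.WriteLine("No <body> element found.");
            return;
        }

        // Assumed output location; replace it with the real target path.
        string outputPath = Path.Combine(Path.GetTempPath(), "config.txt");
        File.WriteAllText(outputPath, body);
        Console.WriteLine(body);
    }
}
```

Group 1 holds only what is between the tags, so the two Replace calls are not needed in this variant.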

Answer 2

If the format is consistent¹ (e.g., you always want everything between <body>...</body>), then one way is to use a regex:

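string pattern = @"<body>(.*)</body>";
// Singleline lets '.' match newlines, which a multi-line body needs
Regex rg = new Regex(pattern, RegexOptions.Singleline);

string html = "<body>Content here</body>";

// get the first match and print its captured group
Match firstMatch = rg.Match(html);
if (firstMatch.Success)
    Console.WriteLine(firstMatch.Groups[1].Value); // "Content here"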

(PS: this needs using System.Text.RegularExpressions;)
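The sample string above fits on one line, but the body in the question spans several. The sketch below (the class and method names SinglelineDemo and FindBody are invented for illustration) runs the same pattern with and without RegexOptions.Singleline to show the difference.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    private const string Pattern = @"<body>(.*)</body>";

    // Hypothetical helper: runs the pattern with the given options and returns the Match.
    public static Match FindBody(string html, RegexOptions options)
    {
        return Regex.Match(html, Pattern, options);
    }

    public static void Main()
    {
        string html = "<body>\nContent here\n</body>";

        // Default options: '.' does not match '\n', so a multi-line body is not found.
        Match plain = FindBody(html, RegexOptions.None);
        Console.WriteLine(plain.Success); // False

        // Singleline: '.' matches every character, including '\n'.
        Match singleline = FindBody(html, RegexOptions.Singleline);
        if (singleline.Success)
        {
            Console.WriteLine(singleline.Groups[1].Value.Trim()); // Content here
        }
    }
}
```

Checking Success before reading Groups[1] also avoids the exception that indexing an empty match collection would throw.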

¹ Keeping this in mind
