Need help parsing text between HTML tags

Question

OK, the problem is that I have a string containing HTML. I need to find a specific format like this:

<span class="fieldText">some text</span>

From that HTML, I need to extract "some text" and save it into a list. How can I accomplish this?

Note that the text can appear like this:

<p>
    Central: 
<span class="fieldText">Central_Local</span><br>Area Resolutoria:  
<span class="fieldText">Area_Resolutoria</span><br>VPI:  
<span class="fieldText">VIP</span><br>Ciudad: <span class="fieldText">Ciudad</span>   <br>Estado:  <span class="fieldText">Estado</span><br>Region  <span class="fieldText">Region</span>    
</p>

Answer 1

You can try a regex: @"<span.*?>(.*?)</span>". If you combine it with captures, you can get the whole list in a single match with @"^(.*?<span.*?>(.*?)</span>.*?)+$".

But the truth is that you shouldn't use regex for XML or HTML. There are plenty of parsers out there, as others have already mentioned.

            using System;
            using System.Text.RegularExpressions;

            string s = @"
<p>
    Central: 
<span class=""fieldText"">Central_Local</span><br>Area Resolutoria:  
<span class=""fieldText"">Area_Resolutoria</span><br>VPI:  
<span class=""fieldText"">VIP</span><br>Ciudad: <span class=""fieldText"">Ciudad</span>   <br>Estado:  <span class=""fieldText"">Estado</span><br>Region  <span class=""fieldText"">Region</span>    
</p>";

            // Singleline lets '.' match newlines, so the repeated outer group can span the whole string.
            Match m = Regex.Match(s, @"^(.*?<span .*?>(.*?)</span>.*?)+$", RegexOptions.Singleline);

            // Group 2 is captured once per repetition; Captures keeps every value, not just the last one.
            foreach (Capture capture in m.Groups[2].Captures)
                Console.WriteLine(capture.Value);

Answer 2

I don't like using regular expressions for stuff like this.

I've written a free HTML tag parser that you could use as is, modify to fit your needs, or just use as a guide to how you might approach this on your own.

Answer 3

Have you tried the HtmlAgilityPack?

Answer 4

For small stuff like this I prefer using regular expressions. I'm not sure what the C# syntax is, but the expression would look something like this:

|<span class="fieldText">(.+?)</span>|

Jonathan Wood's suggestion of using an HTML tag parser is a good idea too, especially if you'll be doing a lot of parsing.
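
Since this answer leaves the C# syntax open, here is a minimal sketch of how such an expression might be applied with Regex.Matches to fill a list. The class name FieldTextRegex and the method Extract are hypothetical names chosen for illustration. The sketch uses a lazy quantifier, (.+?), so that several spans on one line produce separate matches, and it assumes the opening tag is written exactly as <span class="fieldText">.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FieldTextRegex
{
    // Hypothetical helper: returns the inner text of every
    // <span class="fieldText">...</span> in the input, in document order.
    public static List<string> Extract(string html)
    {
        var values = new List<string>();

        // (.+?) is lazy, so each match stops at the nearest </span>
        // even when several spans share one line.
        foreach (Match match in Regex.Matches(html, @"<span class=""fieldText"">(.+?)</span>"))
        {
            values.Add(match.Groups[1].Value);
        }

        return values;
    }

    public static void Main()
    {
        string html = @"<p>Central: <span class=""fieldText"">Central_Local</span><br>" +
                      @"Ciudad: <span class=""fieldText"">Ciudad</span></p>";

        // Prints Central_Local, then Ciudad, each on its own line.
        foreach (string value in Extract(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

This only holds up while the markup stays as regular as the sample in the question. Extra attributes, single quotes, or nested tags inside the span would need a different pattern or a real parser.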

Answer 5

Regex has been shown to be a bad solution for parsing HTML. The HTML Agility Pack is exactly what you need for this task.
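
As a rough sketch of what that could look like, the following loads the HTML string with the HTML Agility Pack and selects the spans with an XPath query. It assumes the HtmlAgilityPack NuGet package is referenced in the project. The class name FieldTextParser and the method Extract are hypothetical names for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class FieldTextParser
{
    // Hypothetical helper: collects the text of every span whose
    // class attribute is exactly "fieldText".
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var values = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//span[@class='fieldText']");
        if (nodes == null)
        {
            return values;
        }

        foreach (HtmlNode node in nodes)
        {
            values.Add(node.InnerText.Trim());
        }

        return values;
    }

    public static void Main()
    {
        string html = @"<p>Central: <span class=""fieldText"">Central_Local</span><br>" +
                      @"Ciudad: <span class=""fieldText"">Ciudad</span></p>";

        foreach (string value in Extract(html))
        {
            Console.WriteLine(value);
        }
    }
}
```

Unlike a regex, this does not depend on attribute order or on how the tags are split across lines, which is why a parser tends to be the safer choice for markup you do not control.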

5 answers

Answer 1 (score 2, accepted): 2011-07-13 16:17:20
Answer 2 (score 2): 2011-07-13 16:20:38
Answer 3 (score 1): 2011-07-13 16:25:15
Answer 4 (score 0): 2011-07-13 16:24:35
Answer 5 (score 0): 2011-07-13 16:26:01
