
RegEx matching HTML tags and extracting text

I have a string of text like this:

<customtag>hey</customtag>

I want to use a RegEx to modify the text between the "customtag" tags so that it might look like this:

<customtag>hey, this is changed!</customtag>

I know that I can use a MatchEvaluator to modify the text, but I'm unsure of the proper RegEx syntax to use. Any help would be much appreciated.

I wouldn't use a regular expression for this, but if you must, this expression should work: <customtag>(.+?)</customtag>
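As a rough illustration of how that pattern could be applied with a replacement callback: the question is about .NET's MatchEvaluator, but the sketch below uses JavaScript, where `String.prototype.replace` accepts a callback that plays a similar role. The helper name `appendToCustomTags` and the suffix parameter are made up for this example.

```javascript
// Pattern from the answer above: lazily capture whatever sits between the tags.
const customTagPattern = /<customtag>(.+?)<\/customtag>/g;

// Hypothetical helper: appends a suffix to the text inside each <customtag>.
function appendToCustomTags(input, suffix) {
  // The callback receives the whole match first, then each capture group.
  return input.replace(customTagPattern, (wholeMatch, innerText) =>
    `<customtag>${innerText}${suffix}</customtag>`);
}

console.log(appendToCustomTags("<customtag>hey</customtag>", ", this is changed!"));
// prints: <customtag>hey, this is changed!</customtag>
```

Because `.` does not match newlines by default, this sketch only handles tag content that sits on a single line.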

I'd chew my own leg off before using a regular expression to parse and alter HTML.

Use XSL or DOM.


Two comments have asked me to clarify. The regular expression substitution works in the specific case in the OP's question, but in general regular expressions are not a good solution. Regular expressions can match regular languages, i.e. sets of input sequences that can be accepted by a finite state machine. HTML can contain nested tags to any arbitrary depth, so it's not a regular language.

What does this have to do with the question? Using a regular expression for the OP's question as it is written works, but what if the content between the <customtag> tags contains other tags? What if a literal < character occurs in the text? It has been 11 months since Jon Tackabury asked the question, and I'd guess that in that time, the complexity of his problem may have increased.
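To make the nesting concern concrete, here is a small sketch in JavaScript using the lazy pattern suggested elsewhere on this page. The nested input string is invented for illustration.

```javascript
// The lazy pattern stops at the first closing tag it can find.
const lazyPattern = /<customtag>(.+?)<\/customtag>/;

const flat = "<customtag>hey</customtag>";
const nested = "<customtag>outer <customtag>inner</customtag> tail</customtag>";

const flatCapture = flat.match(lazyPattern)[1];
const nestedCapture = nested.match(lazyPattern)[1];

console.log(flatCapture);   // hey
console.log(nestedCapture); // outer <customtag>inner
```

In the nested case the capture ends at the inner closing tag, so the text " tail" and the real closing tag are left out of the match.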

Regular expressions are great tools and I do use them all the time. But using them in lieu of a real parser for input that needs one is going to work in only very simple cases. It's practically inevitable that these cases grow beyond what regular expressions can handle. When that happens, you'll be tempted to write a more complex regular expression, but these quickly become very laborious to develop and debug. Be ready to scrap the regular expression solution when the parsing requirements expand.

XSL and DOM are two standard technologies designed to work with XML or XHTML markup. Both technologies know how to parse structured markup files, keep track of nested tags, and allow you to transform tags, attributes, or content.
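The sketch below is neither XSL nor DOM. It is only a hand-rolled JavaScript illustration of the nesting bookkeeping that a real parser does for you and that a single regular expression cannot do. The function name `rewriteOutermost` is hypothetical, and the scanner understands only bare `<name>` and `</name>` tags (no attributes, comments, or CDATA), so real code should still reach for a proper parser.

```javascript
// Hypothetical helper, not a real parser: rewrites the content of each
// outermost <tagName> element while counting nested elements of the same name.
function rewriteOutermost(input, tagName, transform) {
  const open = `<${tagName}>`;
  const close = `</${tagName}>`;
  let output = "";
  let depth = 0;        // how many <tagName> elements are currently open
  let contentStart = 0; // where the outermost element's content begins
  let i = 0;

  while (i < input.length) {
    if (input.startsWith(open, i)) {
      if (depth === 0) {
        output += open;
        contentStart = i + open.length;
      }
      depth += 1;
      i += open.length;
    } else if (depth > 0 && input.startsWith(close, i)) {
      depth -= 1;
      if (depth === 0) {
        // The outermost element just closed: transform everything inside it.
        output += transform(input.slice(contentStart, i)) + close;
      }
      i += close.length;
    } else {
      // Text outside any <tagName> element is copied through unchanged.
      if (depth === 0) output += input[i];
      i += 1;
    }
  }
  // An unclosed element is passed through untouched rather than dropped.
  if (depth > 0) output += input.slice(contentStart);
  return output;
}

const nestedInput = "<customtag>outer <customtag>inner</customtag> tail</customtag>";
console.log(rewriteOutermost(nestedInput, "customtag", (text) => text + "!"));
// prints: <customtag>outer <customtag>inner</customtag> tail!</customtag>
```

The depth counter is the part a finite state machine cannot express for unbounded nesting, which is why this job belongs to a parser.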

Here are a couple of articles on how to use XSL with C#:

Here are a couple of articles on how to use DOM with C#:

Here's a .NET library that assists DOM and XSL operations on HTML:

If no other tags exist between the two tags, this regular expression is safer and more efficient:

<customtag>[^<>]*</customtag>
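A minimal JavaScript sketch of how the stricter pattern behaves. The capture group around `[^<>]*` and the helper name `appendToPlainCustomTags` are additions for this example and are not part of the answer above.

```javascript
// Same pattern as above, with a capture group added so the text can be reused.
const strictPattern = /<customtag>([^<>]*)<\/customtag>/g;

// Hypothetical helper: appends a suffix only when the tag holds plain text.
function appendToPlainCustomTags(input, suffix) {
  return input.replace(strictPattern, (wholeMatch, text) =>
    `<customtag>${text}${suffix}</customtag>`);
}

console.log(appendToPlainCustomTags("<customtag>hey</customtag>", ", this is changed!"));
// prints: <customtag>hey, this is changed!</customtag>

// With inner markup nothing matches, so the input comes back unchanged.
console.log(appendToPlainCustomTags("<customtag>hey <b>you</b></customtag>", "!"));
// prints: <customtag>hey <b>you</b></customtag>
```

Refusing to match when markup is present is the safety property here: the pattern leaves unexpected input alone instead of rewriting it incorrectly.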

// Strip all HTML tags (Content is assumed to hold the HTML string)
var re = new RegExp("<[^>]*>", "g");
var x2 = Content.replace(re, "");

// Remove all non-breaking spaces (U+00A0, the character that &nbsp; decodes to)
var x3 = x2.replace(/\u00a0/g, '');
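One detail worth checking: `/\u00a0/g` matches the decoded non-breaking space character, not the literal text `&nbsp;` that appears in raw HTML source. The sketch below wraps both replacements in a hypothetical function, `stripTagsAndNbsp`, and assumes either form may be present in the input.

```javascript
// Hypothetical helper combining the two replacements above.
function stripTagsAndNbsp(html) {
  // Remove anything that looks like a tag.
  const withoutTags = html.replace(/<[^>]*>/g, "");
  // Remove both the raw &nbsp; entity and the decoded U+00A0 character.
  return withoutTags.replace(/&nbsp;|\u00a0/g, "");
}

console.log(stripTagsAndNbsp("<p>Hello&nbsp;<b>world</b>\u00a0!</p>"));
// prints: Helloworld!
```

Like the original snippet, this deletes the non-breaking spaces outright. Replacing them with a regular space instead would keep the words apart.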

Most people use HTML Agility Pack for HTML text parsing. However, I find it a little heavyweight and complicated for my own needs. I create a web browser control in memory, load the page, and copy the text from it (see the examples linked below).

You can find 3 simple examples here:

http://jakemdrew.wordpress.com/2012/02/03/getting-only-the-text-displayed-on-a-webpage-using-c/
