
Regex to get HTML between two specific strings

I am not very well-versed in regular expressions, but I am trying to accomplish something in ASP.Net which I think requires them.

I am pulling in an HTML file, doing some processing, and outputting new "merged" HTML. The portion I am struggling with is grabbing a chunk of code located between two predefined "tags" of my own creation.

Here is an example of the relevant input HTML:

<table style="width: 500px; font-family: Trebuchet MS, sans-serif; font-size: 13px; background-color: #fff; border: 0; border-collapse: collapse;" align="center" cellspacing="0">
<thead>
<tr>
<th colspan="3" style="text-align: left;border-bottom: 1px solid #DDDDDD;">
Add-ons
</th>
</tr>
</thead>
<tbody>
[AddonsListSTART]
<tr style="border-bottom: 1px dashed #DDDDDD;">
<td>[AddonName]</td>
<td>[AddonQty]</td>
<td align="right">[AddOnPrice]</td>
</tr>
[AddonsListEND]
</tbody>
</table>
<br />

This is my C# code:

//Find Add-ons HTML : between [AddonsListSTART] & [AddonsListEND]
Regex rgxAddonSE = new Regex(@"\[AddonsListSTART\](?<MyHtml>.*)\[AddonsListEND\]");

Match matchAddonSE  = rgxAddonSE.Match(htmlEmail);

string htmlAddons = matchAddonSE.ToString();

What I want to happen is for "htmlAddons" to be equal to the string:

<tr style="border-bottom: 1px dashed #DDDDDD;">
<td>[AddonName]</td>
<td>[AddonQty]</td>
<td align="right">[AddOnPrice]</td>
</tr>

The problem is that it is always blank, and "matchAddonSE.Success" is always false. I know there is something wrong with my regex, but I can't figure out what.

Thank you in advance for any help.

Heather

I think it may be related to multi-line/single-line processing. Consider http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#Singleline

The problem is that .* does not match newline characters, so it can never span the line breaks between your two markers. For predefined markers like these that appear only once in the text (so a single match is expected), a regex might not be the best way to go. Why not just find the positions with IndexOf and use Substring?
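
To make the newline issue concrete, here is a minimal sketch of the behavior. The class name, method names, and the shortened sample string are made up for illustration; only the pattern itself comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // Hypothetical, shortened input: the two markers sit on separate lines.
    public const string Sample = "[AddonsListSTART]\n<tr>row</tr>\n[AddonsListEND]";

    // The pattern from the question, unchanged.
    public const string Pattern = @"\[AddonsListSTART\](?<MyHtml>.*)\[AddonsListEND\]";

    // Default options: '.' matches every character except '\n'.
    public static bool MatchesByDefault(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    // RegexOptions.Singleline: '.' matches '\n' as well.
    public static bool MatchesWithSingleline(string input)
    {
        return Regex.IsMatch(input, Pattern, RegexOptions.Singleline);
    }
}
```

With the sample above, MatchesByDefault should report no match while MatchesWithSingleline should report a match, which is consistent with Success always being false in the question.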

If you still want to use a regex, make the pattern match line breaks explicitly, for example with (?:.|\r|\n)* in place of .* (note that [.\r\n]* does not do this, because inside a character class the dot is a literal period). Using [\s\S]* gives you pretty much the same result, since \s and \S together cover every character:

\s is roughly equivalent to [ \f\n\r\t\v].

\S is roughly equivalent to [^ \f\n\r\t\v].
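
As a rough sketch of the [\s\S] approach, the helper below pulls out only the text between the two markers. The class and method names are hypothetical, and the lazy quantifier (*?) and the Trim() call are my own additions rather than part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddonBlockExtractor
{
    // [\s\S] matches any character, including line breaks, so no
    // RegexOptions are needed. The lazy *? stops at the first end marker.
    private const string Pattern =
        @"\[AddonsListSTART\](?<MyHtml>[\s\S]*?)\[AddonsListEND\]";

    // Returns the text between the markers, or null if they are not found.
    public static string Extract(string html)
    {
        Match m = Regex.Match(html, Pattern);
        // Read the named group; the whole match would include the markers.
        return m.Success ? m.Groups["MyHtml"].Value.Trim() : null;
    }
}
```

Calling AddonBlockExtractor.Extract(htmlEmail) on the input from the question should return just the <tr>...</tr> row, without the surrounding markers or the leading and trailing line breaks.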

Another option would be to run the regex in Singleline mode. (The name is confusing, but it actually means that the dot "." is allowed to match newline characters.)

Below is a Substring usage example.

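String startTag = "[AddonsListSTART]";
String endTag = "[AddonsListEND]";
int start = htmlEmail.IndexOf(startTag);
// Look for the end tag only after the start tag.
int end = (start >= 0) ? htmlEmail.IndexOf(endTag, start + startTag.Length) : -1;
String res = "";
if ((start >= 0) && (end >= 0)) {
  // Substring takes (startIndex, length), not (startIndex, endIndex).
  int contentStart = start + startTag.Length;
  res = htmlEmail.Substring(contentStart, end - contentStart);
}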

Here is a Singleline mode usage example (note RegexOptions.Singleline):

//Find Add-ons HTML : between [AddonsListSTART] & [AddonsListEND]
Regex rgxAddonSE = new Regex(@"\[AddonsListSTART\](?<MyHtml>.*)\[AddonsListEND\]", RegexOptions.Singleline);

Match matchAddonSE  = rgxAddonSE.Match(htmlEmail);

// Read the named group; matchAddonSE.ToString() would also include the two markers.
string htmlAddons = matchAddonSE.Groups["MyHtml"].Value;

The same thing, except that Singleline mode is enabled from within the pattern using the inline (?s) option:

Regex rgxAddonSE = new Regex(@"(?s)\[AddonsListSTART\](?<MyHtml>.*)\[AddonsListEND\]");
