Extract text from multiline HTML using Regex
I'm trying to extract some text from an HTML file.
This is a sample of the part that gives me a headache:
<TD>
Adresa instalacije:
</TD>
<TD COLSPAN=2>
<TABLE border=0 cellpadding=3 cellspacing="1" bgcolor="#AAAA77" width="100%">
<TR bgcolor="#FFFFCC">
<TD COLSPAN=2><B>SOME TEXT</B></TD>
</TR>
<TR bgcolor="#FFFFCC">
<TD>ADM šifra: </TD>
<TD><B>914122</B></TD>
</TR>
</TABLE>
</TD>
The part I want to extract is between
<TD COLSPAN=2><B> </B></TD>
And this is my regex:
var regexAdresa = @"<TD>Adresa korisnika:</TD><TD COLSPAN=2>";
regexAdresa += @"<TABLE border=0 cellpadding=3 cellspacing=""1"" bgcolor=""#AAAA77"" width=""100%"">";
regexAdresa += @"<TR bgcolor=""#FFFFCC"">";
regexAdresa += @"<TD><B>(.*?)</B></TD>";
regexAdresa += @"</TR></TABLE></TD>";
var r0 = new Regex(regexAdresa);
var rr0 = r0.Match(text);
var res0 = rr0.Groups[1].ToString();
My result always returns 0. Am I doing something wrong?
I'd use PhantomJS: it's invisible to the user and it parses the entire DOM, giving you access via Selenium. To access
<TD COLSPAN=2><B> </B></TD>:
var text = driver.FindElement(By.CssSelector("td[colspan='2'] > b")).Text;
Warning: code not tested, given as an example only.
For further information on using the By locator within Selenium, click here.
Thanks to all, especially to @Arghya C.
I've tried something and for now this satisfies my needs. Maybe it's not the best solution, but it works:
var regexAdresa = @"<TD (COLSPAN=[1-9]+)?><B>[^<>]+<\/B><\/TD>";
Regex g = new Regex(regexAdresa);
Match m = g.Match(text);
if (m.Success)
{
    MessageBox.Show(m.ToString());
    MessageBox.Show(Regex.Replace(m.ToString(), "<.*?>", String.Empty));
}
The first regex gets me the line containing the text I want, and in a second step another regex removes the HTML tags.
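The two steps can probably be folded into one by putting a capture group around the text. The following is only a minimal sketch run against the sample HTML from the question. `AddressExtractor` and `ExtractBoldCell` are hypothetical names, not from the original post, and the `\s*` parts are an assumption that the tags may be separated by whitespace or line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressExtractor
{
    // Hypothetical helper (not from the original post): returns the text inside
    // <TD COLSPAN=n><B>...</B></TD>, or an empty string when nothing matches.
    public static string ExtractBoldCell(string html)
    {
        // \s* tolerates whitespace and line breaks between the tags.
        // Group 1 captures only the text, so no second Regex.Replace is needed.
        var pattern = new Regex(
            @"<TD\s+COLSPAN=\d+\s*>\s*<B>\s*([^<>]+?)\s*</B>\s*</TD>",
            RegexOptions.IgnoreCase);

        Match m = pattern.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Sample HTML copied from the question (quotes doubled for the verbatim string).
        string html = @"<TD>
Adresa instalacije:
</TD>
<TD COLSPAN=2>
<TABLE border=0 cellpadding=3 cellspacing=""1"" bgcolor=""#AAAA77"" width=""100%"">
<TR bgcolor=""#FFFFCC"">
<TD COLSPAN=2><B>SOME TEXT</B></TD>
</TR>
<TR bgcolor=""#FFFFCC"">
<TD>ADM šifra: </TD>
<TD><B>914122</B></TD>
</TR>
</TABLE>
</TD>";

        Console.WriteLine(ExtractBoldCell(html)); // prints: SOME TEXT
    }
}
```

Like any regex over HTML, this stays brittle: it depends on the `COLSPAN` attribute being the only one on the cell and on the `<B>` element containing plain text.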