Regex for HTML parsing (in C#)

I'm trying to parse an HTML page and extract two values from a table row. The HTML for the table row is as follows:

<tr>
<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>
</tr>

and the expression I have at the moment is:

<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]
<td[^<]+?>(?<value>([\d]+))</td>[\s]
<td[^<]+?>(?<time>([\d\:]+))</td>[\s]</tr>

However, I don't seem to be able to extract any matches. Could anyone point me in the right direction? Thanks.

Parsing HTML reliably with regular expressions is notoriously difficult.

I think I would be looking for an HTML parsing library, or a "screen scraping" library ;)

If the HTML comes from an unreliable source, you have to be extra careful to handle malicious HTML syntax well. Bad HTML handling is a major source of security attacks.

Try:

<tr>\s*
<td[^>]*>.*?</td>\s*
<td[^>]*>\s*(?<value>\d+)\s*</td>\s*
<td[^>]*>\s*(?<time>\d{2}:\d{2}:\d{2})\s*</td>\s*
</tr>\s*
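
As a rough sketch of how the pattern above could be applied from C#, the snippet below compiles it with RegexOptions.IgnorePatternWhitespace so it can stay split across lines as posted. The class name MaxTemperatureRegex, the helper TryExtract, and the abbreviated sample row (most attributes dropped) are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MaxTemperatureRegex
{
    // The suggested pattern. IgnorePatternWhitespace makes the regex engine
    // skip the line breaks and indentation inside the pattern itself.
    private static readonly Regex RowPattern = new Regex(@"
        <tr>\s*
        <td[^>]*>.*?</td>\s*
        <td[^>]*>\s*(?<value>\d+)\s*</td>\s*
        <td[^>]*>\s*(?<time>\d{2}:\d{2}:\d{2})\s*</td>\s*
        </tr>\s*",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns true and fills value/time on a match.
    public static bool TryExtract(string html, out string value, out string time)
    {
        Match m = RowPattern.Match(html);
        value = m.Success ? m.Groups["value"].Value : string.Empty;
        time = m.Success ? m.Groups["time"].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        // Abbreviated copy of the row from the question.
        string html = @"<tr>
<td class=""TABLEDATACELL"" nowrap=""nowrap"">Max Temperature (ºC)</td>
<td class=""TABLEDATACELLNOTT"" nowrap=""nowrap"">6</td>
<td class=""TABLEDATACELLNOTT"" nowrap=""nowrap""> 13:41:30</td>
</tr>";

        if (TryExtract(html, out string value, out string time))
        {
            Console.WriteLine(value + " at " + time); // prints "6 at 13:41:30"
        }
    }
}
```

Note that the first cell is matched with .*?, so this pattern accepts any three-cell row of this shape; if the page contains several such rows, the first cell would need to be tightened to the "Max Temperature" label.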

When you write <td[^<]+?> I guess you really mean <td[^>]*>

That is, "opening angle bracket, td, maybe stuff other than a closing angle bracket..."

<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]

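I haven't looked through all of it, but the [^<] probably needs to be [^>], since you're trying to match everything that isn't a > up to the > that comes before Max Temperature.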

The " (ºC)" before the closing td was not being matched: [\w\s]* does not cover the parentheses. Try matching anything up to the next tag instead:

<tr>[\s]<td[^<]+?>Max Temperature[^<]*</td>[\s]

Is that \w a word boundary? I think it gets a little tricky there; I'd use a more general approach.

And on the third line, there is one whitespace character after the td tag. Is that accounted for?

<td[^<]+?>[\s]?(?<time>([\d\:]+))</td>[\s]</tr>
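
To check the two points above in isolation, a small sketch like the following can run each fragment against the corresponding cell from the question. The class OriginalPatternChecks and its helpers are hypothetical names, and the time cell is abbreviated to a single class attribute.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternChecks
{
    // Cell contents taken from the question's table row (attributes abbreviated).
    private const string LabelCell = "Max Temperature (ºC)</td>";
    private const string TimeCell =
        "<td class=\"TABLEDATACELLNOTT\"> 13:41:30</td>";

    // Does "Max Temperature" + tail + "</td>" match the label cell?
    public static bool LabelMatches(string tail) =>
        Regex.IsMatch(LabelCell, "Max Temperature" + tail + "</td>");

    // Does the time-cell pattern match with the given prefix before the digits?
    public static bool TimeMatches(string prefix) =>
        Regex.IsMatch(TimeCell,
            @"<td[^<]+?>" + prefix + @"(?<time>([\d\:]+))</td>");

    public static void Main()
    {
        Console.WriteLine(LabelMatches(@"[\w\s]*")); // False: "(" is not matched
        Console.WriteLine(LabelMatches(@"[^<]*"));   // True
        Console.WriteLine(TimeMatches(""));          // False: leading space
        Console.WriteLine(TimeMatches(@"[\s]?"));    // True
    }
}
```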

I use http://www.regexbuddy.com/ for such checks. From what I've tested so far, @sgehrig's suggestion is correct.

Use the Html Agility Pack or a similar library instead, as @Bjarke Ebert suggests. It's the right tool for the task.
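
For comparison, here is a minimal sketch of the same extraction with the Html Agility Pack, assuming the HtmlAgilityPack NuGet package is referenced. The class name MaxTemperatureHap, the XPath expression, and the abbreviated sample row are illustrative choices, not something given in the answers.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package

public static class MaxTemperatureHap
{
    public static void Main()
    {
        // Abbreviated copy of the row from the question.
        string html = @"<table><tr>
<td class=""TABLEDATACELL"">Max Temperature (ºC)</td>
<td class=""TABLEDATACELLNOTT"">6</td>
<td class=""TABLEDATACELLNOTT""> 13:41:30</td>
</tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the row whose first cell starts with "Max Temperature".
        HtmlNode row = doc.DocumentNode.SelectSingleNode(
            "//tr[td[1][starts-with(normalize-space(.), 'Max Temperature')]]");

        if (row != null)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            string value = cells[1].InnerText.Trim();
            string time = cells[2].InnerText.Trim();
            Console.WriteLine(value + " at " + time);
        }
    }
}
```

Because the row is located by its label and the cells by position, attribute order and whitespace in the markup no longer matter, which is the main advantage over the regex approach.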
