Regex to find content in HTML tags

I need to parse an HTML file and extract the NeedThis* strings with C#/.NET. Sample markup:

<tr class="class">
    <td style="width: 120px">
        <a href="NeedThis1">NeedThis2</a>
    </td>
    <td style="width: 120px">
        <a href="NeedThis3">
            NeedThis4</a>
    </td>
    <td style="width: 30%">
        NeedThis5
    </td>
    <td>
        NeedThis6
    </td>
    <td style="width: 120px">
        NeedThis7
    </td>
</tr>

I know an HTML parser would be better here, but all I need is to extract these texts; this is just for a temporary helper tool...

Can anyone help me with this?

Thanks!

If you are sure the HTML is valid, you can use LINQ to XML; otherwise you had better use a parser such as HTML Agility Pack.
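A minimal sketch of the LINQ to XML route, assuming the row is well-formed XML exactly as in the question's sample. `ExtractValues` is a hypothetical helper name; `XElement.Parse` throws on malformed markup, which is the case where a real HTML parser would be needed instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Sample row from the question; it happens to be well-formed XML.
string html = @"<tr class=""class"">
    <td style=""width: 120px"">
        <a href=""NeedThis1"">NeedThis2</a>
    </td>
    <td style=""width: 120px"">
        <a href=""NeedThis3"">
            NeedThis4</a>
    </td>
    <td style=""width: 30%"">
        NeedThis5
    </td>
    <td>
        NeedThis6
    </td>
    <td style=""width: 120px"">
        NeedThis7
    </td>
</tr>";

// Hypothetical helper: walks each <td> and collects the href and text values.
List<string> ExtractValues(string rowMarkup)
{
    var results = new List<string>();
    XElement row = XElement.Parse(rowMarkup); // throws XmlException on malformed input

    foreach (XElement cell in row.Elements("td"))
    {
        var link = cell.Element("a");
        if (link != null)
        {
            // Attribute value first, then the link text with whitespace trimmed.
            results.Add(link.Attribute("href")?.Value ?? "");
            results.Add(link.Value.Trim());
        }
        else
        {
            // Plain cell: the trimmed text content is the value.
            results.Add(cell.Value.Trim());
        }
    }
    return results;
}

List<string> values = ExtractValues(html);
Console.WriteLine(string.Join(", ", values));
```

Because the values are found by position in the element tree rather than by their text, the same code keeps working when the real strings do not start with "NeedThis".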

It doesn't matter whether you're doing this for a one-off or for a "finished project". Your task isn't text extraction, and it's not something a regex can do effectively. The data you're looking for depends on the structure of the HTML. Your task is parsing HTML. When your task is parsing HTML, use an HTML parser. It's not difficult. In fact, it's a lot easier than writing the pile of regexes you would otherwise need.

You seem to have answered your own question. You should use a parser. But if you don't, you can use the RE NeedThis.*

Of course, if you want any context with those strings, you should just use a parser.
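To see what that RE actually returns, here is a small sketch run against one line of the sample. The tighter `NeedThis\d+` pattern is an assumption that the real values are literally "NeedThis" followed by digits, which may not hold if NeedThis* is only a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string line = "<a href=\"NeedThis1\">NeedThis2</a>";

// The pattern from the answer: ".*" is greedy, so the match runs to the end of the line.
string greedy = Regex.Match(line, @"NeedThis.*").Value;
Console.WriteLine(greedy);   // NeedThis1">NeedThis2</a>

// Assumed tighter variant: only valid if the values are "NeedThis" plus digits.
var tight = new List<string>();
foreach (Match m in Regex.Matches(line, @"NeedThis\d+"))
{
    tight.Add(m.Value);
}
Console.WriteLine(string.Join(", ", tight));   // NeedThis1, NeedThis2
```

Neither pattern tells you whether a value came from an href or from link text, which is the context the answer says a parser should supply.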

Hans, as you can see from the other answers, using a RegEx is probably not the best way to do what you want, but since I need to practice my RegEx anyway, I went ahead and made one in case you want to experiment. This will only catch NeedThis2, but it should give you an idea of how to build your own RegEx when one is an appropriate solution.

<a href="NeedThis1">NeedThis2</a>

RegEx to catch NeedThis2:

(?:<a[^>]+?>)([^<\s]+)(?:</a>)

Pretty nasty, huh? :)
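If you do want to experiment further, here is a hedged C# sketch of the same idea that captures both the href and the link text and tolerates whitespace around the text. The pattern and the group names `href` and `text` are my own assumptions: it expects a double-quoted href as the first attribute and no nested tags inside the link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Two links from the sample; the second has its text on a new line.
string html = "<td><a href=\"NeedThis1\">NeedThis2</a></td>\n" +
              "<td><a href=\"NeedThis3\">\n    NeedThis4</a></td>";

// Assumed pattern: double-quoted href, no nested tags inside the link.
var anchor = new Regex(@"<a\s+href=""(?<href>[^""]*)""[^>]*>\s*(?<text>[^<]*?)\s*</a>");

var pairs = new List<string>();
foreach (Match m in anchor.Matches(html))
{
    // Named groups keep the href and the link text apart.
    pairs.Add(m.Groups["href"].Value + " -> " + m.Groups["text"].Value);
}
Console.WriteLine(string.Join("; ", pairs));   // NeedThis1 -> NeedThis2; NeedThis3 -> NeedThis4
```

This still misses the plain td cells and breaks on single-quoted or reordered attributes, which is where the pile of regexes starts to grow.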
