
Regex expression

I am trying to get all the text between the following tags and it is just not working.

If Not String.IsNullOrEmpty(_html) Then
    Dim regex As Regex = New Regex( _
        ".*<entry(?<link>.+)</entry>", _
        RegexOptions.IgnoreCase _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Multiline _
        )

    Dim ms As MatchCollection = regex.Matches(_html)
    Dim url As String = String.Empty
    For Each m As Match In ms
        url = m.Groups("link").Value
        urls.Add(url)
    Next
End If
Return urls

I have already written my fetch functions to get the HTML as a string. I was looking at an example for the HTML Agility Pack, but I don't have files saved as HTML documents.

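 // Requires: using HtmlAgilityPack;
 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 // Select every <a> element that has an href attribute.
 foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
     HtmlAttribute att = link.Attributes["href"];
     att.Value = FixLink(att);   // FixLink is the example's own helper, not part of the library
 }
 doc.Save("file.htm");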

I would use this software to help with your regexes.

Free RegExBuilder software.

The best way to do this in .NET is via the HTML Agility Pack. Using regular expressions on HTML is not usually a good idea.
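Since the question already has the HTML in a string rather than in a file, a minimal sketch of the HTML Agility Pack approach might look like the following. It assumes the HtmlAgilityPack assembly is referenced; the module and function names (`EntryExtractor`, `GetEntries`) are made up for illustration, and the XPath `//entry` is an assumption about which elements are wanted.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

' Hypothetical helper: names are illustrative, not from the original post.
Module EntryExtractor
    ' Returns the inner markup of every <entry> element found in an HTML string.
    Function GetEntries(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return results

        Dim doc As New HtmlDocument()
        ' LoadHtml parses from a string, so no file on disk is needed.
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//entry")
        If nodes Is Nothing Then Return results

        For Each node As HtmlNode In nodes
            results.Add(node.InnerHtml)
        Next
        Return results
    End Function
End Module
```

Calling `EntryExtractor.GetEntries(_html)` would then stand in for the regex loop in the question. Use `node.InnerText` instead of `node.InnerHtml` if only the text without nested tags is wanted.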

The exceptions are situations where you can make certain assumptions about the structure of the HTML, such as one-off jobs (where you can study the actual input for your program) or when the HTML is generated by a trusted source. For example, can you assume that the HTML is well-formed, or that tags will not be nested beyond a certain depth? (Note that neither of those assumptions is, by itself, good enough to build an expression that won't fall down given some edge case or other.)

If you meet these criteria, we need to know exactly what assumptions you are allowed to make before we can write an accurate expression.
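For illustration only, here is a sketch of what such an expression could look like under two explicitly assumed conditions: `<entry>` elements are never nested inside each other, and the markup comes from a trusted source. Compared with the pattern in the question, it drops the leading `.*`, makes the capture lazy (`.*?`) so each match stops at the first `</entry>`, and swaps `RegexOptions.Multiline` for `RegexOptions.Singleline`, because `Multiline` only changes how `^` and `$` behave while `Singleline` is what lets `.` match newlines. The module and function names are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical helper: valid only under the assumptions stated above.
Module EntryRegex
    ' <entry ...> followed by the shortest possible body up to the next </entry>.
    Private ReadOnly EntryPattern As New Regex( _
        "<entry\b[^>]*>(?<link>.*?)</entry>", _
        RegexOptions.IgnoreCase _
        Or RegexOptions.CultureInvariant _
        Or RegexOptions.Singleline)

    Function GetEntries(ByVal html As String) As List(Of String)
        Dim urls As New List(Of String)()
        If String.IsNullOrEmpty(html) Then Return urls

        For Each m As Match In EntryPattern.Matches(html)
            ' The named group holds everything between the opening and closing tag.
            urls.Add(m.Groups("link").Value)
        Next
        Return urls
    End Function
End Module
```

If either assumption fails (nested entries, or a literal `</entry>` inside a comment or CDATA section, for example), this expression will return wrong results, which is why a parser is the safer default.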

Obligatory "don't use regex to parse HTML" warning:

Using regex to parse HTML has been covered at length on SO. Please read the following post:

RegEx match open tags except XHTML self-contained tags

Would it be possible to convert your HTML to XHTML and parse it using XPath?

Using a tool like HTML Tidy or SGML you can do this conversion. Then you could use XPath to extract the desired data: //entry/link
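Once the markup is well-formed, the XPath step could be sketched with the standard `System.Xml` classes as below. This assumes the conversion has already been done, that the document declares no default XML namespace (otherwise the unprefixed XPath matches nothing and an `XmlNamespaceManager` would be needed), and that the wanted value is the text content of each `link` element. The module and function names are hypothetical.

```vbnet
Imports System.Collections.Generic
Imports System.Xml

' Hypothetical helper: expects input that has already been converted to well-formed XHTML/XML.
Module EntryXPath
    Function GetLinks(ByVal xhtml As String) As List(Of String)
        Dim links As New List(Of String)()

        Dim doc As New XmlDocument()
        ' LoadXml throws XmlException if the input is not well-formed.
        doc.LoadXml(xhtml)

        ' Select every <link> element that is a direct child of an <entry> element.
        For Each node As XmlNode In doc.SelectNodes("//entry/link")
            links.Add(node.InnerText)
        Next
        Return links
    End Function
End Module
```

If the link target is stored in an attribute rather than as element text, read it from `node.Attributes` instead of `InnerText`.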
