
Regex that finds hyperlinks while excluding plain text

I'm looking to scrape rapidshare.com links from websites. I have the following regular expressions to find links:

<a href=\"(http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4}))\"

http://rapidshare.com/files/(\\d+)/(.+)\\.(\\w{3,4})

How can I write a regex that excludes text embedded in an <a href="..."> tag and only captures the text in >here</a>?

I also have to bear in mind that not all links are embedded in href attributes. Some are just displayed in plain text.

Basically, is there a way to exclude patterns in a regex?

Thanks.

How about this? The final part matches everything up to the next ', ", > or space:

http://rapidshare.com/files/(\d+)/([^'"> ]+)
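For illustration, here is a minimal Python sketch of how that pattern behaves on a link inside an href and on a plain-text link. The sample HTML and variable names are made up for the example, and the dot in the host name is escaped, which the pattern above does not do.

```python
import re

# Pattern from the answer above (dot escaped); the character class stops
# at a single quote, a double quote, '>' or a space.
LINK_RE = re.compile(r"""http://rapidshare\.com/files/(\d+)/([^'"> ]+)""")

# Hypothetical sample: one link inside an href, one in plain text.
sample = (
    '<p><a href="http://rapidshare.com/files/123/archive.zip">download</a> '
    "or http://rapidshare.com/files/456/movie.avi here</p>"
)

matches = [m.group(0) for m in LINK_RE.finditer(sample)]
file_ids = [m.group(1) for m in LINK_RE.finditer(sample)]
print(matches)
print(file_ids)
```

Note that the class does not exclude <, so a plain-text link immediately followed by a tag would pick up part of that tag. In the sample the plain-text link is followed by a space.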

To capture the inner text of an anchor tag while ignoring all of the tag's attribute text, you'd use the pattern:

<a href="http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})[^>]*>(.*?)</a>

The [^>]* part matches everything else in the tag up to the end of the start tag. The (.*?) performs a non-greedy capture of the inner text.
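A rough Python sketch of the inner-text capture, under a couple of assumptions: the sample tag is invented, and the greedy (.+) is swapped for [^"]+ followed by an explicit closing quote so the file name cannot run past the end of the href value. That swap is my own tweak, not part of the pattern above.

```python
import re

# Variation on the anchor pattern: [^"]+ keeps the file name inside the
# href value, [^>]* skips remaining attributes, (.*?) is the inner text.
ANCHOR_RE = re.compile(
    r'<a href="http://rapidshare\.com/files/(\d+)/([^"]+)\.(\w{3,4})"[^>]*>(.*?)</a>'
)

# Hypothetical anchor tag with an extra attribute after href.
sample = (
    '<a href="http://rapidshare.com/files/123/archive.zip" class="dl">'
    "Download archive</a>"
)

m = ANCHOR_RE.search(sample)
file_id, name, ext, inner_text = m.groups()
print(inner_text)
```

The last group holds the text between > and </a>. The first three hold the file id, file name and extension from the href.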

If you want to capture both anchor-tag links and non-anchor-tag links, those are really two separate problems. There's probably a single regex for it, but it would be terribly complicated. You're better off simply looking for the non-anchor-tag links separately with the simple regex:

[^'"]http://rapidshare.com/files/(\d+)/(.+)\.(\w{3,4})
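As a sketch of that separate pass in Python, the version below makes two substitutions of my own, so treat it as one possible reading rather than the pattern above verbatim. A negative lookbehind (?<!['"]) stands in for the leading [^'"] so that a link at the very start of the text still matches. A bounded character class replaces (.+) so the match cannot run on into a later link on the same line.

```python
import re

# Plain-text links only: skip any URL directly preceded by a quote,
# which is how it would appear inside href="..." or href='...'.
PLAIN_RE = re.compile(
    r"""(?<!['"])http://rapidshare\.com/files/(\d+)/([^\s<>'"]+)\.(\w{3,4})"""
)

# Hypothetical sample: a plain-text link first, then a link inside an href.
sample = (
    "http://rapidshare.com/files/456/movie.avi and "
    '<a href="http://rapidshare.com/files/123/archive.zip">zip</a>'
)

plain_links = [m.group(0) for m in PLAIN_RE.finditer(sample)]
print(plain_links)
```

Only the unquoted link is reported. The href link would be picked up by the anchor-tag pattern in its own pass.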

How about something like:

/http:\/\/rapidshare\.com\/files\/\d+\/[^<&"\s]+\.\w{3,4}/

I got rid of the capturing groups, because I suspect you only had them to make sure the different parts grouped correctly. You can add them back in if you really want those pieces parsed out.

You can expand on the [^<&"\s] class. As written it only excludes whitespace, the < character (which could be the start of a tag), the & (which would begin things like &nbsp; and other HTML entities), and the " (which would be the end of the href=). You could exclude any character that isn't valid in a URI if you wanted. This should cover your inline text links as well as those embedded in an href.
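Here is a small Python sketch of this single-pattern approach, using the character class with " included as described. The sample markup is invented. Because one pattern catches both href values and visible text, a link whose anchor text is the URL itself shows up twice, so the sketch also de-duplicates while keeping order (that step is my addition).

```python
import re

# No capturing groups: the whole match is the URL. The class stops at
# '<', '&', a double quote or whitespace.
URL_RE = re.compile(r'http://rapidshare\.com/files/\d+/[^<&"\s]+\.\w{3,4}')

# Hypothetical sample: an anchor whose text repeats its href, followed by
# an entity and a plain-text link that runs straight into a tag.
sample = (
    '<a href="http://rapidshare.com/files/123/archive.zip">'
    "http://rapidshare.com/files/123/archive.zip</a>&nbsp;"
    "mirror: http://rapidshare.com/files/456/movie.avi<br>"
)

found = URL_RE.findall(sample)
# dict.fromkeys keeps the first occurrence of each URL, in order.
unique = list(dict.fromkeys(found))
print(unique)
```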
