
Regex - Compare two capture groups

I'm trying to create a regex to limit our spam intake. The problem is that I'm not exactly fluent in regular expressions. The work below is mostly copy and paste, tweaks, and searches for things that would help me tweak it further.

What I've decided to try is using a regex to match emails in which a link misrepresents its hostname.

For example:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

I basically only care about the hostnames, to limit false positives and to avoid flagging more or less legitimate links such as <A HREF...>click here!

To date, I have this:

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

According to https://regex101.com/ I have two named capture groups (hostname1 and hostname2), and a bunch of other groups that I'm not sure I care about.

What I want to do is match the string if hostname1 and hostname2 are the same. I get the feeling that it involves either a lookbehind or a lookahead, but I honestly don't know.

EDIT: Thanks to Jan for prototyping this. As per the comments on his answer, I made one quick addition to cover the unaccounted-for case of image tags. Large websites (BestBuy, for example) store their images on a different content server, which was triggering the rule. I've decided to exclude image tags, which I BELIEVE (in my very non-expert opinion) I have successfully done. YMMV.

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')

It somewhat depends on your programming language. In PHP you could come up with something like:

href=["']https?:\/\/(?<hostname>[^\/]+)[^>]+>(?:https?:\/\/)?\k'hostname'
# match href=, a single or double quote, then http:// or https:// literally
# capture everything up to (but not including) the next forward slash in a group called hostname
# followed by one or more characters other than >
# followed by >
# an optional non-capturing group (?:...) matching http:// or https://
# finally, a backreference that must match the previously captured hostname group

If this matches, it is presumably not a spam link (the href and the link text agree).
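For anyone applying the same idea from Python rather than PHP, here is a rough sketch of the backreference approach. Python's `re` module spells named groups `(?P<name>...)` and named backreferences `(?P=name)` instead of `(?<name>...)` and `\k'name'`. The names `SAME_HOST` and `link_text_matches_href` are made up for this example. The sketch also adds two guards that are not in the pattern above: a lookahead that forces the captured hostname to end at a delimiter, and a check that the link text does not continue past the hostname. Without the first guard the engine is free to backtrack to a shorter prefix, so a prefix like "some" could satisfy the backreference for somebadsite.com versus somegoodsite.com.

```python
import re

# Assumption: this is a Python port of the idea, not the PCRE pattern verbatim.
SAME_HOST = re.compile(
    r"""href=["']https?://       # href=, a quote, then http:// or https://
        (?P<hostname>[^/"':>]+)  # hostname: up to a slash, quote, colon or >
        (?=[/"':])               # added guard: hostname must end at a delimiter
        [^>]*>                   # rest of the opening tag
        (?:https?://)?           # optional scheme in the link text
        (?P=hostname)            # link text must repeat the same hostname
        (?![\w.-])               # added guard: and must not continue past it
    """,
    re.IGNORECASE | re.VERBOSE,
)


def link_text_matches_href(html: str) -> bool:
    """Return True if some link's visible text starts with its href hostname."""
    return SAME_HOST.search(html) is not None


if __name__ == "__main__":
    samples = [
        '<A HREF="http://phishers.org/we_want_your_money.htm">'
        'http://someLegitimateSite.com/somewhere </A>',
        '<a href="https://example.com/subfolder">example.com</a>',
        '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
    ]
    for sample in samples:
        print(link_text_matches_href(sample), sample)
```

Parsing HTML with a regex is always approximate, so treat this as a heuristic for a spam filter rather than a reliable link checker.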

An overview:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match, the others not
<a href="http://somebadsite.com">https://somegoodsite.com</a>

See a working example on regex101.com.

EDIT: According to your comment, you want the negated result. This can be done with a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same as before, except for the last part: (?!...)
# the negative lookahead asserts that the captured hostname does not appear anywhere in the rest of the line

See a working example for this regex here.
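The negated version can be sketched in Python along the same lines, again with `(?P<name>...)` and `(?P=name)` in place of the PCRE syntax. `MISMATCHED_HOST` and `mismatched_hosts` are hypothetical names. One deliberate deviation, flagged in the comments: the lookahead scans `[^<]*` rather than `.*`, so it only inspects the link text up to the next tag instead of the rest of the line. That is an assumption about what is wanted when several links share a line.

```python
import re

# Assumption: a Python port of the negative-lookahead idea, not the PCRE
# pattern verbatim.
MISMATCHED_HOST = re.compile(
    r"""href=["']https?://        # href=, a quote, then http:// or https://
        (?P<hostname>[^/"']+)     # hostname: up to the first slash or quote
        [^>]*>                    # rest of the opening tag
        (?:https?://)?            # optional scheme in the link text
        (?![^<]*(?P=hostname))    # deviation: look only as far as the next tag
    """,
    re.IGNORECASE | re.VERBOSE,
)


def mismatched_hosts(html: str) -> list[str]:
    """Return the href hostname of every link whose text does not contain it."""
    return [m.group("hostname") for m in MISMATCHED_HOST.finditer(html)]


if __name__ == "__main__":
    mail_body = "\n".join([
        '<A HREF="http://phishers.org/we_want_your_money.htm">'
        'http://someLegitimateSite.com/somewhere </A>',
        '<a href="https://example.com/subfolder">example.com</a>',
        '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
    ])
    for host in mismatched_hosts(mail_body):
        print("suspicious link to", host)
```

Note that a link whose text is plain wording such as "click here" is also reported by this pattern, because the hostname is absent from the text. If those links should pass, the link text would additionally have to be required to look like a URL.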
