Regex: comparing two capture groups

Question

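I'm trying to create a regular expression to cut down on the amount of spam we take in. The problem is that I'm not very good with regular expressions. What I have below is mostly copy and paste, tweaking, and searching for things that help me tweak further.

I've decided to try using a regex to match emails in which a link misrepresents its hostname.

For example: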

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>

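I basically only care about the hostname, both to limit false positives and to leave more-or-less legitimate links alone, such as <A HREF ...>Click here!

So far I have this: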

(HREF="http[s]?:\/\/)(?'hostname1'(.*?))[:|\/|"].*?\"\>(http[s]?:\/\/)(?'hostname2'(.*?))[<|\/|:]

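According to https://regex101.com/, I have two named capture groups (hostname1 and hostname2), plus some other capture groups I'm not sure about.

I want the string to match if hostname1 and hostname2 are the same. I have a feeling it involves backreferences or backtracking, but honestly I have no idea.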
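To make the group layout concrete, here is a minimal sketch that runs the regex above in Python and compares the two captured hostnames in code rather than inside the pattern. The only change to the pattern is Python's named-group spelling, `(?P<name>...)` instead of `(?'name'...)`; the helper name `hostnames_differ` and the case-insensitive comparison are assumptions made for illustration, not part of the original question.

```python
import re

# The question's pattern, ported to Python's named-group syntax.
# Besides the two named groups there are four unnamed ones, six in total.
PAIR = re.compile(
    r'(HREF="http[s]?:\/\/)(?P<hostname1>(.*?))[:|\/|"].*?\"\>'
    r'(http[s]?:\/\/)(?P<hostname2>(.*?))[<|\/|:]'
)


def hostnames_differ(html: str) -> bool:
    """Hypothetical helper: True if a link's text names a different host than its HREF."""
    m = PAIR.search(html)
    if m is None:
        return False  # no link whose text is itself a URL
    return m.group("hostname1").lower() != m.group("hostname2").lower()


sample = ('<A HREF="http://phishers.org/we_want_your_money.htm">'
          'http://someLegitimateSite.com/somewhere </A>')
print(PAIR.groups, dict(PAIR.groupindex))
print(hostnames_differ(sample))
```

Comparing in code sidesteps the question of which regex flavor supports which backreference syntax, at the cost of needing a scripting hook rather than a pure pattern match.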

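Edit: Thanks to Jan for producing this prototype. Based on the comments on his answer, I made a quick addition to cover a case that wasn't accounted for: image tags. Large sites (BestBuy, for example) store their images on separate content servers, which triggered the rule. I decided to exclude image tags, and I think (though I'm far from sure) I've managed to do that. YMMV.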

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>((?!<IMG).?)(?:https?:\/\/)?(?!.*\k'hostname')

Answer 1

It depends somewhat on your programming language. In PHP, you could come up with something like:

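href=["']https?:\/\/(?<hostname>[^\/"']+)(?=[\/"'])[^>]+>(?:https?:\/\/)?\k'hostname'
# match href, =, a single/double quote, then http:// or https:// literally
# capture everything up to (but not including) a forward slash or quote in a group called hostname
# the lookahead (?=[\/"']) forces the whole hostname to be captured, so the group
# cannot backtrack to a prefix (e.g. "some" in somebadsite.com / somegoodsite.com)
# followed by anything but >
# followed by >
# start a non capturing group (?:) with http/https://
# look if one can match the previously captured group called hostname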

If this matches, it is probably not a spam link (the href and the link text agree).

Overview:

<A HREF="http://phishers.org/we_want_your_money.htm">http://someLegitimateSite.com/somewhere </A>
<a href="https://example.com/subfolder">example.com</a> <-- will match; the others will not
<a href="http://somebadsite.com">https://somegoodsite.com</a>
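For readers without PHP at hand, the same idea can be sketched in Python. This is an illustrative port, not the answer's own code: Python writes the named group as `(?P<hostname>...)` and the backreference as `(?P=hostname)`, the `re.IGNORECASE` flag is an assumption so that `HREF` and `href` are both recognised, and the sketch adds a lookahead after the hostname so that the group cannot backtrack to a partial hostname such as `some`.

```python
import re

# Python port of the backreference idea (an assumption-laden sketch).
# (?=[/"']) requires the captured hostname to end at a slash or quote,
# so the group cannot shrink to a mere prefix of the real hostname.
SAME_HOST = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"']+)(?=[/"'])[^>]+>(?:https?://)?(?P=hostname)""",
    re.IGNORECASE,
)

samples = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">'
    'http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]

# True means the link text starts with the same hostname as the href.
results = [bool(SAME_HOST.search(s)) for s in samples]
print(results)  # [False, True, False]
```

Only the second sample matches, which agrees with the overview above.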

See a working example on regex101.com.

Edit: According to your comment, you need the negated result, which can be achieved with a negative lookahead:

href=["']https?:\/\/(?<hostname>[^\/"]+)[^>]+>(?:https?:\/\/)?(?!.*\k'hostname')
# same as before, except for the last part: (?!...)
# this one assures that the following group (hostname in our case) is not matched

See a working example of this regex here.
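The negated variant can be sketched in Python in the same way. Again this is a port under stated assumptions rather than the answer's own code: `(?P<hostname>...)` and `(?P=hostname)` replace the PCRE spellings, `re.IGNORECASE` is assumed, and a lookahead after the hostname keeps the captured group from shrinking to a prefix. A match now means the href's hostname appears nowhere in the rest of the line, so the link is flagged as suspicious.

```python
import re

# Negative lookahead: succeed only if the captured hostname does NOT
# occur anywhere after the opening tag (on the same line).
MISMATCH = re.compile(
    r"""href=["']https?://(?P<hostname>[^/"']+)(?=[/"'])[^>]+>(?:https?://)?(?!.*(?P=hostname))""",
    re.IGNORECASE,
)

samples = [
    '<A HREF="http://phishers.org/we_want_your_money.htm">'
    'http://someLegitimateSite.com/somewhere </A>',
    '<a href="https://example.com/subfolder">example.com</a>',
    '<a href="http://somebadsite.com">https://somegoodsite.com</a>',
]

# True means the link is flagged as misrepresenting its hostname.
flagged = [bool(MISMATCH.search(s)) for s in samples]
print(flagged)  # [True, False, True]
```

Note that `.` does not cross newlines by default, so link text that spans several lines would need `re.DOTALL` or a different approach.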

Answer 1: 0, accepted, 2015-12-17 19:08:48