Regex to match only parent pages in a URL

Question

I have a set of URLs like the ones below.

This one should not match:

https://example.com/parent/child.html

These should match:

https://example.com/parent.html

https://example.com/parent.html/page/page-number

https://example.com/anything

https://example.com/anything/page/page-number

https://example.com/anything/sub-anything

https://example.com/anything/sub-anything/page/page-number

I have searched a lot but found no solution. I tried this, but it does not work as expected:

/^(https:\/\/example\.com\/[^/]+\.html|https:\/\/example\.com\/[^/]+\.html\/(.+?)|https:\/\/example\.com\/anything\/[^/]+)$/

'parent', 'child', 'anything', and 'sub-anything' contain only word characters, digits, -, and %.

'page-number' is digits only.

What would be a good regex for this case?

Thanks a lot.

Answer 1

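Edit: changed \w to [\w\d-] to allow digits and dashes.

This is a very lazy regex that matches your test cases correctly, but it may not be usable beyond that. If you want to attract higher-quality answers, I suggest adding more examples of negative test cases.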

https?:\/\/[\w%-]++(?:\.com)?+(?(?=(\/[\w%-]+\/)[\w%-]+\.html)(?!)|.*)

If the parent can be nested more than one level deep, e.g. https://example.com/parent/parent2/child.html, and you still don't want it to match, then the following should do the trick:

https?:\/\/[\w%-]++(?:\.com)?+(?(?=(?:\/[\w%-]+)+\/[\w%-]+\.html)(?!)|.*)

An explanation of the latter:

https?       match "http" or "https"
:\/\/        match "://"
[\w%-]++    match any letters, numbers, '%', or '-'; don't allow backtracking (possessive)
(?:\.com)?+  match .com once if it's there, don't allow backtracking, don't store in capture group
(?(?=...)    if our positive lookahead matches
    (?:\/[\w%-]+)+    one or more groups of letter/number/'%'/'-' with a leading forward slash
    \/[\w%-]+\.html   followed by another forward slash, some letters/numbers/'%'/'-', then '.html'
(?!)         fail the match
|            else
.*)          match whatever is left
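
The pattern above relies on lookahead conditionals, which Python's `re` module does not support. For anyone who wants to try the idea there, here is a minimal sketch under a few assumptions: the conditional `(?(?=X)(?!)|.*)` is rewritten as the equivalent `(?!X).*`, the possessive quantifiers are emulated with a lookahead plus backreference so it also runs on Python versions older than 3.11, and the names `SEG`, `PARENT_ONLY`, and `is_parent_page` are made up for illustration.

```python
import re

# Character class used in the answer: word characters, '%', and '-'.
SEG = r"[\w%-]+"

# Assumption: this is a hand translation of the second regex, not the
# answer's original pattern.
PARENT_ONLY = re.compile(
    r"https?://"
    # Host part. `(?=(...))\1` emulates the possessive `++` / `?+`:
    # once the lookahead has captured the host, it is never re-split.
    rf"(?=({SEG}(?:\.com)?))\1"
    # Reject one or more /segment groups followed by /name.html.
    rf"(?!(?:/{SEG})+/{SEG}\.html)"
    # Otherwise accept whatever is left.
    r".*"
)


def is_parent_page(url: str) -> bool:
    """Return True when the whole URL is accepted by PARENT_ONLY."""
    return PARENT_ONLY.fullmatch(url) is not None


if __name__ == "__main__":
    # URLs taken from the question, with 3 standing in for page-number.
    for url in [
        "https://example.com/parent/child.html",
        "https://example.com/parent.html",
        "https://example.com/parent.html/page/3",
        "https://example.com/anything/sub-anything/page/3",
    ]:
        print(is_parent_page(url), url)
```

Blocking backtracking on the host is what makes the rejection stick: with an ordinary greedy quantifier the engine could give back `.com` (or part of the host name), retry the lookahead at a position where it no longer finds a `/`, and then accept the child URL after all.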

Here is the regex on Regex101.

Answer 1
1 accepted 2020-01-27 16:56:10
