
Notepad++ regex group capture

I have a text file like this:

ххх.prontube.ru
salo.ru
bbb.antichat.ru
yyy.ru
xx.bb.prontube.ru
zzz.com
srfsf.jwbefw.com.ua

I am trying to delete all subdomains with this regex:

Find:    .+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$
Replace with: \1

Result:

prontube.ru
salo.ru
antichat.ru
yyy.ru
prontube.ru
zzz.com
com.ua

Why does the last line become com.ua instead of jwbefw.com.ua?

This works without lookaround:

Find: [a-zA-Z0-9-.]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$
Replace: \1\.\2

It finds something with at least 2 periods and only letters, numbers, and dashes following the last two periods; then it replaces it with the last 2 parts. More intuitive, in my opinion.
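A minimal sketch of that find/replace, using Python's re module as a stand-in for Notepad++ (an assumption: the two engines agree on this simple pattern). The name keep_last_two is hypothetical, and the hyphen is moved to the end of the first character class, which is equivalent but unambiguous in Python.

```python
import re

# Same pattern as above; the hyphen is placed last in the first class.
LAST_TWO = re.compile(r"[a-zA-Z0-9.-]+\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9-]+)$")

def keep_last_two(host: str) -> str:
    """Replace a host that has at least two periods with its last two labels."""
    return LAST_TWO.sub(r"\1.\2", host)

for host in ["bbb.antichat.ru", "salo.ru", "xx.bb.prontube.ru", "srfsf.jwbefw.com.ua"]:
    print(host, "->", keep_last_two(host))
```

Because the pattern always keeps exactly two labels, a host under a two-part suffix such as srfsf.jwbefw.com.ua is reduced to com.ua.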

There's something funny going on with that leading xxx. It doesn't appear to be plain ASCII. For the sake of this question, I'm going to assume that's just something funny with this site and not representative of your real data.

Incorrect

Interestingly, I previously had an incorrect answer here that accumulated a lot of upvotes. So I think I should preserve it:

Find: [a-zA-Z0-9-]+\.([a-zA-Z0-9-]+)\.(.+)$
Replace: \1\.\2

It just finds a host name with at least 2 periods in it, then replaces it with everything after the first dot.

The .+ part is matching as much as possible. Try using .+? instead, and it will capture as little as possible, allowing the com.ua option to match.

.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$

This answer still uses the specific domain names that the original question was looking at. As some TLDs (top-level domains) have a period in them, and you could theoretically have a list including multiple subdomains, whitelisting the TLDs in the regex is a good idea if it works with your data set. Both current answers (from 2013) will not handle the difference between "xx.bb.prontube.ru" and "srfsf.jwbefw.com.ua" correctly.

Here is a quick explanation of why psnig's original regex isn't working as intended:
The + is greedy. .+ will zip all the way to the right end of the line, capturing everything, then work its way backwards (to the left) looking for a match from here:

(ru|ua|com\.ua|com|net|info)

With srfsf.jwbefw.com.ua, the regex engine will first fail to match at the final a, then give back one character at a time to the left. The first place where the rest of the pattern fits is the period before com: there (.*?) matches com, and ua from the alternation (the second option) is a match.

The engine will not keep looking to find "com.ua" because ".ua" already met that requirement.
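The greedy behaviour can be checked outside Notepad++. The following is a minimal sketch using Python's re module, assuming it backtracks the same way as Notepad++'s engine for this pattern; original_groups is a hypothetical helper that returns the three capture groups of the asker's regex.

```python
import re

# The asker's original pattern, with a greedy .+ in front.
ORIGINAL = re.compile(r".+\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

def original_groups(host):
    """Return (group 1, group 2, group 3) for a host, or None if no match."""
    m = ORIGINAL.match(host)
    return m.groups() if m else None

# Group 1 is what \1 inserts; groups 2 and 3 show how it was split.
print(original_groups("srfsf.jwbefw.com.ua"))  # ('com.ua', 'com', 'ua')
print(original_groups("xx.bb.prontube.ru"))    # ('prontube.ru', 'prontube', 'ru')
```

The first result shows the greedy .+ swallowing srfsf.jwbefw, which leaves only com.ua for group 1.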

Niet the Dark Absol's answer tells the regex to be "lazy".
.+? will match any character (at least one) and then try to find the next part of the regex. If that fails, it will advance the token, with .+? matching one more character, and then evaluate the rest of the regex again.
For srfsf.jwbefw.com.ua, the .+? will eventually consume srfsf before the period matches; the rest of the pattern then matches jwbefw and com.ua.

But adding the ? also creates issues.

Adding in the question mark makes that first .+ lazy, but then causes group 1 to match bb.prontube.ru instead of prontube.ru

This is because the first period (the one before bb) will match \. and then, inside group 1, (.*?) will match bb.prontube before \.(ru|ua|com\.ua|com|net|info))$ matches .ru
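A short sketch of this side effect, again using Python's re module as an assumed stand-in for Notepad++. The pattern is the original one with only the leading .+ changed to .+?, and lazy_group1 is a hypothetical helper.

```python
import re

# Original pattern with the leading .+ made lazy; the inner (.*?) is unchanged.
LAZY = re.compile(r".+?\.((.*?)\.(ru|ua|com\.ua|com|net|info))$")

def lazy_group1(host):
    """Return what \\1 would insert for this host, or None if no match."""
    m = LAZY.match(host)
    return m.group(1) if m else None

print(lazy_group1("srfsf.jwbefw.com.ua"))  # jwbefw.com.ua
print(lazy_group1("xx.bb.prontube.ru"))    # bb.prontube.ru
```

The com.ua line is now handled, but the extra subdomain bb survives because (.*?) is allowed to cross a period.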

To avoid this, change the inner group from (.*?) to ([\w-]*?) so that it won't capture a period, only letters, digits, underscores, or dashes.

Resulting regex:
.+?\.(([\w-]*?)\.(ru|ua|com\.ua|com|net|info))$

Note that you don't need to capture any groups other than the first. Adding ?: makes the TLD options non-capturing.

Last change:
.+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$
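To check the final pattern against the sample data from the question, here is a minimal sketch in Python's re module (an assumption: it behaves like Notepad++ for this pattern). The function name strip_subdomains is hypothetical, and the first sample host is written with plain ASCII xxx.

```python
import re

# Final pattern: lazy prefix, one dot-free label, whitelisted TLD.
FINAL = re.compile(r".+?\.([\w-]*?\.(?:ru|ua|com\.ua|com|net|info))$")

def strip_subdomains(host: str) -> str:
    """Drop leading subdomains; hosts that do not match are returned unchanged."""
    return FINAL.sub(r"\1", host)

hosts = [
    "xxx.prontube.ru",
    "salo.ru",
    "bbb.antichat.ru",
    "yyy.ru",
    "xx.bb.prontube.ru",
    "zzz.com",
    "srfsf.jwbefw.com.ua",
]
for host in hosts:
    print(host, "->", strip_subdomains(host))
```

Hosts that are already bare domains, such as salo.ru, do not match and pass through untouched, which mirrors Notepad++ leaving non-matching lines alone.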

Search what: .+?\.(\w+\.(?:ru|com|com\.au))
Replace with: $1

Look at the picture to see what the regex capture refers to.
It is color-coded, so you will not need a regex explanation anymore.

[image]
