How to get the inverse of a regular expression?

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:

(http://)([a-zA-Z0-9\/\.])*

If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?

You could simply search and replace everything that matches the regular expression with an empty string, e.g. in Perl: s/(http:\/\/)([a-zA-Z0-9\/\.])*//g

This would give you everything in the original text, except those substrings that match the regular expression.
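The same search-and-replace idea can be sketched in Python. This is a minimal sketch, not the answerer's code: the names URL_PATTERN, strip_urls and non_url_chunks are made up for illustration, and the capture groups from the question's pattern are dropped because re.split would otherwise return the captured text as well.

```python
import re

# The question's URL pattern, written without capture groups
# (an assumption made so that re.split returns only the non-URL text).
URL_PATTERN = re.compile(r"http://[a-zA-Z0-9/.]*")


def strip_urls(text):
    """Return the text with every URL match replaced by an empty string."""
    return URL_PATTERN.sub("", text)


def non_url_chunks(text):
    """Return the pieces of text that lie between URL matches."""
    return [chunk for chunk in URL_PATTERN.split(text) if chunk]


sample = "see http://a.com/x and http://b.org now"
print(strip_urls(sample))
print(non_url_chunks(sample))
```

Splitting on the pattern keeps the non-URL pieces separate, which may be closer to "the inverse" than one joined string if each piece is needed on its own.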

If for some reason you need a regex-only solution, try this:

((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)

I expanded the set of URL characters a little ( [a-zA-Z0-9\/\.#?/%] ) to include a few important ones, but this is by no means meant to be exact or exhaustive.

The regex is a bit of a monster, so I'll try to break it down:

(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))

The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion.

We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file ( \A ), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)

Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.

Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z) . Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]) . The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z . As with the first big group, we wrap it in parentheses and OR the two possibilities together.

The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
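The precedence point is easy to see with a toy pattern. The following is only an illustration in Python, with made-up pattern names; it is not part of the URL regex itself.

```python
import re

# Without parentheses, | splits the whole pattern in two:
# this reads as (\Aab) OR (cd\Z), not \A followed by (ab or cd) then \Z.
unwrapped = re.compile(r"\Aab|cd\Z")

# With a group, the anchors apply to both alternatives.
wrapped = re.compile(r"\A(?:ab|cd)\Z")

print(bool(unwrapped.search("abXYZ")))  # left branch only needs "ab" at the start
print(bool(wrapped.search("abXYZ")))    # the whole string must be "ab" or "cd"
print(bool(wrapped.search("cd")))
```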

This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out Start of String and End of String Anchors and Lookahead and Lookbehind Zero-Width Assertions.
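To get a feel for the zero-width pieces, here is a partial sketch in Python covering only the start-of-file branch and the closing lookahead. The variable-length lookbehind branch is left out on purpose: Python's built-in re module only accepts fixed-width lookbehind, so the full regex above will not compile there. The names URL_START, LEADING_TEXT and leading_non_url_text are hypothetical.

```python
import re

# Start of a URL, using the expanded character set from the answer.
URL_START = r"http://[a-zA-Z0-9/.#?%]"

# \A(?!URL)   : at the start of the text, and no URL begins here
# .+?         : lazily take the text we want
# (?=URL|\Z)  : stop just before the next URL or at the end of the text
LEADING_TEXT = re.compile(
    r"\A(?!" + URL_START + r").+?(?=" + URL_START + r"|\Z)"
)


def leading_non_url_text(text):
    """Return the non-URL text at the start of `text`, or None if a URL comes first."""
    match = LEADING_TEXT.match(text)
    return match.group(0) if match else None


print(repr(leading_non_url_text("intro text http://a.com/x tail")))
print(repr(leading_non_url_text("http://a.com/x tail")))
```

Because the assertions consume nothing, the returned text stops right before the URL rather than including any part of it.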

Corrections welcome, of course!

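If I understand the question correctly, you can use search/replace... just put wildcards around your expression and then substitute the first and last parts.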

s/^(.*)(your regex here)(.*)$/$1$3/
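A quick Python sketch of this wildcard substitution, with the question's URL pattern plugged in as an assumption (the names URL, WRAPPED and drop_one_url are made up). One thing it shows is that a single pass removes only one URL per line, and because the leading (.*) is greedy, it is the last one.

```python
import re

# The question's URL pattern, dropped into the middle group.
URL = r"http://[a-zA-Z0-9/.]*"
WRAPPED = re.compile(r"^(.*)(" + URL + r")(.*)$")


def drop_one_url(line):
    """Keep groups 1 and 3, discarding whatever the middle group matched."""
    return WRAPPED.sub(r"\1\3", line)


print(drop_one_url("a http://x.com b http://y.org c"))
```

For lines that may hold several URLs, the substitution would have to be repeated until nothing changes, or replaced by a global substitution on the URL pattern alone.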

I'm not sure if this will work exactly as you intend, but it might help: whatever you place in the brackets [] will be matched against. If you put ^ within the brackets, i.e. [^a-zA-Z0-9/.], it will match everything except what is in the brackets.
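As a small Python illustration of what the negated class does (the sample string is made up), note that the negation works character by character. It finds runs of characters outside the set, which is not the same thing as "all text that is not a URL".

```python
import re

# Matches runs of characters that are NOT letters, digits, "/" or "."
NOT_IN_SET = re.compile(r"[^a-zA-Z0-9/.]+")

sample = "see http://a.com now"
print(NOT_IN_SET.findall(sample))
```

Here only the spaces and the colon fall outside the set, so ordinary words like "see" and "now" are not returned.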

http://www.regular-expressions.info/
