How can I translate the following filename to a regular expression in Python?

I am battling regular expressions as I type.

I would like to determine a pattern for the following example file: b410cv11_test.ext. I want to be able to search for files that match the pattern of that example file. Where do I start (so lost and confused), and what is the best way of arriving at a solution that best matches the file pattern? Thanks in advance.

Further clarification of question:

I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by '.ext'

Now that you have a human-readable description of your file name, it's quite straightforward to translate it into a regular expression (at least in this case ;)

must start with

The caret ( ^ ) anchors a regular expression to the beginning of what you want to match, so your re has to start with this symbol.

'b',

Any non-special character in your re will match literally, so you just use "b" for this part: ^b .

followed by [...] digits,

This depends a bit on which flavor of re you use:

The most general way of expressing this is to use brackets ( [] ). Those mean "match any one of the characters listed within". [ASDF] for example would match either A or S or D or F, and [0-9] would match any single character between 0 and 9.

Your re library probably has a shortcut for "any digit". In sed and awk you could use [[:digit:]] [sic!]; in Python and many other languages you can use \d .

So now your re reads ^b\d .
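As a small illustration of the two spellings, the sketch below compiles both with Python's re module and checks a couple of names. The constant and function names are made up for this example. One difference worth knowing: on Python 3 strings, \d also accepts non-ASCII decimal digits, while [0-9] does not.

```python
import re

# Two ways of saying "a 'b' at the start, followed by one digit".
WITH_CLASS = re.compile(r"^b[0-9]")
WITH_SHORTCUT = re.compile(r"^b\d")


def starts_with_b_digit(name):
    """Return (class_result, shortcut_result) for the same name."""
    return bool(WITH_CLASS.search(name)), bool(WITH_SHORTCUT.search(name))


print(starts_with_b_digit("b410cv11_release.ext"))  # (True, True)
print(starts_with_b_digit("x410cv11_release.ext"))  # (False, False)
# "\u0664" is ARABIC-INDIC DIGIT FOUR: a digit for \d, but not for [0-9].
print(starts_with_b_digit("b\u0664"))  # (False, True)
```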

followed by three [...]

The simplest way to express this would be to just repeat the atom three times, like this: \d\d\d .

Again your language might provide a shortcut: braces ( {} ). Sometimes you would have to escape them with a backslash (if you are using sed or awk, read about "extended regular expressions"). They also give you a way to say "at least x, but no more than y occurrences of the previous atom": {x,y} .

Now you have: ^b\d{3}
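A quick way to convince yourself that the repeated and the counted forms are equivalent is to run both against the same names. This is only a sketch; the {2,4} variant and the helper name prefix are hypothetical additions to show the {x,y} form.

```python
import re

REPEATED = re.compile(r"^b\d\d\d")   # atom written out three times
COUNTED = re.compile(r"^b\d{3}")     # same thing with a repetition count
RANGED = re.compile(r"^b\d{2,4}")    # hypothetical: two to four digits


def prefix(pattern, name):
    """Return the part of name matched by pattern, or None."""
    match = pattern.match(name)
    return match.group(0) if match else None


print(prefix(REPEATED, "b410cv11_release.ext"))  # b410
print(prefix(COUNTED, "b410cv11_release.ext"))   # b410
print(prefix(COUNTED, "b41cv11_release.ext"))    # None (only two digits)
print(prefix(RANGED, "b41cv11_release.ext"))     # b41
```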

followed by 'cv',

Literal matching again; now we have ^b\d{3}cv

followed by two digits,

We already covered this: ^b\d{3}cv\d{2} .

then an underscore, followed by 'release', followed by '.ext'

Again, this should all match literally, but the dot ( . ) is a special character that matches any character. This means you have to escape it with a backslash: ^b\d{3}cv\d{2}_release\.ext

Leaving out the backslash would mean that a filename like "b410cv11_release_ext" would also match, which may or may not be a problem for you.
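The effect of the missing backslash is easy to check in Python. The sketch below compares an escaped and an unescaped version of the pattern; the names ESCAPED, UNESCAPED and compare are made up for the example. If you build patterns from literal text, re.escape can add the backslashes for you.

```python
import re

ESCAPED = re.compile(r"^b\d{3}cv\d{2}_release\.ext")   # dot must be a dot
UNESCAPED = re.compile(r"^b\d{3}cv\d{2}_release.ext")  # dot is a wildcard


def compare(name):
    """Return (escaped_result, unescaped_result) for one file name."""
    return bool(ESCAPED.match(name)), bool(UNESCAPED.match(name))


print(compare("b410cv11_release.ext"))  # (True, True)
print(compare("b410cv11_release_ext"))  # (False, True)

# re.escape produces the escaped form of a literal string.
print(re.escape(".ext"))  # \.ext
```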

Finally, if you want to guarantee that nothing else follows ".ext", anchor the re to the end of the thing to match with the dollar sign ( $ ).

Thus the complete regular expression for your specific problem would be:

^b\d{3}cv\d{2}_release\.ext$

Easy.
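In Python, the finished expression could be used along these lines. This is a minimal sketch: the candidate names are invented for the demonstration, and matching_names is a hypothetical helper, not part of any library.

```python
import re

# Raw string so the backslashes reach the regex engine unchanged.
RELEASE_FILE = re.compile(r"^b\d{3}cv\d{2}_release\.ext$")


def matching_names(names):
    """Keep only the names that fit the pattern."""
    return [name for name in names if RELEASE_FILE.match(name)]


candidates = [
    "b410cv11_release.ext",
    "b410cv11_test.ext",          # 'test' instead of 'release'
    "ab410cv11_release.ext",      # does not start with 'b'
    "b410cv11_release.ext.bak",   # something follows '.ext'
    "b4100cv11_release.ext",      # four digits instead of three
]
print(matching_names(candidates))  # ['b410cv11_release.ext']
```

To search a real directory you could pass os.listdir(some_directory) instead of the hard-coded list. Note that in Python $ also matches just before a trailing newline; if that matters, RELEASE_FILE.fullmatch(name) is stricter.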

Whatever language or library you use, there should be a reference somewhere in its documentation that shows you what the exact syntax in your case should be. Once you have learned to break down the problem into a suitable description, understanding the more advanced constructs will come to you step by step.

To avoid confusion, read the following, in order.

First, you have the glob module, which handles file name wildcard patterns just like the Windows and Unix shells.

Second, you have the fnmatch module, which just does pattern matching using the Unix shell rules.

Third, you have the re module, which is the complete set of regular expressions.
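To make the difference between the three modules concrete, here is a rough sketch that looks for the same kind of file with each of them. The shell-style pattern, the regex and the function names are assumptions chosen for this example; shell patterns have no repetition counts, so each digit is spelled out as [0-9].

```python
import fnmatch
import glob
import os
import re

# Shell-style pattern: every digit needs its own [0-9].
SHELL_PATTERN = "b[0-9][0-9][0-9]cv[0-9][0-9]_release.ext"
# Regular expression for the same names.
REGEX = re.compile(r"b\d{3}cv\d{2}_release\.ext")


def find_with_glob(directory):
    """glob looks at the file system and returns matching paths."""
    # glob.escape protects any special characters in the directory name.
    pattern = os.path.join(glob.escape(directory), SHELL_PATTERN)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


def find_with_fnmatch(names):
    """fnmatch only compares strings; it never touches the disk."""
    return sorted(n for n in names if fnmatch.fnmatchcase(n, SHELL_PATTERN))


def find_with_re(names):
    """re gives the full regular expression syntax."""
    return sorted(n for n in names if REGEX.fullmatch(n))
```

For example, find_with_fnmatch(os.listdir(".")) and find_with_re(os.listdir(".")) should give the same list as find_with_glob(".") for names like the ones in the question.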

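Then ask another, more specific question.

I would like the pattern to be as follows: must start with 'b', followed by three digits, followed by 'cv', followed by two digits, then an underscore, followed by 'release', followed by '.ext'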

^b\d{3}cv\d{2}_release\.ext$

Your question is a bit unclear. You say you want a regular expression, but could it be that you want a glob-style pattern you can use with commands like ls? Glob expressions and regular expressions are similar in concept but different in practice (regular expressions are considerably more powerful; glob-style patterns are easier for the most common cases when looking for files).

Also, what do you consider to be the pattern? Certainly, * (glob) or .* (regex) will match the example file. Also, *_test.ext (glob) or .*_test.ext (regexp) would match, as would many other variations.
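To see how much the choice of pattern matters, the sketch below runs a few glob and regex patterns of increasing strictness over the same names. The list of names and the two helper functions are invented for this illustration.

```python
import fnmatch
import re

names = ["b410cv11_test.ext", "notes_test.ext", "b410cv11_release.ext"]


def glob_hits(pattern):
    """Names accepted by a shell-style (glob) pattern."""
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]


def regex_hits(pattern):
    """Names accepted in full by a regular expression."""
    return [n for n in names if re.fullmatch(pattern, n)]


print(glob_hits("*"))               # all three names
print(regex_hits(r".*"))            # all three names
print(glob_hits("*_test.ext"))      # ['b410cv11_test.ext', 'notes_test.ext']
print(regex_hits(r".*_test\.ext"))  # ['b410cv11_test.ext', 'notes_test.ext']
print(glob_hits("b[0-9][0-9][0-9]cv[0-9][0-9]_test.ext"))  # ['b410cv11_test.ext']
```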

Can you be more specific about the pattern? For example, you might describe it as "b, followed by digits, followed by cv, followed by digits ..."

Once you can precisely explain the pattern in your native language (and that must be your first step), it's usually a fairly straightforward task to translate that into a glob or regular expression pattern.

If the letters don't matter, you could try \w\d\d\d\w\w\d\d_test\.ext, which matches the letter/digit pattern, or b\d\d\dcv\d\d_test\.ext, or a mix of the two.

When working with regexes I find the Mochikit regex example to be a great help.

/^b\d\d\dcv\d\d_test\.ext$/

Then use the Python re (regex) module to do the match. This of course assumes that a regex is really what you need, and not a glob as the others mentioned.
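A minimal sketch of that step, assuming the files sit in a single directory: the surrounding slashes in /.../ are only delimiters in other languages and are dropped in Python. The function name find_test_files is made up for this example.

```python
import os
import re

# Same expression as above, without the /.../ delimiters.
TEST_FILE = re.compile(r"^b\d\d\dcv\d\d_test\.ext$")


def find_test_files(directory="."):
    """Return the sorted names in directory that fit the pattern."""
    return sorted(
        name for name in os.listdir(directory) if TEST_FILE.match(name)
    )


if __name__ == "__main__":
    print(find_test_files("."))
```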
