
How can I find a string that contains anything between two regular expressions, except for a certain regex, in Python?

I'm trying to write a regular expression to sift through 3 MB of text and find certain strings. Right now it works relatively well, except for one problem.

The current expression I'm using is:

pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

This searches through the enormous string and finds all occurrences of 4 uppercase alphabetic characters, followed by a space, followed by 3 digits, followed by 4-40 characters of any kind, followed by a space, followed by (n), where n is any digit.

What I'm looking for is something like ACCT 220 Principles of Accounting I (3)

This is exactly what I want, except that it sometimes catches the pattern too early. In some places in the document, one class code immediately precedes the class where the pattern is supposed to start. For example, I'll end up with BMGT 310.ACCT 220 Principles of Accounting I (3)
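The early match described above can be reproduced on a single line. This is only a minimal sketch: the sample string `text` is assumed from the example in the question and is not taken from the real 3 MB document.

```python
import re

# Pattern from the question: 4 uppercase letters, a space, 3 digits,
# 4-40 arbitrary characters, a space, then one digit in parentheses.
pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

# Hypothetical sample line, built from the example in the question.
text = 'BMGT 310.ACCT 220 Principles of Accounting I (3)'

# The match starts at BMGT because .{4,40} is free to swallow '.ACCT 220 ...'.
matches = pattern.findall(text)
print(matches)  # ['BMGT 310.ACCT 220 Principles of Accounting I (3)']
```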

I figured one way to get around this would be to not allow the .{4,40} portion of the regular expression to contain 4 uppercase letters. I've tried using ^ to no avail.

For example, I tried something along the lines of [A-Z]{4} \d{3}([^A-Z]{4}){4,40} \(\d\), but then I end up with an empty list, since the expression didn't find anything.

I think I just don't understand regex syntax well enough yet. If anyone knows how to fix my expression so that it finds all instances of 4 uppercase letters, followed by a space, followed by three digits, followed by 4-40 characters of any kind that do NOT contain 4 capital letters in a row, followed by a space, followed by (n), where n is a digit, that would be awesome and greatly appreciated.

I understand this question might be rather confusing. If you need any more information from me, please let me know.

If you don't want to match 4 uppercase letters in a row, you can make use of a negative lookahead, and then match 1 character at a time, repeated with {4,40}:

The relevant piece of your current working regex:

.{4,40}

To be changed to:

(?:(?![A-Z]{4}).){4,40}
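Dropped into the full expression from the question, the replacement could look like the sketch below. The sample string `text` and the variable names are assumptions made for illustration; only the pattern pieces come from the question and this answer.

```python
import re

# Question's pattern with .{4,40} swapped for the lookahead-guarded version.
fixed = re.compile(r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)')

# Hypothetical sample line, built from the example in the question.
text = 'BMGT 310.ACCT 220 Principles of Accounting I (3)'

# Starting at BMGT now fails: after '.', the next position begins 'ACCT',
# so the guarded dot cannot reach the required 4 characters.
fixed_matches = fixed.findall(text)
print(fixed_matches)  # ['ACCT 220 Principles of Accounting I (3)']
```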


A negative lookahead (?! ... ) makes a match fail if what's inside it matches. Since we have (?![A-Z]{4}), the match will fail at any position where 4 uppercase letters follow in a row. Lookaheads are zero-width assertions: they consume no characters, so the final match isn't affected at all, which is also why I'm still using a . for the main matching.


A simple example that might help explain how negative lookaheads work, and how to understand the zero-width assertion, is this:

w(?!o)

This regex will match the w (note that no o is part of the match) in way, whole and below, but not the w in word.
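To check this in Python, one possible sketch is shown below. The test string `words` is simply the four example words joined by spaces, which is an assumption for illustration.

```python
import re

# The four example words, joined into one hypothetical test string.
words = 'way whole below word'

hits = list(re.finditer(r'w(?!o)', words))

# Start offsets of each match: the w in 'way', 'whole' and 'below'.
starts = [m.start() for m in hits]
# The matched text is only 'w'; the lookahead itself consumes nothing.
matched = [m.group() for m in hits]

print(starts)   # [0, 4, 14]
print(matched)  # ['w', 'w', 'w']
```

The w in word starts at offset 16 and is skipped because an o follows it directly.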

(?![A-Z]{4}). will thus match any character ., unless this . is an uppercase character followed by 3 more uppercase characters (making 4 consecutive uppercase characters).

To repeat this . now, you cannot just use (?![A-Z]{4}).{4,40}, because the negative lookahead will only apply to the first . and not the others. The trick is thus to put (?![A-Z]{4}). in a group and then repeat it:

((?![A-Z]{4}).){4,40}
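The difference between the two placements of the lookahead can be seen side by side. This is a rough sketch; the names `naive` and `grouped` and the sample string are assumptions for illustration.

```python
import re

# Hypothetical sample line, built from the example in the question.
text = 'BMGT 310.ACCT 220 Principles of Accounting I (3)'

# Lookahead outside the repetition: checked once, before the first character only.
naive = re.compile(r'[A-Z]{4} \d{3}(?![A-Z]{4}).{4,40} \(\d\)')
# Lookahead inside the repeated group: checked before every single character.
grouped = re.compile(r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)')

naive_match = naive.search(text).group()
grouped_match = grouped.search(text).group()

print(naive_match)    # BMGT 310.ACCT 220 Principles of Accounting I (3)
print(grouped_match)  # ACCT 220 Principles of Accounting I (3)
```

In the naive version the lookahead only inspects the position right after 310, where a . follows, so ACCT later in the run is never checked.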

Last, I prefer using non-capturing groups (?: ... ) because they make the regex a bit more efficient, since they don't store captures:

(?:(?![A-Z]{4}).){4,40}
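Beyond efficiency, the group type also changes what re.findall returns, since findall returns the group's contents whenever the pattern has exactly one capturing group. A small sketch, with an assumed sample string and illustrative variable names:

```python
import re

# Hypothetical sample line, built from the example in the question.
text = 'BMGT 310.ACCT 220 Principles of Accounting I (3)'

capturing = r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)'
non_capturing = r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)'

# With one capturing group, findall returns what that group last captured,
# which here is a single character rather than the whole course entry.
with_group = re.findall(capturing, text)
# With a non-capturing group, findall returns the full matched text.
without_group = re.findall(non_capturing, text)

print(with_group)
print(without_group)  # ['ACCT 220 Principles of Accounting I (3)']
```

If the capturing form is kept for some reason, re.finditer with m.group(0) would still give the full match.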

