How can I find a string that contains any string between two regular expressions, except for a certain regular expression, in Python?

Question

I'm trying to write a regular expression to sift through 3 MB of text and find certain strings. Right now it works relatively well, except for one problem.

The current expression I'm using is

pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')

This searches through the enormous string and finds all occurrences of 4 uppercase alpha characters followed by a space, followed by 3 digits, followed by 4-40 characters of any kind, followed by a space, followed by (n) where n is any digit.

What I'm looking for is something like ACCT 220 Principles of Accounting I (3)

This is exactly what I want, except that it sometimes catches the pattern too early. In some places in the document, another class precedes the class where the pattern is supposed to start. For example, I'll end up with BMGT 310.ACCT 220 Principles of Accounting I (3)

I figured one way to get around this would be to not allow the .{4,40} portion of the regular expression to contain 4 uppercase letters. I've tried using ^ to no avail.

For example, I tried something along the lines of [A-Z]{4} \d{3}([^A-Z]{4}){4,40} \(\d\) but then I end up with an empty list since the expression didn't find anything.

I'm thinking that I just don't understand regex syntax well enough yet. If anyone knows how to fix my expression so that it will find all instances of 4 uppercase letters followed by a space, followed by three digits, followed by 4-40 characters of any kind that do NOT contain 4 capital letters in a row, followed by a space, followed by (n) where n is a digit, that would be awesome and greatly appreciated.

I understand this question might be rather confusing. If you need any more information from me, please let me know.

Answer 1

If you don't want to match 4 uppercase letters in a row, you can make use of a negative lookahead, and then match 1 character at a time with {4,40}:

Piece of your current working regex:

.{4,40}

To be changed to:

(?:(?![A-Z]{4}).){4,40}

regex101 demo
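As a rough check in Python, the sketch below runs the original pattern and the adjusted one against a sample line. The sample text and the variable names (old_pattern, new_pattern, text) are made up for illustration; only the two patterns come from the question and this answer.

```python
import re

# The asker's original pattern and the same pattern with the lookahead group.
old_pattern = re.compile(r'[A-Z]{4} \d{3}.{4,40} \(\d\)')
new_pattern = re.compile(r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)')

# Hypothetical sample: a stray course code sits right before the wanted entry.
text = "See BMGT 310.ACCT 220 Principles of Accounting I (3) for details."

old_matches = old_pattern.findall(text)
new_matches = new_pattern.findall(text)

print(old_matches)  # ['BMGT 310.ACCT 220 Principles of Accounting I (3)']
print(new_matches)  # ['ACCT 220 Principles of Accounting I (3)']
```

With the lookahead in place, the attempt that starts at BMGT 310 stops after the . because the next four characters (ACCT) are all uppercase, so the engine moves on and matches from ACCT 220 instead.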

A negative lookahead (?! ... ) will make a match fail if what's inside it matches. Since we have (?![A-Z]{4}), the match will fail at that position if the next 4 characters are all uppercase. Lookaheads are zero-width assertions: they consume no characters, so the final match won't be affected at all, which is also why I'm still using a . for the main matching.

A simple example which might help explain how negative lookaheads work and how to understand the zero-width assertion is this:

w(?!o)

This regex will match the w (note that no o is included in the match) in way, whole, below but not the w in word.
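The same four words can be tried in Python. This is only a small sketch; the names lookahead and results are illustrative.

```python
import re

lookahead = re.compile(r'w(?!o)')

results = {}
for word in ["way", "whole", "below", "word"]:
    m = lookahead.search(word)
    # Store the matched text, or None when the lookahead rejects every w.
    results[word] = m.group() if m else None

print(results)
# {'way': 'w', 'whole': 'w', 'below': 'w', 'word': None}
```

Note that the matched text is only ever the single character w: the lookahead inspects the next character but does not consume it. In below, the w is the last character, so there is no o after it and the negative lookahead succeeds.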

(?![A-Z]{4}). will thus match ., unless this . is an uppercase character followed by 3 more uppercase characters (making 4 consecutive uppercase characters).

To repeat this . now, you cannot just use (?![A-Z]{4}).{4,40} because the negative lookahead will only apply to the first . and not the others. The trick is thus to put (?![A-Z]{4}). in a group and then repeat:

((?![A-Z]{4}).){4,40}
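To see the difference between the ungrouped and grouped forms, here is a small sketch using re.fullmatch on a made-up string that has four uppercase letters in the middle. The names ungrouped, grouped, and sample are assumptions for this illustration.

```python
import re

# Lookahead checked once, before the first character only.
ungrouped = re.compile(r'(?![A-Z]{4}).{4,40}')
# Lookahead checked before every character.
grouped = re.compile(r'((?![A-Z]{4}).){4,40}')

sample = "ab WXYZ cd"  # hypothetical text with 4 uppercase letters mid-string

ungrouped_ok = ungrouped.fullmatch(sample) is not None
grouped_ok = grouped.fullmatch(sample) is not None

print(ungrouped_ok)  # True: only "ab W" is inspected by the lookahead
print(grouped_ok)    # False: the lookahead fails when it reaches "WXYZ"
```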

Last, I prefer using non-capture groups (?: ... ) because they make the regex a bit more efficient since they don't store captures:

(?:(?![A-Z]{4}).){4,40}
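In Python there is also a practical reason to prefer the non-capture form if you collect results with re.findall: when a pattern contains a capturing group, findall returns the group's contents instead of the whole match. The sketch below illustrates this with a made-up sample string; capturing, non_capturing, and text are illustrative names.

```python
import re

capturing = re.compile(r'[A-Z]{4} \d{3}((?![A-Z]{4}).){4,40} \(\d\)')
non_capturing = re.compile(r'[A-Z]{4} \d{3}(?:(?![A-Z]{4}).){4,40} \(\d\)')

text = "ACCT 220 Principles of Accounting I (3)"  # hypothetical sample

# With one capturing group, findall yields what the group captured
# (a single character here), not the full course string.
captured_only = capturing.findall(text)

# With no capturing groups, findall yields the full matches.
full_matches = non_capturing.findall(text)

print(captured_only)
print(full_matches)  # ['ACCT 220 Principles of Accounting I (3)']
```

If the capturing form is needed for some other reason, re.finditer with m.group(0) still gives access to the full match.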

Solution 1
4 Accepted 2014-01-02 06:52:03
