How do I return a list of strings that do not match a specific pattern?

Question

I am trying to return every entry in a text file that does not match a specific pattern, but I am struggling with the syntax.

The pattern is [A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}

I tried the following without success:

'^(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}$).*$'

r'^(?!([A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}).)*$'

Below is the code that finds entries matching the pattern; now I need to find all the entries that do not match.

import re

pattern = r'[A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}'

regex1 = re.compile(pattern, flags = re.IGNORECASE)

regex1.findall(text1)

The sample data looks like this:

plos_annotate5_1375_1.txt plos_annotate5_1375_2.txt plos_anno%tate5_1375_3.txt plos_annotate6_1032_1.txt

The third string is the one I want to pull out.

Answer 1

Why negate inside the regular expression when you can do the negation in Python?

strings_without_rx = [s for s in the_strings if not regex1.search(s)]

If you are scanning the lines of a file, you do not even need to store them all, because an open file is an iterable of its lines:

with open("some.file") as source:
  lines_without_rx = [s for s in source if not regex1.search(s)]
# Here the file is auto-closed.
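
Applied to the sample data from the question, this could look like the sketch below. It assumes the entries are separated by whitespace, so `the_strings` comes from `split()`. It also swaps `search` for `fullmatch`, so that an entry only counts as matching when the whole entry fits the pattern. Both are assumptions about the intent, not something the question states.

```python
import re

pattern = r'[A-Z]+_[A-Z0-9]+_[0-9]+_[0-9]+\.[A-Z]{3}'
regex1 = re.compile(pattern, flags=re.IGNORECASE)

text1 = ("plos_annotate5_1375_1.txt plos_annotate5_1375_2.txt "
         "plos_anno%tate5_1375_3.txt plos_annotate6_1032_1.txt")

# Assumption: entries are whitespace-separated.
the_strings = text1.split()

# Keep only the entries that do NOT fit the pattern as a whole.
strings_without_rx = [s for s in the_strings if not regex1.fullmatch(s)]

print(strings_without_rx)  # ['plos_anno%tate5_1375_3.txt']
```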

Answer 2

You can simply check whether the regular expression matches:

if regex1.match(text1) is None:
    # No match at the start of text1: do whatever you need here
    pass
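
One thing to keep in mind: `match` only anchors at the start of the string, so an entry with extra trailing text still counts as a match, while `fullmatch` requires the whole entry to fit. A small sketch of the difference, using hypothetical entries that are not part of the question's data:

```python
import re

regex1 = re.compile(r'[A-Z]+_[A-Z0-9]+_[0-9]+_[0-9]+\.[A-Z]{3}',
                    flags=re.IGNORECASE)

# Hypothetical entries, chosen only to show the anchoring behaviour.
trailing = "plos_annotate5_1375_1.txt.bak"   # valid prefix, extra suffix
leading = "%plos_annotate5_1375_1.txt"       # junk before a valid name

# match() anchors at the start only; fullmatch() anchors at both ends;
# search() scans for the pattern anywhere in the string.
trailing_match = regex1.match(trailing) is not None
trailing_full = regex1.fullmatch(trailing) is not None
leading_match = regex1.match(leading) is not None
leading_search = regex1.search(leading) is not None

print(trailing_match, trailing_full, leading_match, leading_search)
```

Which of the three is appropriate depends on whether a partially valid entry should be reported as non-matching.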

Answer 3

I suggest wrapping your pattern in a negative lookahead assertion:

r'(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}[^A-Za-z0-9_+\.-]+)'

If you use it with findall, it gives you all the non-matching patterns without any loop:

re.findall(r'(?![A-Z]+\_[A-Z0-9]+\_[0-9]+\_[0-9]+\.[A-Z]{3}[^A-Za-z0-9_+\.-]+)', text1)
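
One caveat: a pattern that consists only of a lookahead consumes no characters, so `findall` returns empty strings for it rather than the offending entries. A variant that also consumes the entry might look like the sketch below. It assumes the entries are whitespace-separated; the names `core` and `non_matching` are illustrative and do not come from the question.

```python
import re

# The question's pattern, without the unnecessary underscore escapes.
core = r'[A-Z]+_[A-Z0-9]+_[0-9]+_[0-9]+\.[A-Z]{3}'

# (?<!\S)        start of an entry (not preceded by a non-space character)
# (?!core(?!\S)) the entry as a whole must NOT fit the pattern
# \S+            consume the entry so findall has something to return
non_matching = re.compile(r'(?<!\S)(?!' + core + r'(?!\S))\S+',
                          flags=re.IGNORECASE)

text1 = ("plos_annotate5_1375_1.txt plos_annotate5_1375_2.txt "
         "plos_anno%tate5_1375_3.txt plos_annotate6_1032_1.txt")

result = non_matching.findall(text1)
print(result)  # ['plos_anno%tate5_1375_3.txt']
```

This avoids an explicit loop, at the cost of a pattern that is harder to read than filtering in Python.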