Matching a string arbitrarily split across multiple lines

Question

Is there a way in regex to match a string that is arbitrarily split over multiple lines? Say we have the following format in a file:

msgid "This is "
"an example string"
msgstr "..."

msgid "This is an example string"
msgstr "..."

msgid ""
"This is an " 
"example" 
" string"
msgstr "..."

msgid "This is " 
"an unmatching string" 
msgstr "..."

So we would like a pattern that matches all the example strings, i.e. matches the string regardless of how it is split across lines. Notice that we are after a specific string as shown in the sample, not just any string. So in this case we would like to match the string "This is an example string".

Of course we can easily concatenate the strings and then apply the match, but it got me wondering whether this is possible. I'm talking about Python regexes, but a general answer is OK.
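For comparison, the concatenate-then-match approach mentioned above could look roughly like the sketch below. The helper name iter_msgids and the SAMPLE text are illustrative only, and the sketch assumes the quoted fragments contain no escaped quotes.

```python
import re

# Sample text in the same shape as the file in the question (assumed layout).
SAMPLE = '''msgid "This is "
"an example string"
msgstr "..."

msgid ""
"This is an "
"example"
" string"
msgstr "..."

msgid "This is "
"an unmatching string"
msgstr "..."
'''

TARGET = "This is an example string"


def iter_msgids(text):
    """Yield each msgid with its quoted fragments joined into one string."""
    parts = None  # None while we are outside a msgid entry
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid"):
            # First fragment sits on the msgid line itself.
            parts = re.findall(r'"([^"]*)"', line)
        elif parts is not None and line.startswith('"'):
            # Continuation line: another quoted fragment of the same msgid.
            parts += re.findall(r'"([^"]*)"', line)
        elif parts is not None:
            # Any other line (e.g. msgstr) ends the current msgid.
            yield "".join(parts)
            parts = None
    if parts is not None:
        yield "".join(parts)


results = [msgid == TARGET for msgid in iter_msgids(SAMPLE)]
print(results)
```

Once the fragments are joined, a plain string comparison (or any ordinary regex) is enough, which is why the question treats the single-regex version as a curiosity.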

Answer 1

Do you want to match a series of words? If so, you could look for words with only whitespace (\s) in between, since \s matches newlines and spaces alike.

import re

search_for = "This is an example string"
# Join the words with \s+ so that any run of whitespace, including newlines, may separate them.
search_for_re = r"\b" + r"\s+".join(search_for.split()) + r"\b"
pattern = re.compile(search_for_re)

def match(s):
    # pattern.match only matches at the start of s
    return pattern.match(s) is not None

s = "This is an example string"
print(match(s), ":", repr(s))

s = "This is an \n example string"
print(match(s), ":", repr(s))

s = "This is \n an unmatching string"
print(match(s), ":", repr(s))

Prints:

True : 'This is an example string'
True : 'This is an \n example string'
False : 'This is \n an unmatching string'
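If the phrase has to be found anywhere inside a larger text, or might contain regex metacharacters, a small variation on the same idea may help. This is only a sketch: build_pattern is a hypothetical helper name, and it assumes the phrase starts and ends with a word character so that \b behaves as expected.

```python
import re


def build_pattern(phrase):
    """Compile a pattern matching the words of phrase separated by any whitespace."""
    # re.escape keeps characters such as '.' or '?' in a word from acting as regex syntax.
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b")


pattern = build_pattern("This is an example string")
text = "prefix: This is\n an   example\tstring, and more"

# search scans the whole text, whereas match only tries position 0.
found = pattern.search(text)
print(repr(found.group(0)))
print(pattern.match(text) is None)
```

Note that this still treats the text as plain words and whitespace; it does not account for the quote characters that surround each fragment in the file from the question.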

Answer 2

This is a bit tricky because of the need for quotes on every line and the allowance for empty ("") lines. Here's a regex that correctly matches the file you posted:

'(""\n)*"This(( "\n(""\n)*")|("\n(""\n)*" )| )is(( "\n(""\n)*")|("\n(""\n)*" )| )an(( "\n(""\n)*")|("\n(""\n)*" )| )example(( "\n(""\n)*")|("\n(""\n)*" )| )string"'

That's a bit confusing, but it is just the string you want to match, except that it starts with:

(""\n)*"

and replaces the space between each pair of words with:

(( "\n(""\n)*")|("\n(""\n)*" )| )

which checks for three different possibilities after each word: either "space, quote, newline, (any number of empty strings), quote", or that same sequence with the space moved to the end, or just a space.
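To see the three alternatives in isolation, the separator can be compiled on its own and tried against small inputs. This is just an illustrative check; the names SEP, sep_re, and cases are not part of the answer.

```python
import re

# The separator from the answer. In a normal (non-raw) Python string,
# "\n" is a real newline character, which the regex matches literally.
SEP = '(( "\n(""\n)*")|("\n(""\n)*" )| )'
sep_re = re.compile(SEP)

cases = {
    "space before the line break": ' "\n"',
    "space after the line break": '"\n" ',
    "plain space": " ",
    "one empty-string line in between": ' "\n""\n"',
    "line break with no space at all": '"\n"',
}

# fullmatch requires the whole input to be consumed by the separator.
outcome = {label: sep_re.fullmatch(text) is not None for label, text in cases.items()}
for label, ok in outcome.items():
    print(label, "->", ok)
```

The last case fails on purpose: a line break with no space on either side would glue two words together, so the separator rejects it.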

A much easier way to get this working would be to write a little function that takes the string you are trying to match and returns the regex that will match it:

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

So, if you had the file you posted in a string called filestring, you would get the matches like this:

import re

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

matcher = re.compile(getregex("This is an example string"))

for i in matcher.finditer(filestring):
    print(i.group(0), "\n")

>>> "This is "
    "an example string"

    "This is an example string"

    ""
    "This is an "
    "example"
    " string"

This regex doesn't take into account the space you have after "example" in the third msgid, but I assume the file is machine-generated and that space is a mistake.
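If the trailing spaces are not a mistake, one possible variation is to allow spaces or tabs between a closing quote and the newline. The sketch below is an assumption-laden variant, not the answer's own code: getregex_tolerant, BREAK, and SEP are hypothetical names, and re.escape is added so that punctuation in the target string is matched literally.

```python
import re

# A line break between two quoted fragments: closing quote, optional trailing
# spaces/tabs, newline, any number of empty-string lines, opening quote.
BREAK = r'"[ \t]*\n(?:""[ \t]*\n)*"'
# Between two words: space then break, break then space, or a plain space.
SEP = "(?: " + BREAK + "|" + BREAK + " | )"


def getregex_tolerant(string):
    """Build a regex for string that tolerates whitespace after closing quotes."""
    words = [re.escape(word) for word in string.split(" ")]
    return r'(?:""[ \t]*\n)*"' + SEP.join(words) + '"'


matcher = re.compile(getregex_tolerant("This is an example string"))

# Third msgid from the question, including the trailing spaces after the quotes.
filestring = 'msgid ""\n"This is an " \n"example" \n" string"\nmsgstr "..."\n'
hit = matcher.search(filestring)
print(hit.group(0))

# The unmatching entry is still rejected.
other = 'msgid "This is " \n"an unmatching string" \nmsgstr "..."\n'
print(matcher.search(other) is None)
```

Non-capturing groups are used here only to keep the match object uncluttered; the matching logic is otherwise the same three-way choice described above.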


Answer 1: score 4, accepted, 2012-05-05 06:42:54

Answer 2: score 0, 2012-05-05 07:30:22
