Matching a string that's arbitrarily split over multiple lines

Is there a way, using regular expressions, to match a string that is arbitrarily split over multiple lines? Say we have the following format in a file:

msgid "This is "
"an example string"
msgstr "..."

msgid "This is an example string"
msgstr "..."

msgid ""
"This is an " 
"example" 
" string"
msgstr "..."

msgid "This is " 
"an unmatching string" 
msgstr "..."

So we would like a pattern that matches all the example strings, i.e. one that matches the string regardless of how it is split across lines. Notice that we are after a specific string, as shown in the sample, not just any string. So in this case we would like to match the string "This is an example string".

Of course we can easily concatenate the strings and then apply the match, but it got me wondering whether this is possible with a regex alone. I'm talking about Python regexes, but a general answer is OK.
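For reference, the "concatenate, then match" baseline mentioned above could look roughly like the sketch below. The names PO_TEXT, ENTRY_RE, FRAGMENT_RE and msgids are made up for illustration, and the sketch assumes the fragments contain no escaped quotes.

```python
import re

# The sample file from the question, held in a string for illustration.
PO_TEXT = '''msgid "This is "
"an example string"
msgstr "..."

msgid "This is an example string"
msgstr "..."

msgid ""
"This is an "
"example"
" string"
msgstr "..."

msgid "This is "
"an unmatching string"
msgstr "..."
'''

# One msgid entry: the keyword followed by one or more quoted fragments,
# each on its own line (blanks after a closing quote are tolerated).
# Assumption: fragments never contain escaped quotes (\").
ENTRY_RE = re.compile(r'^msgid[ \t]+((?:"[^"\n]*"[ \t]*\n)+)', re.MULTILINE)
FRAGMENT_RE = re.compile(r'"([^"\n]*)"')


def msgids(text):
    """Yield each msgid with its quoted fragments joined into one string."""
    for entry in ENTRY_RE.finditer(text):
        yield "".join(FRAGMENT_RE.findall(entry.group(1)))


joined = list(msgids(PO_TEXT))
hits = [m for m in joined if m == "This is an example string"]
print(len(joined), len(hits))
```

Once the fragments are joined, an ordinary string comparison (or any regex) can be applied to each msgid without worrying about where the line breaks fell.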

Do you want to match a series of words? If so, you could look for the words with only whitespace (\s) in between, since \s matches newlines and spaces alike.

import re

search_for = "This is an example string"
# Allow any run of whitespace (spaces or newlines) between the words.
search_for_re = r"\b" + r"\s+".join(search_for.split()) + r"\b"
pattern = re.compile(search_for_re)
match = lambda s: pattern.match(s) is not None

s = "This is an example string"
print(match(s), ":", repr(s))

s = "This is an \n example string"
print(match(s), ":", repr(s))

s = "This is \n an unmatching string"
print(match(s), ":", repr(s))

Prints:

True : 'This is an example string'
True : 'This is an \n example string'
False : 'This is \n an unmatching string'
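Two caveats may be worth spelling out with a small sketch. pattern.match only matches at the start of the string, so finding the phrase inside a larger text needs search; and \s does not match the quote characters that surround each fragment in the file from the question. The helper name whitespace_tolerant is invented here, and re.escape is added on the assumption that the phrase might contain regex metacharacters.

```python
import re


def whitespace_tolerant(phrase):
    """Compile a pattern matching `phrase` with any whitespace run between words."""
    # re.escape guards against words that contain regex metacharacters.
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b")


pattern = whitespace_tolerant("This is an example string")

# search (unlike match) finds the phrase anywhere in the text.
plain = pattern.search("intro\nThis is an \n example string\noutro")

# The quotes around each fragment are not whitespace, so this does not match.
quoted = pattern.search('"This is "\n"an example string"')

print(plain is not None, quoted is not None)
```

So this approach suits text that is split only by whitespace; the quoted fragments in the question need the quotes handled explicitly.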

This is a bit tricky because of the quotes needed on every line and the allowance for empty ("") lines. Here's a regex that correctly matches the file you posted:

'(""\n)*"This(( "\n(""\n)*")|("\n(""\n)*" )| )is(( "\n(""\n)*")|("\n(""\n)*" )| )an(( "\n(""\n)*")|("\n(""\n)*" )| )example(( "\n(""\n)*")|("\n(""\n)*" )| )string"'

That's a bit confusing, but it is just the string you want to match, except that it starts with:

(""\n)*"

and replaces the space between each pair of words with:

(( "\n(""\n)*")|("\n(""\n)*" )| )

which checks for three different possibilities after each word: either "space, quote, newline, (any number of empty strings), quote", or that same sequence with the space moved to the end, or just a space.
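To make the three alternatives concrete, here is a small sketch that tests the separator on its own against a few hand-written gaps. The names SEPARATOR, sep, cases and results are illustrative only.

```python
import re

# The separator from the answer above; each \n in this literal is a real newline.
SEPARATOR = '(( "\n(""\n)*")|("\n(""\n)*" )| )'
sep = re.compile(SEPARATOR)

cases = {
    "space before the break": ' "\n"',
    "space after the break": '"\n" ',
    "plain space": ' ',
    "empty string in between": ' "\n""\n"',
    "no space at all": '"\n"',
}

# fullmatch checks that the whole gap, and nothing else, fits the separator.
results = {label: sep.fullmatch(text) is not None for label, text in cases.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

The last case is rejected on purpose: a line break with no space on either side would glue two words together, so it should not count as a gap between words.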

A much easier way to get this working is to write a little function that takes the string you are trying to match and returns the regex that will match it:

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

So, if you had the file you posted in a string called "filestring", you would get the matches like this:

import re

def getregex(string):
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'

matcher = re.compile(getregex("This is an example string"))

# filestring is assumed to hold the contents of the file from the question
for i in matcher.finditer(filestring):
    print(i.group(0), "\n")

>>> "This is "
    "an example string"

    "This is an example string"

    ""
    "This is an "
    "example"
    " string"

This regex doesn't take into account the space you have after "example" in the third msgid, but I assume the file is generated by a machine and that the space is a mistake.
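If such stray spaces do have to be tolerated, one possible variant is sketched below. getregex_tolerant is a hypothetical name, not part of the answer: it allows spaces or tabs between a closing quote and the newline, and it escapes each word with re.escape in case the phrase contains regex metacharacters.

```python
import re


def getregex(string):
    # Copied from the answer above, for comparison.
    return '(""\n)*"' + string.replace(" ", '(( "\n(""\n)*")|("\n(""\n)*" )| )') + '"'


def getregex_tolerant(string):
    # Hypothetical variant: allow blanks between a closing quote and the newline.
    line_end = r'"[ \t]*\n'          # closing quote, optional blanks, newline
    empties = r'(?:""[ \t]*\n)*'     # any number of empty "" lines
    gap = '(?: ' + line_end + empties + '"|' + line_end + empties + '" | )'
    words = [re.escape(word) for word in string.split(" ")]
    return empties + '"' + gap.join(words) + '"'


# The third msgid from the question, with a trailing space after two closing quotes.
sample = 'msgid ""\n"This is an " \n"example" \n" string"\nmsgstr "..."\n'

strict = re.search(getregex("This is an example string"), sample)
tolerant = re.search(getregex_tolerant("This is an example string"), sample)
print(strict is not None, tolerant is not None)
```

The structure is the same as in getregex; only the "quote, newline" step is loosened, so entries without trailing spaces still match as before.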
