
Regex for multi-line string?

I have the following input:

str = """

    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step. Execute everything.

    2. Then, do the second step

    3. And finally, do the last one



    Q: What is another way of achieving this?

    A: I am not sure. Try the following alternatives:

    1. Take this first step from before. Execute everything.

    2. Then, don't do the second step

    3. Do the last one and then execute the above step

"""

I want to capture the QA pairs in the input, but I have not been able to come up with a good regex for this. So far I have managed the following:

(?ms)^[\s#\-\*]*(?:Q)\s*:\s*(\S.*?\?)[\s#\-\*]+(?:A)\s*:\s*(\S.*)$

However, it captures the input as follows:

('Q', 'What is a good way of achieving this?')
('A', "I am not sure. Try the following:\n    1. Take this first step. Execute everything.\n    2. Then, do the second step\n    3. And finally, do the last one\n\n    Q: What is another way of achieving this?\n    A: I am not sure. Try the following alternatives:\n    1. Take this first step from before. Execute everything.\n    2. Then, don't do the second step\n    3. Do the last one and then execute the above step\n")

Notice how the second QA pair got captured by the first. If I make the answer group lazy by adding a ? at the end of the answer regex, it does not capture the enumerations. Any suggestions on how to solve this?
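The two behaviours described above can be reproduced with a short sketch. It assumes a compact version of the input (the variable names text, head, greedy and lazy are only for illustration) and swaps just the tail of the pattern:

```python
import re

# Compact version of the input from the question (assumed for illustration).
text = """
    Q: What is a good way of achieving this?
    A: I am not sure. Try the following:
    1. Take this first step. Execute everything.
    2. Then, do the second step

    Q: What is another way of achieving this?
    A: I am not sure. Try the following alternatives:
    1. Take this first step from before. Execute everything.
"""

# Shared front part of the pattern from the question.
head = r"(?ms)^[\s#\-\*]*(?:Q)\s*:\s*(\S.*?\?)[\s#\-\*]+(?:A)\s*:\s*"

# Greedy answer group: with the s flag, .* runs to the end of the string,
# so the second QA pair ends up inside the first answer.
greedy = re.search(head + r"(\S.*)$", text)

# Lazy answer group: with the m flag, $ already matches at the end of the
# first answer line, so the enumeration is cut off.
lazy = re.search(head + r"(\S.*?)$", text)

print(repr(greedy.group(2)))
print(repr(lazy.group(2)))
```

Neither tail stops at the right place: the answer has to end where the next question begins, or at the end of the string.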

A lazy way of solving it, though not the best one, is to explode the string by "Q:" and then parse the parts with a simple /Q:(.+)A:(.+)/msU (regex in general).
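A minimal Python sketch of this split-then-parse idea follows. The function name split_then_parse is hypothetical. Python has no equivalent of the PCRE U flag, so the lazy quantifier is written explicitly, and because str.split removes the "Q:" marker, the per-chunk pattern starts at the question text:

```python
import re

# Compact version of the input from the question (assumed for illustration).
text = """
    Q: What is a good way of achieving this?
    A: I am not sure. Try the following:
    1. Take this first step. Execute everything.
    2. Then, do the second step
    3. And finally, do the last one

    Q: What is another way of achieving this?
    A: I am not sure. Try the following alternatives:
    1. Take this first step from before. Execute everything.
    2. Then, don't do the second step
    3. Do the last one and then execute the above step
"""


def split_then_parse(raw):
    """Split on "Q:" and parse each chunk with a simple regex."""
    pairs = []
    for chunk in raw.split("Q:"):
        # Everything up to the first "A:" is the question, the rest the answer.
        m = re.match(r"(.+?)A:(.+)", chunk, re.S)
        if m:  # chunks without an "A:" marker are skipped
            pairs.append((m.group(1).strip(), m.group(2).strip()))
    return pairs


for question, answer in split_then_parse(text):
    print("Q:", question)
    print("A:", answer)
```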

Just using this works fine for me. It only requires trimming a bit of whitespace.

(?s)(Q):((?:(?!A:).)*)(A):((?:(?!Q:).)*)

Example of use:

>>> import re
>>> str = """
...
...     Q: What is a good way of achieving this?
...
...     A: I am not sure. Try the following:
...
...     1. Take this first step. Execute everything.
...
...     2. Then, do the second step
...
...     3. And finally, do the last one
...
...
...
...     Q: What is another way of achieving this?
...
...     A: I am not sure. Try the following alternatives:
...
...     1. Take this first step from before. Execute everything.
...
...     2. Then, don't do the second step
...
...     3. Do the last one and then execute the above step
...
... """
>>> regex = r"(?s)(Q):((?:(?!A:).)*)(A):((?:(?!Q:).)*)"
>>> match = re.findall(regex, str)
>>> list(map(lambda x: [part.strip().replace('\n', '') for part in x], match))
[['Q', 'What is a good way of achieving this?', 'A', 'I am not sure. Try the following:    1. Take this first step. Execute everything.    2. Then, do the second step    3. And finally, do the last one'], ['Q', 'What is another way of achieving this?', 'A', "I am not sure. Try the following alternatives:    1. Take this first step from before. Execute everything.    2. Then, don't do the second step    3. Do the last one and then execute the above step"]]

I even added a little something to help you clean up the whitespace at the end there.
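Removing every newline joins the numbered steps onto one line. If the line structure should be kept, one option is to strip each line separately. The sketch below assumes that goal. The helper names tidy and extract_pairs are hypothetical, and the capture groups around the Q and A markers are dropped, so findall returns plain (question, answer) tuples:

```python
import re

# Same tempered pattern as above, minus the groups around the markers.
QA_PATTERN = re.compile(r"(?s)Q:((?:(?!A:).)*)A:((?:(?!Q:).)*)")

# Compact version of the input from the question (assumed for illustration).
text = """
    Q: What is a good way of achieving this?
    A: I am not sure. Try the following:
    1. Take this first step. Execute everything.
    2. Then, do the second step
    3. And finally, do the last one

    Q: What is another way of achieving this?
    A: I am not sure. Try the following alternatives:
    1. Take this first step from before. Execute everything.
    2. Then, don't do the second step
    3. Do the last one and then execute the above step
"""


def tidy(block):
    """Strip each line and drop blank ones, keeping one step per line."""
    lines = (line.strip() for line in block.splitlines())
    return "\n".join(line for line in lines if line)


def extract_pairs(raw):
    """Return a list of (question, answer) tuples."""
    return [(tidy(q), tidy(a)) for q, a in QA_PATTERN.findall(raw)]


for question, answer in extract_pairs(text):
    print(question)
    print(answer)
    print("---")
```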

I'm not smart enough to write a huge regex (yet), so here is my non-regex solution:

>>> str = """

    Q: What is a good way of achieving this?

    A: I am not sure. Try the following:

    1. Take this first step. Execute everything.

    2. Then, do the second step

    3. And finally, do the last one



    Q: What is another way of achieving this?

    A: I am not sure. Try the following alternatives:

    1. Take this first step from before. Execute everything.

    2. Then, don't do the second step

    3. Do the last one and then execute the above step

"""
>>> qas = str.strip().split('Q:')
>>> clean_qas = list(map(lambda x: x.strip().split('A:'), filter(None, qas)))
>>> print(clean_qas)
[['What is a good way of achieving this?\n\n    ', ' I am not sure. Try the following:\n\n    1. Take this first step. Execute everything.\n\n    2. Then, do the second step\n\n    3. And finally, do the last one'], ['What is another way of achieving this?\n\n    ', " I am not sure. Try the following alternatives:\n\n    1. Take this first step from before. Execute everything.\n\n    2. Then, don't do the second step\n\n    3. Do the last one and then execute the above step"]]

You should clean up the whitespace though. Or you could do what Puciek said.

Just for fun:

>>> clean_qas = list(map(lambda x: list(map(lambda s: s.strip(), x.strip().split('A:'))), filter(None, qas)))
>>> print(clean_qas)
[['What is a good way of achieving this?', 'I am not sure. Try the following:\n\n    1. Take this first step. Execute everything.\n\n    2. Then, do the second step\n\n    3. And finally, do the last one'], ['What is another way of achieving this?', "I am not sure. Try the following alternatives:\n\n    1. Take this first step from before. Execute everything.\n\n    2. Then, don't do the second step\n\n    3. Do the last one and then execute the above step"]]

Looks ugly though.
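A possibly more readable take on the same non-regex idea is sketched below with str.partition instead of nested map calls. The function name parse_qas is hypothetical, and it assumes that "Q:" and "A:" never occur inside the question or answer text itself:

```python
# Compact version of the input from the question (assumed for illustration).
text = """
    Q: What is a good way of achieving this?
    A: I am not sure. Try the following:
    1. Take this first step. Execute everything.
    2. Then, do the second step
    3. And finally, do the last one

    Q: What is another way of achieving this?
    A: I am not sure. Try the following alternatives:
    1. Take this first step from before. Execute everything.
    2. Then, don't do the second step
    3. Do the last one and then execute the above step
"""


def parse_qas(raw):
    """Split on "Q:", then cut each chunk at its first "A:"."""
    pairs = []
    for chunk in raw.split("Q:"):
        question, sep, answer = chunk.partition("A:")
        if sep:  # sep is empty when the chunk has no "A:" marker
            pairs.append((question.strip(), answer.strip()))
    return pairs


for question, answer in parse_qas(text):
    print([question, answer])
```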

Slightly modifying your original solution:

(?ms)^[\s#\-\*]*(?:Q)\s*:\s+(\S[^\n\r]*\?)[\s#\-\*]+(?:A)\s*:\s+(\S.*?)\s*(?=\Z|^[\s#\-\*]*Q\s*:\s+)
  • The Q and A markers must be followed by at least one whitespace character after the colon.
  • Instead of matching questions non-greedily (which won't allow for multiple ?'s in one question), don't allow newlines in questions.
  • Instead of matching to the end of the string, match the answer non-greedily until it is followed either by the end of the string or by another question.

Use re.findall to get all question/answer matches.
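A minimal sketch of the re.findall call, on a compact version of the input. Two details of the lookahead are choices made in this sketch. The end of the string is written as \Z, because with the m flag a plain $ also matches at the end of every line. The check for the next question is anchored with ^, so the following match can start at the beginning of that line. The names QA_RE and pairs are only for illustration:

```python
import re

# Compact version of the input from the question (assumed for illustration).
text = """
    Q: What is a good way of achieving this?
    A: I am not sure. Try the following:
    1. Take this first step. Execute everything.
    2. Then, do the second step
    3. And finally, do the last one

    Q: What is another way of achieving this?
    A: I am not sure. Try the following alternatives:
    1. Take this first step from before. Execute everything.
    2. Then, don't do the second step
    3. Do the last one and then execute the above step
"""

QA_RE = re.compile(
    r"(?ms)^[\s#\-\*]*(?:Q)\s*:\s+(\S[^\n\r]*\?)"  # question: one line ending in ?
    r"[\s#\-\*]+(?:A)\s*:\s+(\S.*?)\s*"             # answer: lazy, may span lines
    r"(?=\Z|^[\s#\-\*]*Q\s*:\s+)"                   # stop at end of string or next Q
)

# With two capture groups, findall returns (question, answer) tuples.
pairs = QA_RE.findall(text)
for question, answer in pairs:
    print("Q:", question)
    print("A:", answer)
```

The answers keep their internal newlines and indentation, so a per-line strip may still be wanted afterwards.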
