
Trying to use regex findall for a dialogue

I have been stuck on this regex for quite some time now. I am pulling in chat data as a single variable and I am trying to break it up.

For example:

BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"

I am looking to pull out just the bot's messages, and separately just the user's messages.

I am using Python. I pulled the chat data and I want to break it up into two different sections: bot verbatims and user verbatims. I am trying to use regex to split them, and the one findall I used looks like this, but it comes back with empty results.

clean.str.findall(r'^BOT:.*USER:$')

My thinking was to clean up and drop the 'USER:' afterwards. I've tried multiple variations of this. Any insight into what I'm doing wrong would be a huge help! Also, I read the posting rules; if I did something wrong, let me know and I'll fix it.

Use the regex ^BOT:(.+)

import re

regex = r"^BOT:(.+)"

chat = """
BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"
"""

result = re.findall(regex, chat, re.MULTILINE)

print(result)

Output:

[' Ask me a question or select from the options below.', ' Would you like to sign up or cancel Auto Pay?']

Now you can easily iterate over the result:

for r in result:
    print(r)

Output:

 Ask me a question or select from the options below.
 Would you like to sign up or cancel Auto Pay?

If you want to remove the leading space in front of every result, either use regex = r"^BOT: (.+)" as your regex or call .strip() on every result.

Without the leading space (regex = r"^BOT: (.+)"):

Ask me a question or select from the options below.
Would you like to sign up or cancel Auto Pay?
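Since the question asks for the bot's messages "and vice versa", here is a minimal sketch that collects both speakers in one pass. It assumes every message sits on a single line that starts with BOT: or USER:, as in the sample; the names pairs and messages are only illustrative, and the stray trailing quote from the sample is left out.

```python
import re

chat = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

# With two capturing groups, findall returns (speaker, message) tuples.
pairs = re.findall(r"^(BOT|USER):[ \t]*(.*)$", chat, flags=re.MULTILINE)

# Sort each message into the list for its speaker.
messages = {"BOT": [], "USER": []}
for speaker, message in pairs:
    messages[speaker].append(message.strip())

print(messages["BOT"])
print(messages["USER"])
```

If a message can span several lines, this line-by-line approach would only pick up its first line.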

Edit


If you want the BOT: prefix in your results, just remove the parentheses from your regex.

regex = r"^BOT: .+"

Test it: https://regex101.com/r/aM1GEj/2
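As for why the pattern in the question comes back empty, my reading is that two things work against it: by default "." does not match a newline, so ".*" can never get from the BOT: line to the next USER:, and "USER:$" demands that the literal text USER: sits at the very end. A small sketch using plain re on a single string (the question uses a pandas-style .str.findall, so treat this as an approximation of that setup):

```python
import re

chat = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

pattern = r"^BOT:.*USER:$"

# Default flags: "." stops at the first newline, so ".*" never reaches "USER:".
no_flags = re.findall(pattern, chat)

# re.DOTALL lets "." cross newlines, but "USER:$" still requires the text
# to end with the literal "USER:", which it does not.
dotall = re.findall(pattern, chat, flags=re.DOTALL)

# Anchoring each line with re.MULTILINE and capturing the rest of it works.
fixed = re.findall(r"^BOT:[ \t]*(.+)$", chat, flags=re.MULTILINE)

print(no_flags)
print(dotall)
print(fixed)
```

Both attempts with the original pattern print an empty list; only the line-anchored version returns the two bot messages.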

import re

text = """BOT: Ask me a question or select from the options below.
USER: How do I sign up or cancel Auto Pay?
BOT: Would you like to sign up or cancel Auto Pay?
USER: Selected: Cancel Auto Pay"""

# Match each BOT line up to and including its trailing newline.
print(re.findall(r"BOT:.*\n", text))
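One caveat with this pattern, as far as I can tell: it needs a newline after each BOT line, so a bot message at the very end of the text with no trailing newline is skipped. A minimal sketch with a made-up transcript (the sample string is hypothetical) comparing it against a line-anchored variant:

```python
import re

# Hypothetical transcript in which the bot speaks last, with no trailing newline.
sample = "USER: Thanks!\nBOT: You're welcome."

# Requires a newline after the message, so the final line is missed.
needs_newline = re.findall(r"BOT:.*\n", sample)

# "$" with re.MULTILINE matches at the end of every line, including the last.
line_anchored = re.findall(r"^BOT:.*$", sample, flags=re.MULTILINE)

print(needs_newline)  # []
print(line_anchored)  # ["BOT: You're welcome."]
```

The line-anchored version also leaves the newline out of each result.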

Try this:

import re

text = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

print(re.findall(r'BOT:(.*?)(?:$|USER:)', text, flags=re.DOTALL))

Output is:

[' Ask me a question or select from the options below.\n\n', ' Would you like to sign up or cancel Auto Pay?\n\n']

Some considerations:

  • It fails if the user or the bot uses the keyword USER: or BOT: inside a message.
  • In the first group, (.*?), the ? makes the match non-greedy: it stops at the first point where the rest of the pattern can match, so each capture runs only up to the next USER:.
  • In the last group, ?: makes the group non-capturing, since you are not going to use its result; with a second capturing group, findall would return tuples instead of plain strings. The $ alternative in this group makes the regex work when the last message is from the bot.
  • flags=re.DOTALL lets . match newlines as well, so messages that span several lines are captured. If newlines are only used to separate messages, consider the other answers based on the \n character.
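The captures above keep the trailing newlines and consume the USER: marker. A possible variation, offered as a sketch rather than a drop-in replacement, trims the whitespace inside the pattern and uses a lookahead so the marker is only checked, not consumed. It still assumes the literal text USER: never appears inside a bot message.

```python
import re

text = """BOT: Ask me a question or select from the options below.

USER: How do I sign up or cancel Auto Pay?

BOT: Would you like to sign up or cancel Auto Pay?

USER: Selected: Cancel Auto Pay"""

# \s* on both sides of the group drops surrounding whitespace;
# (?=USER:|\Z) stops at the next USER: or at the end of the text.
bot_pattern = re.compile(r"BOT:\s*(.*?)\s*(?=USER:|\Z)", flags=re.DOTALL)

bot_messages = bot_pattern.findall(text)
print(bot_messages)

# A multi-line bot message, and a bot message at the very end, both survive.
multi = bot_pattern.findall("BOT: line one\nline two\nUSER: ok\nBOT: last")
print(multi)
```

Swapping BOT: and USER: in the pattern should give the user's side of the conversation in the same way.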


 