Splitting a string with Regex and maxsplit returns multiple splits

Question

I have a list of strings taken from chat log data, and I am trying to find the best way to split the speaker from the content of the speech. Two examples are as follows:

mystr = ['bob123 (5:09:49 PM): hi how are you', 
         'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']

Note that, while they are broadly similar, there are some stylistic differences I need to account for (inclusion of dates, period marks, extra spaces, etc.). I need a way to standardize and split such strings, and others like them, into something like the following list:

mystrList = [['bob123','hi how are you'],['jane_r16','What day is it today']]

Given that I do not need the times, numbers, or most punctuation, I thought a reasonable first step would be to remove anything non-essential. After doing so, I now have the following:

myCleanstr = ['bob(): hi how are you','janer() : What day is it today?']

Doing this leaves a fairly distinctive character sequence, ():, in each string that is unlikely to appear elsewhere in the same string. My next thought was to use this as a delimiter and split each string with a regex:

mystr_split = [re.split(r'\(\)( ){,2}:', i, maxsplit=1, flags=re.I) for i in myCleanstr]

Here, my intention was the following:

\(\) Find an opening parenthesis followed by a closing parenthesis
( ){,2} Then find zero, one, or two spaces
: Then find a colon

However, in both instances, I receive three objects per string. I get the correct speaker ID and the speech content. But for the first string I also get an additional NoneType object, and for the second string I get an additional string containing a single space.

I had assumed that including maxsplit=1 would mean that the process would end after the first split has been found, but this doesn't appear to be the case. Rather than filter my results down to the content I need, I would like to understand why it behaves as it does.

Answer 1

You can use

^(\S+)\s*\([^()]*\)\s*:\s*(.+)

Or, if the name can contain whitespace:

^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)

See the regex demo #1 and regex demo #2. The regex matches:

^ - start of string
(\S+) - Group 1: one or more non-whitespace chars
[^(]*? - (second pattern only) zero or more chars other than a ( char, as few as possible
\s* - zero or more whitespace chars
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
\s*:\s* - a colon enclosed with zero or more whitespace chars
(.+) - Group 2: one or more chars other than line break chars, as many as possible (the whole rest of the line).
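As a minimal sketch of the second pattern, the snippet below applies it to a display name that contains a space. The sample line 'Mary Ann (5:09 PM): hello' and the helper name split_speaker are made up for illustration and do not come from the original data.

```python
import re

# Second pattern from above: the speaker name may contain whitespace.
PATTERN = re.compile(r'^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)')

def split_speaker(line):
    """Return [speaker, content], or None if the line does not match."""
    m = PATTERN.search(line)
    return list(m.groups()) if m else None

print(split_speaker('Mary Ann (5:09 PM): hello'))  # hypothetical sample line
print(split_speaker('bob123 (5:09:49 PM): hi how are you'))
```

Returning None for lines that do not match leaves it to the caller to decide whether to skip or log them.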

See the Python demo:

import re
result = []
mystr = ['bob123 (5:09:49 PM): hi how are you', 'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
for s in mystr:
    # Group 1 is the speaker, group 2 is the message text
    m = re.search(r'^(\S+)\s*\([^()]*\)\s*:\s*(.+)', s)
    if m:
        result.append(list(m.groups()))
print(result)
# => [['bob123', 'hi how are you'], ['jane_r16', 'What day is it today?']]
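As for why the re.split call in the question returned three items: when the pattern contains a capturing group, re.split also places the text of that group in the result list, and a group that did not take part in the match is reported as None. So maxsplit=1 did limit the call to a single split; the extra item is the captured ( ) group. A minimal sketch, assuming a non-capturing group (?: ) is acceptable here, with \s* added after the colon to drop the leading space (an addition that is not in the original pattern):

```python
import re

myCleanstr = ['bob(): hi how are you', 'janer() : What day is it today?']

# Original pattern: the capturing group ( ) is inserted into the result.
with_group = [re.split(r'\(\)( ){,2}:', s, maxsplit=1) for s in myCleanstr]
print(with_group)

# Non-capturing group (?: ) keeps only the two pieces around the separator.
# The trailing \s* also swallows whitespace after the colon.
without_group = [re.split(r'\(\)(?: ){,2}:\s*', s, maxsplit=1) for s in myCleanstr]
print(without_group)
```

The first list reproduces the None and single-space items described in the question; the second contains exactly two items per string.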

Solution 1
1 accepted 2021-01-06 11:27:33
