I have a list of strings taken from chat log data, and I am trying to find the optimal method of splitting the speaker from the content of speech. Two examples are as follows:
mystr = ['bob123 (5:09:49 PM): hi how are you',
'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
Note that, while they are broadly similar, there are some stylistic differences I need to account for (inclusion of dates, period marks, extra spaces etc.). I require a way to standardize and split such strings, and others like these, into something like the following list:
mystrList = [['bob123','hi how are you'],['jane_r16','What day is it today']]
Given that I do not need the times, numbers, or most punctuation, I thought a reasonable first step would be to remove anything non-essential. After doing so, I now have the following:
myCleanstr = ['bob(): hi how are you','janer() : What day is it today?']
Doing this has given me a fairly distinctive sequence of characters per string that is unlikely to appear elsewhere in the same string. My subsequent thinking was to use this as a delimiter to split each string using regex:
mystr_split = [re.split(r'\(\)( ){,2}:', i, maxsplit=1, flags=re.I) for i in myCleanstr]
Here, my intention was the following:
\(\) - Find an opening parenthesis followed by a closing parenthesis
( ){,2} - Then find zero, one, or two whitespaces
: - Then find a colon symbol
However, in both instances, I receive three objects per string. I get the correct speaker ID and the speech content, but in the first string I get an additional NoneType object, and in the second string I get an additional string containing a single whitespace.
I had assumed that including maxsplit=1 would mean that the process would end after the first split has been found, but this doesn't appear to be the case. Rather than filter my results down to the content I need, I would like to understand why it is behaving as it is.
You can use
^(\S+)\s*\([^()]*\)\s*:\s*(.+)
Or, if the name can have whitespaces:
^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)
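As a quick check of this second pattern, here is a minimal sketch. The name 'Bob Smith' is a made-up sample that is not in the original data, and the variable names are only illustrative:

```python
import re

# Second pattern: Group 1 starts with a non-whitespace char and may contain spaces.
pattern = re.compile(r'^(\S[^(]*?)\s*\([^()]*\)\s*:\s*(.+)')

# Hypothetical input with a space inside the speaker name.
sample = 'Bob Smith (5:09:49 PM): hi how are you'

m = pattern.search(sample)
name_and_text = list(m.groups()) if m else None
print(name_and_text)
# => ['Bob Smith', 'hi how are you']
```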
See the regex demo #1 and regex demo #2. The regex matches:
^ - start of string
(\S+) - Group 1: any one or more non-whitespace chars
[^(]*? - (second pattern only) zero or more chars other than a ( char, as few as possible
\s* - zero or more whitespaces
\( - a ( char
[^()]* - zero or more chars other than ( and )
\) - a ) char
\s*:\s* - a colon enclosed with zero or more whitespaces
(.+) - Group 2: any one or more chars other than line break chars, as many as possible (the whole rest of the line).
See the Python demo:
import re

result = []
mystr = ['bob123 (5:09:49 PM): hi how are you', 'jane_r16 (12/01/2020 1:39:12 A.M.) : What day is it today?']
for s in mystr:
    # Group 1 is the speaker, Group 2 is the speech content
    m = re.search(r'^(\S+)\s*\([^()]*\)\s*:\s*(.+)', s)
    if m:
        result.append(list(m.groups()))
print(result)
# => [['bob123', 'hi how are you'], ['jane_r16', 'What day is it today?']]
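As for why the original re.split call returned three items: when the pattern contains a capturing group, re.split also puts the text of each group into the result list, and a group that did not take part in the match appears as None. So maxsplit=1 does limit the operation to one split; the extra item is the captured ( ) group. A minimal sketch using the cleaned strings from the question (the variable names and the trailing \s* in the second pattern are my own additions):

```python
import re

my_clean = ['bob(): hi how are you', 'janer() : What day is it today?']

# Capturing group ( ) in the pattern: its content is added to the output.
with_group = [re.split(r'\(\)( ){,2}:', s, maxsplit=1) for s in my_clean]
print(with_group)
# => [['bob', None, ' hi how are you'], ['janer', ' ', ' What day is it today?']]

# Same idea without a capturing group; \s* also drops the leading space.
without_group = [re.split(r'\(\) {,2}:\s*', s, maxsplit=1) for s in my_clean]
print(without_group)
# => [['bob', 'hi how are you'], ['janer', 'What day is it today?']]
```

If grouping is needed for some other reason, a non-capturing group such as (?: ){,2} should behave the same way as the plain quantified space here.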