I have to parse a log text file with regex in Python. This is an example of the file (named file):
20/01/18, 08:11 - Peter: Good morning
How are you?
20/01/18, 09:00 - Caroline: I am fine thanks. You?
20/01/18, 09:01 - Peter: Good
I had some problems few days ago.
Now I am happy
Are you working?
20/01/18, 09:02 - Caroline: No I have to go to the supermarket to buy vegetables
20/01/18, 09:12 - Peter: Nice!
Where are you now?
I tried to parse the whole text with this regular expression:
import re
import pandas as pd

with open(file, 'r', encoding='utf-8') as f:
    texts = re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', f.read())
df = pd.DataFrame(texts, columns=['data', 'name', 'text'])
However, I have problems matching messages that span one or more newlines (for example, Peter's text at 09:01). I also tried working on https://regex101.com/ to find a possible solution, but I didn't succeed.
Can you help me please?
If you want to match the following text until the next date at the beginning of a new line, you could use a negative lookahead to match all lines that don't start with a date-like pattern:
(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*(?:\r?\n(?!\d+/\d+/\d).*)*)
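As a quick check, the pattern can be run against part of the sample log from the question. This is only a minimal sketch: the names log, pattern and messages are made up for illustration, and the sample string deliberately has no trailing newline (with one, the last message would keep that newline, because the lookahead also succeeds at the end of the text).

```python
import re

# Sample taken from the question; layout assumed to be "dd/mm/yy, hh:mm - name: text".
log = """20/01/18, 08:11 - Peter: Good morning
How are you?
20/01/18, 09:00 - Caroline: I am fine thanks. You?
20/01/18, 09:01 - Peter: Good
I had some problems few days ago.
Now I am happy
Are you working?"""

pattern = re.compile(
    r"(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): "   # date, name
    r"(.*(?:\r?\n(?!\d+/\d+/\d).*)*)"        # text plus continuation lines
)

messages = pattern.findall(log)
for date, name, text in messages:
    print(date, name, repr(text))
```

Each tuple in messages holds the date, the name and the full text, so the list can be passed to pd.DataFrame exactly as in the question.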
About the last part (.*(?:\r?\n(?!\d+/\d+/\d).*)*)
(                     Capture group 3
  .*                  Match 0+ times any char except a newline
  (?:                 Non capturing group
    \r?\n             Match a newline
    (?!\d+/\d+/\d)    Assert what is on the right is not a date-like format
    .*                Match 0+ times any char except a newline
  )*                  Close non capturing group and repeat 0+ times
)                     Close group
By default, . will not match a newline. You need to use DOTALL mode to make it match newlines:
re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', f.read(), re.DOTALL)
It works:
>>> import re
>>> s="""
... 20/01/18, 09:01 - Peter: Good
...
... I had some problems few days ago.
...
... Now I am happy
...
... Are you working?"""
>>> re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', s, re.DOTALL)
[('20/01/18, 09:01', 'Peter', 'Good\n\nI had some problems few days ago.\n\nNow I am happy\n\nAre you working?')]
>>> _
This does not solve the whole problem, though: with DOTALL, the greedy (.*) matches the entire rest of the text, including every later message.
See @the-fourth-bird's answer for a real solution.
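To make the limitation concrete, here is a small sketch with two messages in one string (the variable names s and greedy are only illustrative). With DOTALL the final group runs to the end of the input, so only one tuple comes back and the second message ends up inside the first one's text.

```python
import re

# Two messages; the first one has a continuation line.
s = (
    "20/01/18, 09:01 - Peter: Good\n"
    "Are you working?\n"
    "20/01/18, 09:02 - Caroline: No"
)

greedy = re.findall(r"(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)", s, re.DOTALL)

print(len(greedy))    # number of messages found
print(greedy[0][2])   # text of the first (and only) match
```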
Another, more explicit way to handle it is to read the file line by line and check whether each line is a continuation or not.
import re

rx = re.compile(r'^(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)$')  # Note the ^.
texts = []
for line in input_file:  # Files iterate line by line.
    line = line.rstrip('\n')  # Drop the line ending; it is re-added when joining.
    new_match = rx.match(line)
    if new_match:
        texts.append(list(new_match.groups()))  # We want a (mutable) list.
    elif texts:
        # We have a continuation line; append it to the text of the last message.
        last = texts[-1]
        last[-1] += '\n' + line  # Update in place.
This may be easier to reason about.
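For trying this out without a real file, the loop can be wrapped in a function and fed an in-memory file. This is a sketch under assumptions: parse_log, LINE_RX and the sample text are hypothetical names and data chosen for illustration, and lines that appear before the first message header are simply skipped.

```python
import io
import re

LINE_RX = re.compile(r"^(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)$")

def parse_log(lines):
    """Group raw lines into [date, name, text] records (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        match = LINE_RX.match(line)
        if match:
            # A new message starts here.
            records.append(list(match.groups()))
        elif records:
            # Continuation line: attach it to the most recent message.
            records[-1][-1] += "\n" + line
    return records

# io.StringIO iterates line by line, just like an open text file.
sample = io.StringIO(
    "20/01/18, 09:01 - Peter: Good\n"
    "I had some problems few days ago.\n"
    "Now I am happy\n"
    "20/01/18, 09:02 - Caroline: No I have to go to the supermarket\n"
)
records = parse_log(sample)
for record in records:
    print(record)
```

The resulting list of lists has the same three-column shape as the findall result, so it can be handed to pd.DataFrame with the same columns argument.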