
Matching any character and/or an undefined number of newlines with regex in Python

I have to parse a log text file with a regex in Python. This is an example of the file contents (its path is stored in a variable named file):

20/01/18, 08:11 - Peter: Good morning

How are you?

Peter 20/01/18, 09:00 - Caroline: I am fine thanks. You?

20/01/18, 09:01 - Peter: Good

I had some problems few days ago.

Now I am happy

Are you working?

20/01/18, 09:02 - Caroline: No I have to go to the supermarket to buy vegetables

20/01/18, 09:12 - Peter: Nice!

Where are you now?

I tried to parse the whole text with this regular expression:

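import re
import pandas as pd

# Read the whole log at once; the with block closes the file afterwards.
with open(file, 'r', encoding='utf-8') as f:
    texts = re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', f.read())

df = pd.DataFrame(texts, columns=['data', 'name', 'text'])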

However, I have problems when a message spans one or more newlines (for example, Peter's text at 09:01): only the first line is captured. I also tried to work it out on https://regex101.com/ but I didn't find a solution.

Can you help me please?

If you want to match the following text until the next date at the beginning of a new line, you could use a negative lookahead that matches all lines that don't start with a date-like pattern:

(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*(?:\r?\n(?!\d+/\d+/\d).*)*)

About the last part (.*(?:\r?\n(?!\d+/\d+/\d).*)*)

  • ( Capture group 3
    • .* Match 0+ times any char except a newline
    • (?: Non capturing group
      • \r?\n Match a new line
      • (?!\d+/\d+/\d).* Assert that what is on the right is not a date-like format, then match the rest of the line
    • )* Close non capturing group and repeat 0+ times
  • ) Close group
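As a quick check, the pattern above can be run against a shortened copy of the log from the question. This is only a sketch: the log string is an abbreviated sample, and the variable names and the strip() call that trims the trailing blank lines are assumptions added for illustration.

```python
import re

# Shortened sample of the log from the question (an assumption for this demo).
log = (
    "20/01/18, 08:11 - Peter: Good morning\n"
    "\n"
    "How are you?\n"
    "\n"
    "20/01/18, 09:00 - Caroline: I am fine thanks. You?\n"
    "\n"
    "20/01/18, 09:01 - Peter: Good\n"
    "\n"
    "I had some problems few days ago.\n"
)

pattern = re.compile(
    r"(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): "   # date and time, then the name
    r"(.*(?:\r?\n(?!\d+/\d+/\d).*)*)"        # first line plus continuation lines
)

# strip() drops the blank lines that separate one message from the next.
messages = [(date, name, text.strip()) for date, name, text in pattern.findall(log)]

for message in messages:
    print(message)
```

The resulting list of tuples can be passed to pd.DataFrame exactly as in the question. One caveat: a continuation line that itself begins with something date-like (digits, slash, digits, slash, digit) would be treated as the start of a new message.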


By default, . will not match a newline. You need to use DOTALL mode to make it match newlines:

re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', f.read(), re.DOTALL)

It works:

>>> import re
>>> s="""
... 20/01/18, 09:01 - Peter: Good
... 
... I had some problems few days ago.
... 
... Now I am happy
... 
... Are you working?"""
>>> re.findall(r'(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)', s, re.DOTALL)
[('20/01/18, 09:01', 'Peter', 'Good\n\nI had some problems few days ago.\n\nNow I am happy\n\nAre you working?')]
>>> _

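This does not solve the whole problem, though! With DOTALL the greedy (.*) runs to the end of the input, so on the full file every later message ends up inside the text of the first one.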
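To see that limitation concretely, here is a small check on a two-message sample. The sample string is a shortened, made-up excerpt of the log, so treat it as an illustration only.

```python
import re

# Two messages; the first one has a continuation line.
s = (
    "20/01/18, 09:01 - Peter: Good\n"
    "Are you working?\n"
    "20/01/18, 09:02 - Caroline: No I have to go to the supermarket\n"
)

matches = re.findall(r"(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)", s, re.DOTALL)

# Only one tuple comes back: with DOTALL the greedy (.*) consumes everything
# up to the end of the string, including Caroline's whole message.
print(len(matches))
print("Caroline" in matches[0][2])
```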

See @the-fourth-bird's answer for a real solution.

Another, more explicit way to handle it is to read the file line by line and check whether each line starts a new message or continues the previous one.

import re

rx = re.compile(r'^(\d+/\d+/\d+, \d+:\d+\d+) - (.+?): (.*)$')  # Note the ^.
texts = []
with open(file, 'r', encoding='utf-8') as input_file:
    for line in input_file:  # Files iterate line by line.
        new_match = rx.match(line)
        if new_match:
            texts.append(list(new_match.groups()))  # A list, so it can be updated.
        elif texts:
            # Continuation line: append it to the text of the last message.
            # (.*)$ leaves out the line's trailing newline, so add it back here.
            texts[-1][-1] += '\n' + line.rstrip('\n')

This may be easier to reason about.
