I'm trying to parse some conversations from an app in a .txt file with Python's re module. The pattern works on regex101 when used on a sample of the file, but it doesn't work properly when I open the file and actually try to parse it.
The structure of the txt file is dd/mm/yyyy hh:mm - Message Author: message text\n, and I'm trying to get only the Name: message\n parts. I'm using the pattern (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$). My code looks more or less like the following:
buffer = open(file, 'r', encoding = 'UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)
As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:
20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.
Lookarounds are "expensive". Better match what you want and capture the interesting parts.
That said, you might get along with a simpler expression:
^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)
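See a demo on regex101.com. Note that ^ only matches at the start of every line when the multiline flag is set (re.M in Python).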
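A minimal sketch of how that expression could be used from Python, assuming the two sample lines from the question as input. The variable names are illustrative, and re.MULTILINE is added so that ^ matches at the start of each line:

```python
import re

# Sample input taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE lets ^ match at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(
    r'^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)',
    re.MULTILINE,
)

# Collect (person, text) pairs from the named groups.
messages = [(m.group('person'), m.group('text')) for m in pattern.finditer(sample)]

for person, text in messages:
    print(f"{person}: {text}")
```

Only the named groups are read here, so the date and time prefix is matched but never ends up in the result.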
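KISS: remove $. Without re.M it matches only at the end of the string (or just before a trailing newline at the very end), not at the end of each line. You need to match the end of every line, and re.M could help here, but removing $ is simply simpler: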
(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
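The difference can be checked on the sample from the question. This is a small sketch, with the shared lookbehind kept in a helper variable (prefix) that is only there for illustration:

```python
import re

# Sample input taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# Fixed-width lookbehind for the "dd/mm/yyyy hh:mm - " prefix.
prefix = r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)'

# Without re.M, $ matches only at the end of the string (or just before
# a final newline), so at most the last line can match.
with_dollar = re.findall(prefix + r'(.*:\s)(.*$)', sample)

# With re.M, $ also matches before every newline.
with_dollar_multiline = re.findall(prefix + r'(.*:\s)(.*$)', sample, flags=re.M)

# Without $, .* simply stops at the newline, because . does not
# match "\n" unless re.S is set.
without_dollar = re.findall(prefix + r'(.*:\s)(.*)', sample)

print(len(with_dollar), len(with_dollar_multiline), len(without_dollar))
```

On this sample the first call finds only the last line, while the other two find both lines.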
BUT even "kiss"er: you need neither the lookbehind nor the escaped slashes, because re.findall returns only the captured strings when the expression contains capturing groups.
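To illustrate how re.findall treats capturing groups, here is a short sketch on a single sample line from the question. The two patterns are the expression above with the lookbehind turned into a plain prefix, once without and once with groups:

```python
import re

line = "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"

# No capturing group: findall returns the whole match, date prefix included.
whole = re.findall(r'\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s-\s.*:\s.*', line)

# Two capturing groups: findall returns one (group 1, group 2) tuple per match,
# and the uncaptured date prefix is dropped from the result.
groups = re.findall(r'\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s-\s(.*:\s)(.*)', line)

print(whole)
print(groups)
```

Since the prefix is consumed but not captured, it does the job of the lookbehind without the fixed-width restriction that Python places on lookbehinds.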
Use
import re
pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
with open(file, 'r', encoding='UTF-8') as f:
    buffer = f.read()  # read the whole file into one string
matches = [match.groupdict() for match in pattern.finditer(buffer)]
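One caveat that may be worth checking against real data: because (?P<name>.*) is greedy, a message that itself contains a colon pushes the name group up to the last colon on the line. The sketch below compares the pattern with a hypothetical variant (not part of the answer) that limits the name to characters other than ':'. The test line is made up for illustration.

```python
import re

# Pattern from the answer: greedy name group.
greedy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)'
)

# Hypothetical variant: [^:\n]* stops the name at the first colon on the line.
strict = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>[^:\n]*):\s*(?P<message>.*)'
)

# Made-up line whose message contains a second colon.
line = "20/04/2021 09:54 - Person 1: note: this has a second colon\n"

greedy_result = greedy.search(line).groupdict()
strict_result = strict.search(line).groupdict()

print(greedy_result)
print(strict_result)
```

The variant assumes author names never contain a colon. If they can, neither pattern is able to tell the name from the message reliably.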
EXPLANATION
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
/ '/'
--------------------------------------------------------------------------------
\d{4} digits (0-9) (4 times)
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\d{2} digits (0-9) (2 times)
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
- '-'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
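The same breakdown can also be kept next to the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of that option, using the sample lines from the question. The comments on the two named groups are my own reading of the pattern:

```python
import re

pattern = re.compile(r'''
    \b                  # word boundary
    \d{2}/\d{2}/\d{4}   # date: dd/mm/yyyy
    \s*                 # optional whitespace
    \d{2}:\d{2}         # time: hh:mm
    \s*-\s*             # dash separator with optional whitespace
    (?P<name>.*)        # author name (greedy, up to the last colon on the line)
    :\s*                # colon after the name, optional whitespace
    (?P<message>.*)     # message text up to the end of the line
''', re.VERBOSE)

# Sample input taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

matches = [m.groupdict() for m in pattern.finditer(sample)]
print(matches)
```

This works here because the pattern contains no literal spaces. In verbose mode a literal space would have to be written as \s, "\ " or [ ].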