Python regex findall returning empty list after parsing a text file

I'm trying to parse some conversations from an app, stored in a .txt file, with Python's re module. The pattern works on regex101 when I test it on a sample of the file, but it doesn't work properly when I open the file and actually try to parse it.

The structure of the txt file is dd/mm/yyyy hh:mm - Message Author: message text\n , and I'm trying to get only the Name: message \n parts. I'm using the following pattern (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$) . My code is looking more or less like the following:

buffer = open(file, 'r', encoding = 'UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)

As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:

20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.

Lookarounds are "expensive". It is better to match the whole line and capture only the interesting parts.
That said, you might get by with a simpler expression:

^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)

See a demo on regex101.com.
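In Python, the ^ anchor in this expression matches at the start of every line only when re.MULTILINE (re.M) is passed. A minimal sketch, assuming the two sample lines from the question as input (the variable names are illustrative, not from the original answer):

```python
import re

# Sample lines taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
line_re = re.compile(
    r'^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)',
    re.MULTILINE,
)

# Collect (person, text) pairs from the named groups.
parsed = [(m.group('person'), m.group('text')) for m in line_re.finditer(sample)]

for person, text in parsed:
    print(f'{person} -> {text}')
```

Without the flag, ^ would only match at the very beginning of the file contents, so at most the first line could be found.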

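KISS: remove $. Without the re.M flag it matches only at the end of the string, not at the end of each line. You need to match at the ends of lines, so re.M would help here, but removing $ is simpler still: . does not match a newline, so .* stops at the end of the line anyway.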

(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
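The effect of $ can be checked on the sample from the question. This is a small sketch, assuming the sample ends with a single newline; the variable names are only for illustration:

```python
import re

# Sample lines taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# Fixed-width lookbehind for the "dd/mm/yyyy hh:mm - " prefix, as in the question.
prefix = r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)'

# 1) Original pattern: $ only matches at the end of the whole string.
with_dollar = re.findall(prefix + r'(.*:\s)(.*$)', sample)

# 2) Same pattern with re.M: $ now matches at the end of every line.
with_dollar_multiline = re.findall(prefix + r'(.*:\s)(.*$)', sample, flags=re.M)

# 3) Pattern without $: .* stops at the newline by itself.
without_dollar = re.findall(prefix + r'(.*:\s)(.*)', sample)

print(len(with_dollar), len(with_dollar_multiline), len(without_dollar))
```

On this sample the first call finds only the last line, while the other two find both lines.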

But even more "KISS": you do not need the lookbehind, because re.findall returns only the captured strings when the expression contains capturing groups, and you do not need to escape the slashes in Python either.

Use

import re

pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
with open(file, 'r', encoding='UTF-8') as f:
    buffer = f.read()  # read the whole file into one string
matches = [match.groupdict() for match in pattern.finditer(buffer)]  # one dict per message line


EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
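The pattern explained above can be tried on the sample lines from the question. The sketch below adds one hypothetical third line (not from the question) whose message itself contains a colon, to show how the greedy (?P<name>.*) behaves; the lazy variant with .*? is an assumption about what is usually wanted, not part of the original answer:

```python
import re

sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
    # Hypothetical extra line with a colon inside the message text.
    "20/04/2021 09:55 - Person 1: note: colons in the text\n"
)

# Pattern from the answer: the name group is greedy.
greedy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)'
)

# Assumed variant: a lazy name group stops at the first colon after the prefix.
lazy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*?):\s*(?P<message>.*)'
)

greedy_matches = [m.groupdict() for m in greedy.finditer(sample)]
lazy_matches = [m.groupdict() for m in lazy.finditer(sample)]

for g, l in zip(greedy_matches, lazy_matches):
    print(g['name'], '|', l['name'])
```

Both patterns agree on the two original lines. On the extra line the greedy version splits at the last colon, so part of the message ends up in the name.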
