Python regex findall returns an empty list after parsing a text file

Question

I'm trying to parse some conversations from an app in a .txt file with Python's re module. The pattern works on regex101 when I test it against a sample of the file, but it doesn't work properly when I open the file and actually try to parse it.

The structure of the txt file is dd/mm/yyyy hh:mm - Message Author: message text\n, and I'm trying to get only the Name: message\n parts. I'm using the following pattern: (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$). My code looks more or less like the following:

import re

buffer = open(file, 'r', encoding='UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)

As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:

20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.

Answer 1

Lookarounds are "expensive". It is better to match what you want and capture the interesting parts.
That said, you might get along with a simpler expression (applied with the re.M flag, so that ^ matches at the start of every line):

^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)

See a demo on regex101.com.
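A minimal sketch of how this expression could be used from Python, run against the two sample lines from the question. The variable names (sample, messages) are illustrative assumptions, not part of the original answer.

```python
import re

# Sample lines copied from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE (re.M) makes ^ match at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(
    r'^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)',
    re.MULTILINE,
)

# Collect (person, text) pairs from the named groups.
messages = [(m.group('person'), m.group('text')) for m in pattern.finditer(sample)]

for person, text in messages:
    print(f'{person}: {text}')
```

The named groups keep the author and the message text separate, so no lookbehind is needed to skip the timestamp.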

Answer 2

KISS: remove $. Without re.M it matches only at the end of the string (or just before a trailing newline), whereas you need to match the end of each line. re.M could be helpful here, but removing $ is simply simpler: since . does not match a newline, .* already stops at the end of each line.

(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
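The effect of $ can be checked on the sample from the question. The sketch below compares the original pattern, the same pattern with re.M, and the pattern without $. On this two-line sample the original pattern still matches the final line. A completely empty result on the real file would be consistent with its last line not being a message line, but that is an assumption, since the file itself is not shown.

```python
import re

# Sample lines copied from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# Fixed-width lookbehind for the "dd/mm/yyyy hh:mm - " prefix, as in the question.
prefix = r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)'

# Original pattern: $ matches only at the end of the whole string,
# so every line except the last one is rejected.
with_dollar = re.findall(prefix + r'(.*:\s)(.*$)', sample)

# Same pattern with re.M: $ now matches at the end of every line.
with_dollar_multiline = re.findall(prefix + r'(.*:\s)(.*$)', sample, re.M)

# Pattern without $: .* stops at the newline by itself.
without_dollar = re.findall(prefix + r'(.*:\s)(.*)', sample)

print(len(with_dollar), len(with_dollar_multiline), len(without_dollar))
```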

BUT even more "KISS": you do not need the lookbehind or the escaped slashes, because re.findall returns the captured strings if the expression contains capturing groups.

Use

import re

pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
with open(file, 'r', encoding='UTF-8') as buffer:
    # Read the whole file and build one {'name': ..., 'message': ...} dict per match
    matches = [match.groupdict() for match in pattern.finditer(buffer.read())]

Regex proof | Python code

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
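
The pattern ends with two named groups, (?P<name>.*) and (?P<message>.*). Because the first .* is greedy, the name group extends to the last colon on the line, which matters when the message text itself contains a colon. The sketch below illustrates this with a made-up message line. The lazy variant (.*?) is a hypothetical adjustment and not part of the original answer.

```python
import re

# Pattern from the answer: greedy .* in the name group.
greedy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)'
)

# Hypothetical variant (an assumption, not from the answer): lazy .*? in the
# name group, so the name stops at the first colon after the dash.
lazy = re.compile(
    r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*?):\s*(?P<message>.*)'
)

# Made-up line whose message text contains a colon.
line = '20/04/2021 09:54 - Person 1: see you at 10:30'

greedy_result = greedy.search(line).groupdict()
lazy_result = lazy.search(line).groupdict()

print(greedy_result)
print(lazy_result)
```

For the two sample lines in the question, which contain no colon in the message text, both variants give the same result.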
