
Python regex findall returning empty list after parsing a text file

I'm trying to parse some conversations from an app, stored in a .txt file, with Python's re module. The pattern works on regex101 when tested on a sample of the file, but it doesn't work properly when I open the file and actually try to parse it.

The structure of the txt file is dd/mm/yyyy hh:mm - Message Author: message text\n , and I'm trying to get only the Name: message \n parts. I'm using the following pattern: (?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$) . My code looks more or less like the following:

import re

buffer = open(file, 'r', encoding='UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)

As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:

20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.

Lookarounds are "expensive". It is better to match the whole line and capture the interesting parts.
That said, you might get along with a simpler expression:

^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)
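As a rough sketch of how this expression might be used from Python: the snippet below applies it to the two sample lines from the question. The variable names `sample` and `messages` are illustrative, and the use of `re.MULTILINE` is an assumption about how the demo was configured (without it, `^` would only match at the start of the whole string).

```python
import re

# Sample lines taken from the question.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE lets ^ match at the start of every line, not only the string.
pattern = re.compile(
    r"^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)",
    re.MULTILINE,
)

# Collect (author, message) pairs from the named groups.
messages = [(m["person"], m["text"]) for m in pattern.finditer(sample)]
print(messages)
```

Because `[^:]+` stops at the first colon after the dash, a colon inside the message text stays in the `text` group.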

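See a demo on regex101.com. Note that in Python ^ matches at the start of every line only when the pattern is used with re.M; without that flag it matches only at the start of the string.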

KISS: remove $ . Without flags it matches only at the end of the string (or just before a final newline), whereas you need to match at the end of each line. re.M could be helpful here, but removing $ is simpler.

(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
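The effect of `$` can be checked with a small experiment on the question's sample. This is only a sketch: the variable names are made up for illustration, and the extra trailing blank line is an assumption about what the real file might look like, not something the question states.

```python
import re

lines = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

prefix = r"(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)"
with_dollar = prefix + r"(.*:\s)(.*$)"
without_dollar = prefix + r"(.*:\s)(.*)"

# Without re.MULTILINE, $ matches only at the very end of the string
# (or just before one final newline), so at most the last line can match.
last_only = re.findall(with_dollar, lines)

# Assumption: the file ends with an extra blank line. Now no line matches.
nothing = re.findall(with_dollar, lines + "\n")

# Either fix gives one tuple per line: add re.MULTILINE, or drop the $.
every_line_flag = re.findall(with_dollar, lines, flags=re.MULTILINE)
every_line_plain = re.findall(without_dollar, lines)

print(last_only, nothing, every_line_flag, every_line_plain, sep="\n")
```

Since `.` does not match a newline by default, `.*` already stops at the end of each line, which is why dropping `$` is enough.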

But even "KISS"-er: you need neither the lookbehind nor the escapes on the slashes, because re.findall returns only the captured strings if you use capturing groups in the expression.
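To illustrate how `re.findall` treats capturing groups, here is a minimal sketch on the question's sample. The `prefix` pattern and the variable names are illustrative only; the date and time are matched normally rather than in a lookbehind.

```python
import re

sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# The date/time prefix is matched but never captured.
prefix = r"\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s-\s"

# No groups: findall returns the whole match, date included.
whole = re.findall(prefix + r".*", sample)

# One group: findall returns a list of strings holding just that group.
one_group = re.findall(prefix + r"(.*)", sample)

# Two groups: findall returns a list of tuples, one item per group.
two_groups = re.findall(prefix + r"(.*):\s(.*)", sample)

print(whole, one_group, two_groups, sep="\n")
```

In the last two calls the prefix never appears in the result, so the lookbehind adds nothing here.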

Use

import re

pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
with open(file, 'r', encoding='UTF-8') as buffer:
    # Read the whole file and collect one dict per matched line.
    matches = [match.groupdict() for match in pattern.finditer(buffer.read())]
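For a self-contained run of this approach, the sketch below writes the question's two sample lines to a temporary file and parses them back. The temporary file and the names `sample`, `path` and `handle` are assumptions made for the example; in real use the path of the exported chat file would take their place.

```python
import os
import re
import tempfile

pattern = re.compile(
    r"\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)"
)

sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# Write the sample to a throwaway file so the example needs no external data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="UTF-8", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Read the file contents, then build one dict per matched line.
    with open(path, "r", encoding="UTF-8") as handle:
        matches = [m.groupdict() for m in pattern.finditer(handle.read())]
finally:
    os.remove(path)

print(matches)
```

One caveat: `(?P<name>.*)` is greedy, so if a message itself contains a colon, the name group extends up to the last colon on the line. Replacing it with something like `(?P<name>[^:]*)` is one possible way to stop at the first colon.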

Regex proof | Python code

EXPLANATION

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
