Python regex findall returning empty list after parsing a text file
I'm trying to parse some conversations exported from an app into a .txt file with Python's re module. The pattern works on regex101 when I test it against a sample of the file, but it doesn't work when I open the file and actually try to parse it.
The structure of the txt file is
dd/mm/yyyy hh:mm - Message Author: message text\n
and I'm trying to get only the Name: message \n parts. I'm using the following pattern:
(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:.*$)
My code looks more or less like the following:
buffer = open(file, 'r', encoding = 'UTF-8').read()
pattern = re.compile(r'(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*$)')
matches = re.findall(pattern, buffer)
As the title says, though, findall returns an empty list, and I don't know why. The following sample works as expected on regex101:
20/04/2021 09:54 - Person 1: this is an example text. Will it match?
20/04/2021 09:54 - Person 2: I think it does.
Lookarounds are "expensive". It is better to match what you want and capture the interesting parts.
That said, you might get along with a simpler expression:
^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)
See a demo on regex101.com.
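As a minimal sketch of how the simpler expression could be used from Python: the sample text below is made up in the question's format, and passing re.MULTILINE is an assumption on my part so that ^ anchors at the start of every line rather than only at the start of the whole string.

```python
import re

# Hypothetical sample in the question's format (not the asker's real file).
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# re.MULTILINE (assumed here) lets ^ match at the start of each line.
pattern = re.compile(
    r'^\d+[^-]+-\s+(?P<person>[^:]+):\s+(?P<text>.+)',
    re.MULTILINE,
)

# One dict per matched line, keyed by the named groups.
messages = [m.groupdict() for m in pattern.finditer(sample)]

for message in messages:
    print(message['person'], '->', message['text'])
```

Named groups keep the result readable: each entry is a dict with the keys person and text, so there is no need to remember tuple positions.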
KISS: remove $. Without flags, it matches only at the end of the string (or just before a trailing newline there). You need to match the end of each line, and re.M could be helpful here. But removing $ is simply simpler.
(?<=\d{2}\/\d{2}\/\d{4}\s\d{2}:\d{2}\s\-\s)(.*:\s)(.*)
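To illustrate the effect of $, here is a small sketch. The sample is hypothetical and deliberately ends with a blank line, which is one plausible way (an assumption, since the real file is not shown) for the original pattern to return nothing at all. The redundant escapes before / and - are dropped, which does not change what the pattern matches.

```python
import re

# Hypothetical sample in the question's format, ending with a blank line.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
    "\n"
)

lookbehind = r'(?<=\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s-\s)'

# Default mode: $ matches only at the end of the string (or just before a
# final newline), and .* cannot cross a newline to reach it.
end_of_string = re.findall(lookbehind + r'(.*:\s)(.*$)', sample)

# re.M: $ also matches just before every newline.
end_of_line = re.findall(lookbehind + r'(.*:\s)(.*$)', sample, flags=re.M)

# No $ at all: the greedy .* already stops at the end of each line.
no_anchor = re.findall(lookbehind + r'(.*:\s)(.*)', sample)

print(end_of_string)
print(len(end_of_line), len(no_anchor))
```

With this sample, the first call finds nothing, while the other two find both messages.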
But even KISS-er: you do not need the lookbehind or the escaped slashes, because re.findall returns the captured strings if you use capturing groups in the expression.
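A short sketch of that behaviour, using a made-up sample in the question's format: with two capturing groups, re.findall returns one tuple of captured strings per match, so the timestamp can be matched normally and simply left out of the groups.

```python
import re

# Hypothetical sample in the question's format.
sample = (
    "20/04/2021 09:54 - Person 1: this is an example text. Will it match?\n"
    "20/04/2021 09:54 - Person 2: I think it does.\n"
)

# The timestamp is matched but not captured; only the two groups are returned.
with_groups = re.findall(r'\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s-\s(.*:\s)(.*)', sample)

# Without any capturing group, findall returns the whole matches instead.
without_groups = re.findall(r'\d{2}/\d{2}/\d{4}', sample)

print(with_groups)
print(without_groups)
```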
Use
import re

pattern = re.compile(r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*(?P<name>.*):\s*(?P<message>.*)')
with open(file, 'r', encoding='UTF-8') as buffer:
    # Read the whole file and build one dict per matched line
    matches = [match.groupdict() for match in pattern.finditer(buffer.read())]
Regex proof | Python code
EXPLANATION
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
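One caveat about the name group in the pattern explained above: because (?P<name>.*) is greedy, a message that itself contains a colon pulls part of the message into the name. The sketch below compares it with a variant that uses [^:]* for the name. That variant is my own suggestion, not part of the answer, and the sample line is made up.

```python
import re

# Hypothetical line whose message text contains a colon.
line = "20/04/2021 09:54 - Person 1: note: remember this"

prefix = r'\b\d{2}/\d{2}/\d{4}\s*\d{2}:\d{2}\s*-\s*'

# Pattern from the answer: the greedy .* backtracks to the LAST colon.
greedy = re.compile(prefix + r'(?P<name>.*):\s*(?P<message>.*)')

# Suggested variant (an assumption): the name stops at the FIRST colon.
strict = re.compile(prefix + r'(?P<name>[^:]*):\s*(?P<message>.*)')

greedy_result = greedy.search(line).groupdict()
strict_result = strict.search(line).groupdict()

print(greedy_result)
print(strict_result)
```

Which variant is right depends on the data: [^:]* assumes author names never contain a colon.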