
Python's regex .match() function works inconsistently with multiline strings

I'm writing a Python script that takes a directory of journal entries stored as Markdown files and processes each file to create an object from it. These objects are appended to a list of journal entry objects. Each object contains three fields: title, date and body.

To create this list of entry objects, I loop over each file in the directory and append to the list the return value of a function called entry_create_object, which takes the file text as input.

import os

def load_entries(directory):
    entries = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)

        # Read the whole file and try to build an entry object from its text
        with open(filepath, 'r') as f:
            text = f.read()
            entry_object = entry_create_object(text)
            if entry_object: entries.append(entry_object)
            else: print(f"Couldn't read {filepath}")
    return entries

To create the object, I use regular expressions to find the information I need for the title and date fields. The body is just the file contents. The function returns None if it doesn't match both title and date. This is the code I use:

def entry_create_object(ugly_entry):

    title = re.match('^# (.*)', ugly_entry)
    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
    body = ugly_entry
    if not (title and date and body):
        return

    entry_object = {}
    entry_object['title'], entry_object['date'], entry_object['body'] = title, date, body

    return entry_object

For some reason I can't understand, my regular expression for dates works for some files but not for others, even though I've been able to successfully match what I wanted by testing my pattern in an online regular expression web app such as Regexr. The title regex pattern works fine for all files.

I've found in my testing that re.match is very inconsistent with multiline strings overall, but I haven't been able to find a way of fixing it.

I can't see anything wrong with my pattern.

Example of a file that successfully matches both title and date:

# Time tracker

Created at: Oct 21, 2020 4:16 PM
Date: Oct 21, 2020

[...]

Example of a file that fails to match the date:

# Bad habits

Created at: Dec 6, 2020 4:24 PM
Date: Dec 6, 2020

[...]

Thank you for your time.

Let's decode the regex.

    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)

After the Date: or Created at: prefix and a space, that's three word characters, followed by a space, followed by exactly 2 digits, followed by a comma and a space, followed by 4 digits. Given that description, can you see why the following string does not match?

Created at: Dec 6, 2020 4:24 PM

I shouldn't spoil the surprise, but you want (\d{1,2}), which accepts a one- or two-digit day.
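
For illustration, here is a minimal sketch of the function with that change applied. It makes two assumptions beyond the fix itself. First, it uses re.search for the date instead of re.match: re.match only matches at the very start of the string, and in the sample files the date lines come after the title, so this is a swapped-in call rather than something the original code does. Second, it stores the matched text rather than the match objects; the names TITLE_RE and DATE_RE are hypothetical.

```python
import re

# Hypothetical pattern names. The day group now accepts one or two digits.
TITLE_RE = re.compile(r'^# (.*)')
DATE_RE = re.compile(r'(Date:|Created at:) (\w{3}) (\d{1,2}), (\d{4})')

def entry_create_object(ugly_entry):
    # The title is expected on the first line, so match() is enough here.
    title = TITLE_RE.match(ugly_entry)
    # search() scans the whole text, so the date may appear on any line.
    date = DATE_RE.search(ugly_entry)
    if not (title and date):
        return None

    month, day, year = date.group(2, 3, 4)
    return {
        'title': title.group(1),
        'date': f'{month} {day}, {year}',
        'body': ugly_entry,
    }

sample = "# Bad habits\n\nCreated at: Dec 6, 2020 4:24 PM\nDate: Dec 6, 2020\n\n[...]\n"
entry = entry_create_object(sample)
print(entry['title'], '|', entry['date'])
```

With the sample above, the single-digit day in "Dec 6, 2020" is now accepted, whereas the original (\d{2}) group requires exactly two digits.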
