Python's regex.match() function is inconsistent with multiline strings

Question

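I'm writing a script in Python that takes a directory of journal entries stored as markdown files and processes each file to create an object from it. These objects are appended to a list of journal entry objects. Each object has 3 fields: title, date, and body.

To build this list of entry objects, I loop over every file in the directory and append to the list the return value of a function called entry_create_object, which takes the file's text as input.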

import os

def load_entries(directory):
    entries = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)

        with open(filepath, 'r') as f:
            text = f.read()

        # entry_create_object returns None when the title or date is missing
        entry_object = entry_create_object(text)
        if entry_object:
            entries.append(entry_object)
        else:
            print(f"Couldn't read {filepath}")
    return entries

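To create the object, I use regular expressions to find the information needed for the title and date fields. The body is simply the file's contents. If the function can't match both the title and the date, it returns None. This is the code I'm using: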

import re

def entry_create_object(ugly_entry):
    # Raw strings keep the backslashes in \w and \d intact
    title = re.match(r'^# (.*)', ugly_entry)
    date = re.match(r'(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
    body = ugly_entry
    if not (title and date and body):
        return None

    entry_object = {}
    entry_object['title'], entry_object['date'], entry_object['body'] = title, date, body

    return entry_object

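For some reason I can't work out, my date regex works for some files but not for others, even though I've been able to match exactly what I want when testing the pattern in an online regex testing web app. The title regex pattern works for all files.

In my testing I've found that re.match is, overall, very inconsistent with multiline strings, but I haven't been able to find a way to fix it.

I can't see anything wrong with my pattern.

Example of a file where both the title and the date match: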

# Time tracker

Created at: Oct 21, 2020 4:16 PM
Date: Oct 21, 2020

[...]

Example of a file where the date doesn't match:

# Bad habits

Created at: Dec 6, 2020 4:24 PM
Date: Dec 6, 2020

[...]

Thanks for your time.

Answer 1

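Let's decode the regular expression.

    date = re.match(r'(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)

After the Date: or Created at: prefix and a space, that's three word characters, followed by a space, then exactly 2 digits, then a comma and a space, then 4 digits. Given that description, can you see why the following string doesn't match?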

Created at: Dec 6, 2020 4:24 PM

I shouldn't spoil the surprise, but you want (\d{1,2}), so that single-digit days match too.
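
For illustration, here is a minimal sketch of that fix applied to the sample file from the question. It goes one step beyond the quantifier change: it uses re.search with re.MULTILINE instead of re.match, on the assumption that the date line is not the first line of the file. re.match only ever matches at the very start of the string, so a pattern beginning with Date: cannot match text that opens with a # heading. The name parse_entry, the compiled pattern names, and the choice to store the date as a (month, day, year) tuple are hypothetical and not part of the original code.

```python
import re

# \d{1,2} accepts both "6" and "21" as the day of the month.
DATE_RE = re.compile(
    r'^(Date:|Created at:) (\w{3}) (\d{1,2}), (\d{4})',
    re.MULTILINE,  # let ^ match at the start of every line
)
TITLE_RE = re.compile(r'^# (.*)', re.MULTILINE)


def parse_entry(text):
    """Return a dict with title, date and body, or None if a field is missing."""
    title = TITLE_RE.search(text)
    date = DATE_RE.search(text)  # search scans the whole string, unlike match
    if not (title and date and text):
        return None
    month, day, year = date.group(2, 3, 4)
    return {
        'title': title.group(1),
        'date': (month, int(day), int(year)),
        'body': text,
    }


sample = "# Bad habits\n\nCreated at: Dec 6, 2020 4:24 PM\nDate: Dec 6, 2020\n"
entry = parse_entry(sample)
print(entry['title'], entry['date'])
```

Storing the extracted strings rather than the Match objects themselves keeps the entry easy to print or serialize later, but that is a design choice rather than something the fix requires.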
