
Python's regex .match() function works inconsistently with multiline strings

I'm writing a Python script that takes a directory of journal entries stored as markdown files and processes each file to create an object from it. These objects are appended to a list of journal entry objects. Each object has three fields: title, date and body.

To build this list, I loop over each file in the directory and append the return value of a function called entry_create_object, which takes the file text as input.

import os
import re

def load_entries(directory):
    entries = []

    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename) 

        with open(filepath, 'r') as f:
            text = f.read()
            entry_object = entry_create_object(text)
            if entry_object: entries.append(entry_object)
            else: print(f"Couldn't read {filepath}")
    return entries 

In order to create the object, I use regular expressions to find the information I need for the title and date fields. The body is just the file contents. The function returns None if it doesn't match title and date. The following code is what I use for this:

def entry_create_object(ugly_entry):

    title = re.match('^# (.*)', ugly_entry)
    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
    body = ugly_entry
    if not (title and date and body):
        return

    entry_object = {}
    entry_object['title'], entry_object['date'], entry_object['body'] = title, date, body

    return entry_object

For some reason I can't understand, my regular expression for dates works for some files but not for others, even though I've been able to successfully match what I wanted when testing the pattern in an online regular expression tool such as Regexr. The title pattern works fine for all files.

I've found in my testing that re.match is very inconsistent with multiline strings overall, but I haven't been able to find a way of fixing it.

I can't see anything wrong with my pattern.

Example of a file that successfully matches both title and date:

# Time tracker

Created at: Oct 21, 2020 4:16 PM
Date: Oct 21, 2020

[...]

Example of a file that fails to match the date:

# Bad habits

Created at: Dec 6, 2020 4:24 PM
Date: Dec 6, 2020

[...]

Thank you for your time.

Let's decode the regex.

    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)

That's "Date:" or "Created at:", followed by a space, followed by three word characters, followed by a space, followed by exactly 2 digits, followed by comma space, followed by 4 digits. Given that description, can you see why the following string does not match?

Created at: Dec 6, 2020 4:24 PM

I shouldn't spoil the surprise, but you want (\d{1,2}) in place of (\d{2}), so that a one-digit day such as 6 also matches.
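
To make that concrete, here is a minimal sketch of the function with the day quantifier relaxed. A few things in it are my assumptions rather than anything from the question: the patterns are written as raw strings, the dictionary stores the captured text instead of the match objects, and the date lookup uses re.search. That last choice is worth checking against your real code: re.match only attempts a match at the very start of the string, so a date line that comes after the title would need re.search (or similar) to be found at all.

```python
import re

# Assumed fix: \d{1,2} accepts one- or two-digit days. Raw strings keep
# \w and \d from being treated as (invalid) string escapes.
TITLE_RE = re.compile(r'^# (.*)')
DATE_RE = re.compile(r'(Date:|Created at:) (\w{3}) (\d{1,2}), (\d{4})')


def entry_create_object(ugly_entry):
    # re.match only tries position 0, which suits a title on the first line.
    title = TITLE_RE.match(ugly_entry)
    # The date sits further down the text, so scan for it with search().
    date = DATE_RE.search(ugly_entry)
    if not (title and date and ugly_entry):
        return None
    return {
        'title': title.group(1),
        # Assumption: keep month, day and year as one plain string.
        'date': ' '.join(date.group(2, 3, 4)),
        'body': ugly_entry,
    }


# The failing example from the question, with a one-digit day.
sample = "# Bad habits\n\nCreated at: Dec 6, 2020 4:24 PM\nDate: Dec 6, 2020\n"
entry = entry_create_object(sample)
print(entry['title'], '|', entry['date'])
```

With the original (\d{2}) the same sample finds no date, because "6" is only one digit.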
