
RegEx for matching a datetime followed by spaces and any chars

I need to profile some data in a bucket, and have come across a bit of a dilemma. This is the type of line in each file:

"2018-09-08 10:34:49 10.0 MiB path/of/a/directory "

What's required is to capture everything after the timestamp (in the example above, "10.0 MiB path/of/a/directory"), keeping in mind that the separators are sometimes tabs and sometimes spaces.

To rephrase, I need everything after the point where the date and time end, excluding the tab or space that separates the timestamp from the rest of the line.

I tried something like this:

p = re.compile(r'^[\d\d\d\d.\d\d.\d\d\s\d\d:\d\d:\d\d].*')
for line in lines:
    print(re.findall(line))

How do I solve this problem?

EDIT: What if I also wanted to create new groups from the newly matched string? Say I wanted to rebuild it as --> 10MiB engagementName/folder/file/something.xlsx engagementName extensionType something.xlsx

RE-EDIT: The path/to/directory generally points to a file (and all files have extensions). Building on the reformatted string you have been helping me with, is there a way to extend the regex pattern so that it "creates" a new group for the fileExtensionType (I suppose by searching the end of the string for something along the lines of .anything) and adds that result to the formatted string?

Don't bother with a regular expression. You know the format of the line. Just split it:

from datetime import datetime

for l in lines:
    line_date, line_time, rest_of_line = l.split(maxsplit=2)
    print([line_date, line_time, rest_of_line])
    # ['2018-09-08', '10:34:49', '10.0 MiB path/of/a/directory']

Take special note of the use of the maxsplit argument. This prevents it from splitting the size or the path. We can do this because we know the date has one space in the middle and one space after it.

If the size will always have one space in the middle and one space following it, we can increase it to 4 splits to separate the size, too:

for l in lines:
    line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
    print([line_date, line_time, size_quantity, size_units, line_path])
    # ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/directory']

Note that extra contiguous spaces and spaces in the path don't screw it up:

l = "2018-09-08 10:34:49     10.0   MiB    path/of/a/direct       ory"
line_date, line_time, size_quantity, size_units, line_path = l.split(maxsplit=4)
print([line_date, line_time, size_quantity, size_units, line_path])
# ['2018-09-08', '10:34:49', '10.0', 'MiB', 'path/of/a/direct       ory']
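
The question mentions that the separators are sometimes tabs. A small check on a made-up tab-separated line suggests the same call copes with that, since str.split() without a sep argument treats any run of whitespace as a single separator:

```python
# Hypothetical sample line that mixes tabs and spaces as separators
l = "2018-09-08\t10:34:49\t 10.0 MiB\tpath/of/a/directory"

# With sep omitted, split() treats any run of whitespace (spaces, tabs) as one separator
line_date, line_time, rest_of_line = l.split(maxsplit=2)
print([line_date, line_time, rest_of_line])
```

Whitespace inside the remainder is left untouched, so a tab between the size and the path stays where it was.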

You can concatenate parts back together if needed:

line_size = size_quantity + ' ' + size_units


If you want the timestamp for something, you can parse it:

# 'T' could be anything, but 'T' is standard for the ISO 8601 format
timestamp = datetime.strptime(line_date + 'T' + line_time, '%Y-%m-%dT%H:%M:%S')
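
For the follow-up in the question's edits (pulling the leading folder, the extension, and the file name out of the path), one possible sketch builds on the same split. The sample line and the names engagement, file_name, and extension are assumptions for illustration. PurePosixPath comes from the standard library and only manipulates the string; it never touches the file system:

```python
from pathlib import PurePosixPath

# Hypothetical line, modelled on the example in the question's edit
l = "2018-09-08 10:34:49 10.0 MiB engagementName/folder/file/something.xlsx "

# strip() first: once maxsplit is reached, the remainder keeps trailing whitespace
line_date, line_time, size_quantity, size_units, line_path = l.strip().split(maxsplit=4)

path = PurePosixPath(line_path)
engagement = path.parts[0]   # first path component
file_name = path.name        # last path component
extension = path.suffix      # text from the last dot on, '' if there is none

print(size_quantity + size_units, line_path, engagement, extension, file_name)
```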

You might not need a regular expression for this; a string split would suffice. However, if you do want one, you don't have to anchor it at the very beginning of the line. You can simply use this expression:

(:[0-9]+\s+)(.*)$ 

You can even slightly modify it to this expression which is just a bit faster:

:([0-9]+\s+)(.*)$


Example Test:

# -*- coding: UTF-8 -*-
import re

string = "2018-09-08 10:34:49   10.0 MiB path/of/a/directory"
expression = r'(:[0-9]+\s+)(.*)$'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(2) + "\" is a match 💚 ")
else: 
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')

Output

YAAAY! "10.0 MiB path/of/a/directory" is a match 💚 
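
To keep building on the regular expression, as the question's re-edit asks, the second group could be broken into named groups. This is only a sketch: the group names and the sample path are made up, and it assumes the size is always written as a quantity, whitespace, and a unit that contains no spaces:

```python
import re

# Hypothetical sample line; the path is taken from the question's edit
string = "2018-09-08 10:34:49   10.0 MiB engagementName/folder/file/something.xlsx"

pattern = re.compile(
    r':[0-9]+\s+'                 # end of the timestamp, as in the expression above
    r'(?P<size>\S+\s+\S+)\s+'     # size: quantity, whitespace, unit
    r'(?P<path>.*?'               # path, which may contain spaces
    r'(?P<ext>\.[^./\s]+)?)\s*$'  # optional extension: last dot in the final component
)

match = pattern.search(string)
if match:
    print(match.group('size'), match.group('path'), match.group('ext'))
```

If a path has no extension, the ext group is simply None, so check for that before using it.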

JavaScript Performance Benchmark

This snippet is a JavaScript performance test that runs the replacement 10 million times on your input string:

var repeat = 10000000;
var start = Date.now();

for (var i = repeat; i >= 0; i--) {
    var string = "2018-09-08 10:34:49 10.0 MiB path/of/a/directory";
    var regex = /(.*)(:[0-9]+\s+)(.*)/g;
    var match = string.replace(regex, "$3");
}

var end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

Edit:

You might match only the end of the timestamp. Because the expression then has fewer constraints, it is simpler and faster, and it still works if the date part shows up in an unexpected format:

2019/12/15 10:00:00     **desired output**
2019-12-15    10:00:00     **desired output**
2019-12-15, 10:00:00     **desired output**
2019-12 15 10:00:00     **desired output**
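
A quick way to check this is to run the expression over the four variants above. The loop below is only a sketch; the literal text "desired output" (without the asterisks) stands in for the real remainder of each line:

```python
import re

# The four timestamp variants listed above, with a stand-in payload
lines = [
    "2019/12/15 10:00:00     desired output",
    "2019-12-15    10:00:00     desired output",
    "2019-12-15, 10:00:00     desired output",
    "2019-12 15 10:00:00     desired output",
]

expression = re.compile(r':([0-9]+\s+)(.*)$')

# Group 2 is whatever follows the seconds field and its trailing whitespace
results = [expression.search(line).group(2) for line in lines]
print(results)
```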
