Find date within strings using regex in both Python and grep

I have a log with entries in the following format:

1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4
1533528768  4   2   Thu Jan  5 19:17:45 2017    534040012   3
...

How do I fetch only the timestamp component (e.g. Wed Jan 4 11:17:12 2017) using regular expressions?

The final product has to be implemented in Python, but part of it must first run in an automated regression suite written in bash/perl.

If the format is fixed in terms of space delimiters, you can simply split the line, take the slice of fields that makes up the date string, and load it into a datetime object via datetime.strptime():

In [1]: from datetime import datetime

In [2]: s = "1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4"

In [3]: date_string = ' '.join(s.split()[3:8])

In [4]: datetime.strptime(date_string, "%a %b %d %H:%M:%S %Y")
Out[4]: datetime.datetime(2017, 1, 4, 11, 17, 12)
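
The split-and-slice idea above can be wrapped in a small helper that also copes with short or malformed lines. This is only a sketch: the function name parse_timestamp_by_split and the constant TIMESTAMP_FORMAT are made up for illustration, and it assumes the timestamp always occupies whitespace-separated fields 3 to 7 and that the weekday and month names match the current (English) locale.

```python
from datetime import datetime
from typing import Optional

# Same format string as in the answer above.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_timestamp_by_split(line: str) -> Optional[datetime]:
    """Return the timestamp held in fields 3..7 of a log line, or None."""
    fields = line.split()
    if len(fields) < 8:
        # Not enough columns to contain a full timestamp.
        return None
    try:
        return datetime.strptime(" ".join(fields[3:8]), TIMESTAMP_FORMAT)
    except ValueError:
        # The five fields were present but did not form a valid date.
        return None


print(parse_timestamp_by_split(
    "1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4"))
print(parse_timestamp_by_split("malformed line"))
```

Returning None instead of raising keeps a loop over a large log file going when a single line is damaged; whether that is the right behaviour depends on how strict the regression suite needs to be.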

The regex to match the timestamp (in Python/Perl syntax) is:

[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}

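With grep it can be used like this (if your log file is called log.txt). POSIX extended regular expressions do not define \d, so the digits are written as [0-9] here; with GNU grep you could alternatively keep \d and use -P instead of -E:

$ grep -oE '[a-zA-Z]{3} +[a-zA-Z]{3} +[0-9]{1,2} +[0-9]{2}:[0-9]{2}:[0-9]{2} +[0-9]{4}' log.txt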
# Wed Jan  4 11:17:12 2017
# Thu Jan  5 19:17:45 2017

In Python you can use it like so:

import re

log_entry = "1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4"

# Raw string so that \d reaches the regex engine unchanged
pattern = r'[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}'
compiled = re.compile(pattern)
match = compiled.search(log_entry)
match.group(0)
# 'Wed Jan  4 11:17:12 2017'

You can use this to get an actual datetime object from the string (expanding on above code):

from datetime import datetime
import re

log_entry = "1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4"

# Raw string so that \d reaches the regex engine unchanged
pattern = r'[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}'
compiled = re.compile(pattern)
match = compiled.search(log_entry)

log_time_str = match.group(0)
datetime.strptime(log_time_str, "%a %b %d %H:%M:%S %Y")
# datetime.datetime(2017, 1, 4, 11, 17, 12)
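
To handle a whole log rather than a single entry, the same pattern can be applied with finditer. The sketch below is one possible arrangement, not the answerer's code: the names TIMESTAMP_RE, extract_timestamps and sample_log are invented here, and the sample text is just the two lines from the question.

```python
import re
from datetime import datetime

# Same pattern as above, compiled once and reused for every match.
TIMESTAMP_RE = re.compile(
    r"[a-zA-Z]{3} +[a-zA-Z]{3} +\d{1,2} +\d{2}:\d{2}:\d{2} +\d{4}"
)


def extract_timestamps(text):
    """Return a datetime for every timestamp-shaped substring in text."""
    return [
        datetime.strptime(m.group(0), "%a %b %d %H:%M:%S %Y")
        for m in TIMESTAMP_RE.finditer(text)
    ]


sample_log = (
    "1483528632  3   1   Wed Jan  4 11:17:12 2017    501040002   4\n"
    "1533528768  4   2   Thu Jan  5 19:17:45 2017    534040012   3\n"
)

for ts in extract_timestamps(sample_log):
    print(ts.isoformat())
# 2017-01-04T11:17:12
# 2017-01-05T19:17:45
```

Note that the regex only checks the shape of the text, so a match such as "Foo Bar 99 99:99:99 2017" would still make strptime raise ValueError; wrap the call in try/except if the log may contain such strings.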

Grep is most often used in this scenario if you are working with syslog, but as the post is also tagged with Python, this example uses regular expressions with re:

import re

Define the pattern to match:

# Raw string; \d{1,2} accepts both one- and two-digit days
pat = r"\w{3}\s\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2}\s\d{4}"

Then use re.findall to return all non-overlapping matches of the pattern in txt (a string holding the contents of the log):

re.findall(pat, txt)

Output:

['Wed Jan  4 11:17:12 2017', 'Thu Jan  5 19:17:45 2017']

If you then want to use datetime:

import datetime

dates = re.findall(pat,txt)

datetime.datetime.strptime(dates[0], "%a %b %d %H:%M:%S %Y")

Output:

datetime.datetime(2017, 1, 4, 11, 17, 12)

You can then utilise these datetime objects:

# Parse once, then split into date and time parts
parsed = datetime.datetime.strptime(dates[0], "%a %b %d %H:%M:%S %Y")
dateObject = parsed.date()
timeObject = parsed.time()

print('The date is {} and time is {}'.format(dateObject, timeObject))

Output:

The date is 2017-01-04 and time is 11:17:12
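
As a further illustration of what the parsed objects allow, datetime values can be compared and subtracted directly. This is a minimal sketch that hard-codes the two strings returned by re.findall above; the variable names FMT, first, second and elapsed are illustrative only.

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S %Y"

# The two timestamps found in the sample log.
dates = ['Wed Jan  4 11:17:12 2017', 'Thu Jan  5 19:17:45 2017']

first = datetime.strptime(dates[0], FMT)
second = datetime.strptime(dates[1], FMT)

# Subtracting two datetimes gives a timedelta.
elapsed = second - first

print(first < second)           # True
print(elapsed)                  # 1 day, 8:00:33
print(elapsed.total_seconds())  # 115233.0
```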

Two approaches: with and without using regular expressions
1) Using the re.findall() function:

import re

with open('test.log', 'r') as fh:
    # \s+ before the day allows both "Jan  4" and "Jan 15"
    lines = re.findall(r'\b[A-Za-z]{3}\s[A-Za-z]{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}\b', fh.read())

print(lines)

2) Using the str.split() and str.join() functions:

with open('test.log', 'r') as fh:
    # Fields 3..7 of each line hold the timestamp
    lines = [' '.join(d.split()[3:8]) for d in fh]

print(lines)

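The output in both cases will be as below (the split/join version collapses the double space before a single-digit day into a single space):

['Wed Jan  4 11:17:12 2017', 'Thu Jan  5 19:17:45 2017']

With grep, to print the lines that contain a timestamp (here the log file is called dates):

grep -E '\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}\b' dates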

If you only want to list the dates, rather than grep for the matching lines, perhaps:

sed -nre 's/^.*([A-Za-z]{3}\s+[A-Za-z]{3}\s+[0-9]+\s+[0-9]+:[0-9]+:[0-9]+\s+[0-9]{4}).*$/\1/p' filename
