How do I make Python automatically find a specific type of data (e.g. a date) in strings of different formats?
Example inputs:
"-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application"
"Desktop/Application,1632,26/02"
"26/02/19 - Desktop/Application - 1632"
Output for these examples should be 26 Feb 19.
Related but different: Convert “unknown format” strings to datetime objects?
This problem is different because the strings are not just dates; the dates are embedded in longer strings.
I treat this problem as "How to find dates in strings with inconsistent formats?"
I use dateparser 0.7.1 (see its documentation). Because the format of the strings is unknown and can differ from string to string, I calculate all character ngrams in the string and then try to parse each one as a date. The most common date is then returned as the correct output. This is a slow and inefficient approach, but it is the best I can come up with for the requirements here.
Code below:
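    from collections import Counter

    import dateparser


    def extract_date(min_date_length=5, max_date_length=15, min_year_value=2000, max_year_value=2020):
        # Example inputs. Only the last assignment takes effect;
        # comment out the later ones to try a different example.
        val = "Feb 26 11:03 Desktop/Application"
        val = "Desktop/Application,1632,26/02"
        val = "26/02/19 - Desktop/Application - 1632"

        # Collect every character ngram with a length in [min_date_length, max_date_length).
        grams = []
        for n in range(min_date_length, max_date_length):
            grams.extend(val[i:i + n] for i in range(len(val) - n + 1))

        # Keep the ngrams that parse as a date with a year in the accepted range.
        dates = []
        for gram in grams:
            out = dateparser.parse(gram)
            if out and min_year_value <= out.year <= max_year_value:
                dates.append(out)

        # The most frequently parsed date is taken as the answer.
        # Note: this raises IndexError if no ngram parsed as a date.
        date, _count = Counter(dates).most_common(1)[0]
        print(date)
        return date


    if __name__ == "__main__":
        extract_date()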
How it works:
- Calculate all character ngrams in the string (limited by min_date_length and max_date_length). The limits are there for efficiency, and because dates generally can't be arbitrarily long or much shorter than the default of 5 (though shorter dates are possible, for example 1/1 for the 1st of January).
- Use dateparser.parse to parse each ngram as a date, and ignore all those which it cannot parse.
- Ignore parsed dates with an unreasonable year (for example, 1632 is considered as the year for "Desktop/Application,1632,26/02").
- Return the most common of the remaining dates.

This solution works on the three examples that were included in the question. Note again that this is a very inefficient approach and it might not work in all situations (for example, it will break if a string contains multiple dates).
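To give a feel for why this is slow, the ngram step can be sketched on its own with only the standard library. The helper name char_ngrams is made up for this illustration; it mirrors the loop in the code above and does not call dateparser.

```python
def char_ngrams(text, min_len=5, max_len=15):
    """Return every substring of text with a length in [min_len, max_len)."""
    grams = []
    for n in range(min_len, max_len):
        # Slide a window of width n over the string.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


grams = char_ngrams("Desktop/Application,1632,26/02")
print(len(grams))        # number of candidate substrings
print("26/02" in grams)  # the date substring is among them
```

For this 30-character string the helper yields 215 candidates, and each one is passed to dateparser.parse in the full solution, which is where the time goes.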
A more efficient approach is to use a regex to extract just the date strings from each string and then use datetime.strptime. See "strftime() and strptime() Behavior" in the Python documentation.
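A minimal sketch of that regex plus strptime idea follows. The pattern list, the find_date name and the default_year parameter are assumptions made for this illustration; they cover only the three layouts from the question, and the %b directive assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical (pattern, strptime format, needs_year) entries.
# The most specific pattern comes first so "26/02/19" is not read as "26/02".
PATTERNS = [
    (re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2})\b"), "%d/%m/%y", False),
    (re.compile(r"\b(\d{1,2}/\d{1,2})\b"), "%d/%m", True),
    (re.compile(r"\b([A-Z][a-z]{2} \d{1,2})\b"), "%b %d", True),
]


def find_date(text, default_year=2019):
    """Return the first date found in text, or None if nothing matches."""
    for regex, fmt, needs_year in PATTERNS:
        match = regex.search(text)
        if match is None:
            continue
        candidate = match.group(1)
        if needs_year:
            # The input carries no year, so an assumed one is appended.
            candidate = f"{candidate} {default_year}"
            fmt = f"{fmt} %Y"
        try:
            return datetime.strptime(candidate, fmt).date()
        except ValueError:
            # Looked like a date but was not valid (e.g. "99/99"); try the next pattern.
            continue
    return None


examples = [
    "-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application",
    "Desktop/Application,1632,26/02",
    "26/02/19 - Desktop/Application - 1632",
]
for line in examples:
    print(find_date(line).strftime("%d %b %y"))
```

Unlike the ngram approach, this only recognises layouts that have been listed explicitly, so any new format needs its own pattern. The year for yearless inputs is a guess supplied by the caller rather than something recovered from the string.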