Returning specific values in a string of different format

How do I make Python automatically find a specific type of data (e.g. a date) in strings that come in different formats?

Example inputs:

"-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application"
"Desktop/Application,1632,26/02"
"26/02/19 - Desktop/Application - 1632"

The output for each of these examples should be 26 Feb 19.

Related but different: Convert “unknown format” strings to datetime objects?

This problem is different because the dates are embedded in longer strings rather than making up the whole string. I treat it as "How to find dates in strings with inconsistent formats?"

I use dateparser 0.7.1 (see its documentation). Because the format is unknown and can differ from one string to the next, I compute all character n-grams of the string and try to parse each one as a date. The most common parsed date is then returned as the result. This is slow and inefficient, but it is the best I can come up with for the requirements here:

  • unknown format
  • strings contain not just the dates
  • the dates can be in arbitrary positions in the string

Code below:

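from collections import Counter

import dateparser


def extract_date(val, min_date_length=5, max_date_length=15, min_year_value=2000, max_year_value=2020):
    # Collect every substring whose length is in [min_date_length, max_date_length).
    grams = []
    for n in range(min_date_length, max_date_length):
        grams.extend(val[i:i + n] for i in range(len(val) - n + 1))
    # Keep only the substrings that parse as a date with a plausible year.
    dates = []
    for gram in grams:
        out = dateparser.parse(gram)
        if out and min_year_value <= out.year <= max_year_value:
            dates.append(out)
    # Nothing parsed: return None instead of raising IndexError below.
    if not dates:
        return None
    # The date produced by the largest number of n-grams wins.
    date, _count = Counter(dates).most_common(1)[0]
    return date


if __name__ == "__main__":
    examples = [
        "Feb 26 11:03 Desktop/Application",
        "Desktop/Application,1632,26/02",
        "26/02/19 - Desktop/Application - 1632",
    ]
    for example in examples:
        print(extract_date(example))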

How it works:

  • compute all character n-grams with a length from min_date_length up to, but not including, max_date_length. The range is limited for efficiency, and because dates are rarely very long or much shorter than the default minimum of 5 (although shorter ones are possible, for example 1/1 for the 1st of January)
  • use dateparser.parse to parse each n-gram as a date, and ignore the ones it cannot parse
  • filter out results whose year is too far in the past or in the future (this matters for the posted examples: 1632 is otherwise taken as the year in "Desktop/Application,1632,26/02")
  • return the date that was found most often among the n-grams
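
The first step can be looked at in isolation. The following is only a sketch of the n-gram generation using the standard library; the helper name char_ngrams is made up for illustration and is not part of dateparser:

```python
def char_ngrams(text, min_len, max_len):
    """Return every substring of text with a length in [min_len, max_len)."""
    grams = []
    for n in range(min_len, max_len):
        # Slide a window of width n over the string.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


# A short input keeps the list small enough to inspect by hand.
grams = char_ngrams("26/02/19 - A", 5, 9)
print(len(grams))
print("26/02/19" in grams)
```

The number of n-grams grows with both the string length and the width of the length range, and each one is passed to the parser, which is why this approach is slow.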

This solution works on the three examples included in the question. Note again that it is a very inefficient approach and it will not work in every situation (for example, it breaks when a string contains more than one date).

A more efficient approach is to use a regex to extract just the date substring from each string and then parse it with datetime.strptime. See "strftime() and strptime() Behavior" in the Python documentation.
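
A minimal sketch of that idea, assuming only the three date layouts from the question can occur. The PATTERNS table, the find_date name and the default_year parameter are assumptions made for illustration. The two shorter layouts carry no year, so one has to be supplied from elsewhere:

```python
import re
from datetime import datetime

# Hypothetical table: (regex, strptime format, whether the match contains a year).
# Order matters: the full dd/mm/yy pattern is tried before the shorter dd/mm.
PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{2}\b"), "%d/%m/%y", True),   # 26/02/19
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}\b"), "%b %d", False),  # Feb 26
    (re.compile(r"\b\d{2}/\d{2}\b"), "%d/%m", False),            # 26/02
]


def find_date(text, default_year=2019):
    """Return the first date found in text, or None if no pattern matches."""
    for pattern, fmt, has_year in PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue
        candidate = match.group(0)
        if not has_year:
            # Append the assumed year so strptime always sees a complete date.
            candidate = f"{candidate} {default_year}"
            fmt = fmt + " %Y"
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            # Looked like a date but was not one (e.g. 99/99); try the next pattern.
            continue
    return None


for line in [
    "-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application",
    "Desktop/Application,1632,26/02",
    "26/02/19 - Desktop/Application - 1632",
]:
    print(find_date(line))
```

Formatting the result with strftime("%d %b %y") gives the requested 26 Feb 19 form in an English locale (%b is locale-dependent). Unlike the n-gram approach, this only recognises layouts that were listed in advance, so every new format needs its own pattern.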

