
Returning specific values in a string of different format

How do I make Python automatically find a specific type of data (e.g. a date) in strings that have different formats?

Example inputs:

"-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application"
"Desktop/Application,1632,26/02"
"26/02/19 - Desktop/Application - 1632"

The output for each of these examples should be 26 Feb 19.

Related but different: Convert "unknown format" strings to datetime objects?

This problem is different because the strings are not just dates; the dates are embedded in longer strings. I treat this problem as "How to find dates in strings with inconsistent formats?"

I use dateparser 0.7.1 (see its documentation). Because the format of the strings is unknown and can differ from one string to the next, I compute all character n-grams in the string and then try to parse each one as a date. The most common date is then returned as the output. This is a slow and inefficient approach, but it is the best I can come up with for the requirements here:

  • the format is unknown
  • the strings contain more than just the dates
  • the dates can be at arbitrary positions in the string

Code below:

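from collections import Counter

import dateparser


def extract_date(val, min_date_length=5, max_date_length=15, min_year_value=2000, max_year_value=2020):
    # Collect every character n-gram with a length in [min_date_length, max_date_length).
    grams = []
    for n in range(min_date_length, max_date_length):
        grams.extend(val[i:i + n] for i in range(len(val) - n + 1))
    # Keep only the n-grams that parse as a date with a plausible year.
    # The year bounds are the original defaults; adjust them as needed.
    dates = []
    for gram in grams:
        out = dateparser.parse(gram)
        if out and min_year_value <= out.year <= max_year_value:
            dates.append(out)
    if not dates:
        return None  # nothing in the string parsed as a plausible date
    # The date produced by the largest number of n-grams wins.
    date, _count = Counter(dates).most_common(1)[0]
    return date


if __name__ == "__main__":
    examples = [
        "Feb 26 11:03 Desktop/Application",
        "Desktop/Application,1632,26/02",
        "26/02/19 - Desktop/Application - 1632",
    ]
    for val in examples:
        print(extract_date(val))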

How it works:

  • computes all character n-grams whose length lies in a range (between min_date_length and max_date_length). The range is there for efficiency: dates generally can't be arbitrarily long, and are rarely much shorter than the default minimum of 5 (though it is possible, for example if the format writes the 1st of January as 1/1)
  • uses dateparser.parse to parse each n-gram as a date, and ignores all those it cannot parse
  • filters out results whose year is too far in the past or too far in the future (this is a problem with the posted examples: 1632 is taken as the year for "Desktop/Application,1632,26/02")
  • returns the most common date found across the character n-grams
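As a rough illustration of the first step, the n-gram enumeration can be isolated as below. The helper name character_ngrams is hypothetical and not part of the answer's code; the sketch only counts candidate substrings and makes no dateparser calls.

```python
def character_ngrams(val, min_len=5, max_len=15):
    """Return every substring of val whose length is in [min_len, max_len)."""
    grams = []
    for n in range(min_len, max_len):
        # A string of length L has L - n + 1 substrings of length n.
        grams.extend(val[i:i + n] for i in range(len(val) - n + 1))
    return grams


val = "26/02/19 - Desktop/Application - 1632"
grams = character_ngrams(val)
print(len(grams))          # number of candidate substrings
print("26/02/19" in grams)  # the real date is one of the candidates
```

Each candidate costs one dateparser.parse call in the full approach, which is where the slowness comes from.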

This solution works on the three examples that were included in the question. Note again that this is a very inefficient approach and that it might not work in all situations (for example, it will break if a string contains multiple dates).
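To see why multiple dates are a problem, consider the voting step on its own. This is a small sketch with made-up vote counts (the votes list is hypothetical, standing in for the parsed n-grams): most_common(1) keeps a single winner and silently drops every other date.

```python
from collections import Counter
from datetime import date

# Hypothetical parse results for a string that contains two different dates.
votes = [date(2019, 2, 26)] * 3 + [date(2019, 3, 1)] * 2

winner, count = Counter(votes).most_common(1)[0]
print(winner, count)

# most_common() without an argument lists every candidate with its count,
# which a caller could inspect if more than one date is expected.
all_candidates = Counter(votes).most_common()
print(all_candidates)
```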

A more efficient approach is to use a regex to extract just the date substring from each string and then parse it with datetime.strptime. See "strftime() and strptime() Behavior" in the Python documentation.
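A minimal sketch of that idea follows, assuming only the three date layouts seen in the question. The pattern table, the function name find_date, and the default_year parameter are all assumptions made for illustration; default_year supplies the year for layouts that carry none.

```python
import re
from datetime import datetime

# Hypothetical pattern table: (regex, strptime format, match includes a year?).
# Order matters: the more specific dd/mm/yy pattern is tried before dd/mm.
DATE_PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{2}\b"), "%d/%m/%y", True),
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}\b"), "%b %d", False),
    (re.compile(r"\b\d{2}/\d{2}\b"), "%d/%m", False),
]


def find_date(text, default_year=2019):
    """Return the first date found in text as a datetime, or None."""
    for pattern, fmt, has_year in DATE_PATTERNS:
        for match in pattern.finditer(text):
            candidate = match.group(0)
            fmt_used = fmt
            if not has_year:
                # Append the assumed year so strptime gets a complete date.
                candidate = f"{candidate} {default_year}"
                fmt_used = fmt + " %Y"
            try:
                return datetime.strptime(candidate, fmt_used)
            except ValueError:
                continue  # looked like a date but was not valid, e.g. 99/99
    return None


examples = [
    "-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application",
    "Desktop/Application,1632,26/02",
    "26/02/19 - Desktop/Application - 1632",
]
results = [find_date(s).strftime("%d %b %y") for s in examples]
print(results)
```

Unlike the n-gram approach, this only recognizes layouts that are listed explicitly, so a new format means adding a new pattern. The month abbreviation produced by %b depends on the locale in effect.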
