Returning specific values in a string of different format
How do I make Python automatically find a specific type of data (e.g. a date) in strings that come in different formats?
Example inputs:
"-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application"
"Desktop/Application,1632,26/02"
"26/02/19 - Desktop/Application - 1632"
The output for each of these examples should be 26 Feb 19.
Related but different: Convert “unknown format” strings to datetime objects?
This problem is different because the strings are not just dates; the dates are embedded in longer strings. I treat this problem as "How to find dates in strings with inconsistent formats?"
I use dateparser 0.7.1 (see its documentation). Because the format of the strings is unknown and can differ from one string to the next, I compute all character ngrams in the string and then try to parse each of them as a date. The most common date is then returned as the output. This is a slow and inefficient approach, but it is the best I can come up with for the requirements here.
Code below:
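from collections import Counter

import dateparser


# The year bounds are the answer's original defaults; adjust them as needed.
def extract_date(val, min_date_length=5, max_date_length=15,
                 min_year_value=2000, max_year_value=2020):
    # Build every character ngram of length min_date_length up to
    # (but not including) max_date_length.
    grams = []
    for n in range(min_date_length, max_date_length):
        grams.extend(val[i:i + n] for i in range(len(val) - n + 1))

    # Keep only the ngrams that parse as a date with a plausible year.
    dates = []
    for gram in grams:
        out = dateparser.parse(gram)
        if out and min_year_value <= out.year <= max_year_value:
            dates.append(out)

    if not dates:
        return None  # nothing in the string parsed as a plausible date
    # The most frequently parsed date is taken as the answer.
    date, _count = Counter(dates).most_common(1)[0]
    return date


if __name__ == "__main__":
    examples = [
        "Feb 26 11:03 Desktop/Application",
        "Desktop/Application,1632,26/02",
        "26/02/19 - Desktop/Application - 1632",
    ]
    for val in examples:
        print(extract_date(val))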
How it works:
- Compute all character ngrams within a range of lengths (between min_date_length and max_date_length). The range is limited for efficiency reasons, and because dates generally can't be arbitrarily long or much shorter than the default of 5 (though shorter is possible, for example if the date format is 1/1 for the 1st of January).
- Use dateparser.parse to parse each ngram as a date, and ignore all those it cannot parse.
- Filter out years that are too far in the past or too far in the future (this was a problem with the posted examples: 1632 is considered the year for "Desktop/Application,1632,26/02").
This solution works on the three examples that were included in the question. Note again that this is a very inefficient approach and it might not work in all situations (for example, it will break if a string contains multiple dates).
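To get a feel for why the approach is slow, the ngram step can be sketched on its own using only the standard library. The helper name char_ngrams is made up for this illustration; it mirrors the loop in the code above, so the upper length bound is exclusive. Every ngram it returns would cost one dateparser.parse call.

```python
def char_ngrams(text, min_len=5, max_len=15):
    """Return every substring of text with min_len <= length < max_len."""
    grams = []
    for n in range(min_len, max_len):
        # For a given length n there are len(text) - n + 1 start positions.
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


example = "26/02/19 - Desktop/Application - 1632"
grams = char_ngrams(example)

# Number of candidate substrings, i.e. the number of parse attempts needed.
print(len(grams))
# The actual date substring is among the candidates.
print("26/02/19" in grams)
```

For this 37-character example the helper produces 285 candidates, and only a handful of them are real dates, which is where most of the run time goes.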
A more efficient approach is to use a regex to extract just the date substring from each string and then parse it with datetime.strptime. See "strftime() and strptime() Behavior" in the Python documentation.
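A minimal sketch of the regex plus strptime idea, under some loud assumptions: the pattern list only covers the three layouts from the question, the names PATTERNS, find_date and default_year are invented for this example, and a missing year is filled in with default_year because two of the inputs carry no year at all. Month abbreviations such as "Feb" are parsed with %b, which depends on the locale (English/C locale assumed here).

```python
import re
from datetime import datetime

# (regex, strptime format) pairs, most specific first.
# Assumption: these only cover the three example layouts from the question.
PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{2}\b"), "%d/%m/%y"),   # 26/02/19
    (re.compile(r"\b\d{2}/\d{2}\b"), "%d/%m"),            # 26/02
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}\b"), "%b %d"),  # Feb 26
]


def find_date(text, default_year=2019):
    """Return the first date found in text as a datetime, or None."""
    for pattern, fmt in PATTERNS:
        match = pattern.search(text)
        if not match:
            continue
        candidate = match.group(0)
        if "%y" not in fmt:
            # No year in the string: append the assumed year before parsing.
            candidate = f"{candidate} {default_year}"
            fmt = f"{fmt} %Y"
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # looked like a date but was not valid, e.g. 45/13
    return None


examples = [
    "-rwxr-xr-x 1 user usergrp 1632 Feb 26 11:03 Desktop/Application",
    "Desktop/Application,1632,26/02",
    "26/02/19 - Desktop/Application - 1632",
]
for line in examples:
    print(find_date(line).strftime("%d %b %y"))
```

Each string is scanned a few times instead of being parsed hundreds of times, at the cost of having to list the expected date layouts up front. New layouts need a new entry in PATTERNS.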