Issue extracting date part from various formats in an unstructured text

I am trying to extract only the date part from a bunch of unstructured text.

The issue is that the date could be in any of the following formats:

  • Jan. 16 or Jan 16 2017 (for January 16th, 2017)
  • January 2, 2017
  • 02/01/2017 (dd/mm/yyyy)
  • 01/02/2017 (mm/dd/yyyy)
  • 01-02-17 (mm-dd-yy)

Sample text:

x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

What I have tried so far is one of the options from the examples in this answer:

gsub(".*[(]|[)].*", "", string)
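For reference, that pattern keeps only the text between parentheses, so on the sample text (which contains no parentheses) it returns the string unchanged. A quick check in base R; the second input string is made up purely for contrast:

```r
x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

# Neither alternative can match without a "(" or ")", so x comes back unchanged.
unchanged <- identical(gsub(".*[(]|[)].*", "", x), x)
unchanged  # TRUE

# On a (hypothetical) string that does contain parentheses, only their content survives.
inside <- gsub(".*[(]|[)].*", "", "for January 16th (Jan 16 2017) only")
inside  # "Jan 16 2017"
```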

Is there any other, more general approach?

First of all, without knowing the date format you cannot tell which part is the day and which is the month. Take 02/03/2002: it could be 2 March or 3 February. And if the year can also be two digits, as in dd/mm/yy, yy/mm/dd or mm/yy/dd, you cannot say which part is the day, which is the month and which is the year.
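The ambiguity is easy to see in base R: the same string parses to two different dates depending on which format is assumed. A minimal sketch with `as.Date`; both format strings are assumptions about what the source might have meant:

```r
# The same text parsed under two assumed formats gives two different dates.
s <- "02/03/2002"

day_first   <- as.Date(s, format = "%d/%m/%Y")  # read as 2 March 2002
month_first <- as.Date(s, format = "%m/%d/%Y")  # read as 3 February 2002

format(day_first)    # "2002-03-02"
format(month_first)  # "2002-02-03"
```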

Taking all of this into account, some strings may come from a third party, leaving you no way to determine the format, so no solution can guarantee to identify the day, month or year for you.

But it is possible to identify all the digit patterns that you have mentioned. The following solution gives you three groups: for every format you listed, the three parts of the date end up in groups 1, 2 and 3. You will then have to analyze or guess which one is the day, which one is the month and which one is the year. That can't be covered by regex.
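One possible heuristic for that guessing step, sketched below, relies on the fact that a value above 12 cannot be a month. The helper name `guess_order` and its return labels are hypothetical, and it only settles the order when exactly one of the two numbers exceeds 12; everything else stays ambiguous.

```r
# Hypothetical helper (not part of the original answer): guess the order of the
# first two numeric date parts. A value above 12 cannot be a month.
guess_order <- function(a, b) {
  a <- as.integer(a)
  b <- as.integer(b)
  if (a > 12 && b <= 12) {
    "day-month"
  } else if (b > 12 && a <= 12) {
    "month-day"
  } else if (a <= 12 && b <= 12) {
    "ambiguous"   # both could be a month; the data alone cannot decide
  } else {
    "invalid"     # neither part can be a month
  }
}

guess_order("16", "01")  # "day-month"
guess_order("01", "16")  # "month-day"
guess_order("02", "01")  # "ambiguous"
```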

Taking all these facts into account, you may try the following regex:

((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\.?)|(?:\d{1,2}))[\/ ,-](\d{1,2})(?:[\/ ,-]\s*(\d{4}|\d{2}))?

Regex 101 demo
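To make the pattern easier to read, it can be assembled from labelled pieces. The sketch below rebuilds the same expression (with the case-insensitive `(?i)` prefix) and runs it on the question's sample text. The piece names are illustrative only, and base R's `gregexpr`/`regmatches` with `perl = TRUE` are used here instead of stringr so the snippet has no dependencies.

```r
# Build the pattern from labelled pieces (names are illustrative only).
month_name <- "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?"
first_part <- paste0("((?:", month_name, ")|(?:\\d{1,2}))")  # group 1: month name or 1-2 digits
sep        <- "[\\/ ,-]"                                     # slash, space, comma or hyphen
second     <- "(\\d{1,2})"                                   # group 2: 1-2 digits
third      <- paste0("(?:", sep, "\\s*(\\d{4}|\\d{2}))?")    # group 3: optional 4- or 2-digit part
patt       <- paste0("(?i)", first_part, sep, second, third)

x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

# Full matches only; perl = TRUE is needed for (?i), (?:...) and \d.
found <- regmatches(x, gregexpr(patt, x, perl = TRUE))[[1]]
found
```

On the sample text this should return the five date strings in order of appearance.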

Sample source (run here):

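library(stringr)
# Sample text covering every format from the question
str <- "Jan. 16  bla bla bla Jan 16 2017 bla bla bla January 2, 2017 bla bla bla 02/01/2017 bla bla bla 01/02/2017 bla bla bla 01-02-17 bla bla bla jan. 16 There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
# (?i) makes the month names case-insensitive; backslashes are doubled for R strings
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"
# str_match_all() returns a list with one matrix per input string:
# column 1 is the full match, columns 2-4 are the three capture groups
result <- str_match_all(str, patt)
result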

Sample output:

      [,1]              [,2]      [,3] [,4]  
 [1,] "Jan. 16"         "Jan."    "16" ""    
 [2,] "Jan 16 2017"     "Jan"     "16" "2017"
 [3,] "January 2, 2017" "January" "2"  "2017"
 [4,] "02/01/2017"      "02"      "01" "2017"
 [5,] "01/02/2017"      "01"      "02" "2017"
 [6,] "01-02-17"        "01"      "02" "17"  
 [7,] "jan. 16"         "jan."    "16" ""    
 [8,] "Jan 2, 2017"     "Jan"     "2"  "2017"
 [9,] "02/01/2017"      "02"      "01" "2017"
[10,] "01/02/17"        "01"      "02" "17"  
[11,] "Jan. 16"         "Jan."    "16" ""    
[12,] "01-02-2017"      "01"      "02" "2017"
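
Once the three parts are captured, turning them into actual dates still requires choosing a convention. The sketch below is one way to do that in base R. The helper `parts_to_date` is hypothetical, and it assumes that numeric dates are month-first, that two-digit years belong to the 2000s, and that a missing year defaults to 2017. None of these assumptions come from the data itself.

```r
# Hypothetical helper: convert the three captured parts into a Date.
# Assumptions (not from the original answer): numeric dates are month-first,
# two-digit years mean 20xx, and a missing year falls back to default_year.
parts_to_date <- function(p1, p2, p3, default_year = 2017L) {
  # Month: look up the first three letters of a name, otherwise read the number.
  month <- match(tolower(substr(p1, 1, 3)), tolower(month.abb))
  if (is.na(month)) month <- as.integer(p1)
  day <- as.integer(p2)
  # The third group may be missing (NA) or empty when no year was matched.
  year <- if (is.na(p3) || !nzchar(p3)) default_year else as.integer(p3)
  if (year < 100L) year <- year + 2000L
  as.Date(sprintf("%04d-%02d-%02d", year, month, day))
}

format(parts_to_date("Jan.", "16", ""))        # "2017-01-16"
format(parts_to_date("January", "2", "2017"))  # "2017-01-02"
format(parts_to_date("01", "02", "17"))        # "2017-01-02"
```

Under a day-first convention the last call would instead mean 1 February 2017, which is exactly the ambiguity described above.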
