Issue extracting the date part from various formats in unstructured text

I am trying to extract only the date part from a bunch of unstructured text.

The issue is that the date could be in any of the following formats:

  • Jan. 16 or Jan 16 2017 (for January 16th, 2017)
  • January 2, 2017
  • 02/01/2017 (dd/mm/yyyy)
  • 01/02/2017 (mm/dd/yyyy)
  • 01-02-17 (mm-dd-yy)

Sample Text:

x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

What I tried is one of the options from the examples in this answer:

gsub(".*[(]|[)].*", "", string)

Is there a more general approach?

First of all, without knowing the date format, you cannot tell which part is the day and which is the month: 02/03/2002 could be either 2 March or 3 February. If the year can also be two digits (e.g. dd/mm/yy, yy/mm/dd, or mm/yy/dd), you cannot tell which part is the day, which is the month, and which is the year.
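To make the ambiguity concrete, here is a minimal base R sketch. It parses the same string under two assumed formats; the variable names are illustrative only.

```r
# The same string yields two different dates depending on the assumed format.
s <- "02/03/2002"

as_dmy <- as.Date(s, format = "%d/%m/%Y")  # assume day first
as_mdy <- as.Date(s, format = "%m/%d/%Y")  # assume month first

format(as_dmy)  # "2002-03-02"
format(as_mdy)  # "2002-02-03"
```

Both calls succeed, so nothing in the string itself tells you which reading is the intended one.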

Taking all of this into account, the strings may come from a third party, leaving you no way to determine the format, so no solution can guarantee to identify the day, month, or year for you.

But it is possible to identify all the date patterns that you have mentioned. The following solution gives you three capture groups: for every format you listed, the three parts of the date end up in groups 1, 2, and 3 (group 3 is empty when no year is present). You will still have to analyze or guess which one is the day, which one is the month, and which one is the year. That can't be covered by regex.
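One possible way to do that guessing for purely numeric parts is sketched below. The helper `guess_order` is hypothetical (it is not part of any package), and it rests on one assumption: a value above 12 cannot be a month. When both values are 12 or less it can only report that the order is ambiguous.

```r
# Hypothetical helper: guess the order of the first two numeric parts of a date.
guess_order <- function(first, second) {
  a <- as.integer(first)
  b <- as.integer(second)
  # Neither a day nor a month can lie outside 1..31.
  if (is.na(a) || is.na(b) || a < 1 || b < 1 || a > 31 || b > 31) return("invalid")
  if (a > 12 && b > 12) return("invalid")  # no part can be the month
  if (a > 12) return("day-month")          # first part must be the day
  if (b > 12) return("month-day")          # second part must be the day
  "ambiguous"                              # both parts could be a month
}

guess_order("16", "01")  # "day-month"
guess_order("01", "16")  # "month-day"
guess_order("02", "01")  # "ambiguous"
```

For the ambiguous case you would need outside knowledge, such as the locale of the source or other dates in the same document.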

Taking all these facts into account, you may try the following regex:

((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\.?)|(?:\d{1,2}))[\/ ,-](\d{1,2})(?:[\/ ,-]\s*(\d{4}|\d{2}))?
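If you only need the matched date strings and would rather not load a package, the same pattern can be tried in base R. This is a sketch that assumes the PCRE engine (`perl = TRUE`) and prefixes the pattern with `(?i)` for case-insensitive matching; `x` is the sample text from the question.

```r
x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

# Backslashes are doubled because this is an R string literal.
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"

# gregexpr() locates every match; regmatches() extracts the matched substrings.
dates <- regmatches(x, gregexpr(patt, x, perl = TRUE))[[1]]
dates
# "Jan 2, 2017" "02/01/2017" "01/02/17" "Jan. 16" "01-02-2017"
```

This returns whole matches only, not the three capture groups.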

Sample source:

library(stringr)
# Test string: every format from the question, followed by the question's sample text
str <- "Jan. 16  bla bla bla Jan 16 2017 bla bla bla January 2, 2017 bla bla bla 02/01/2017 bla bla bla 01/02/2017 bla bla bla 01-02-17 bla bla bla jan. 16 There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
# (?i) makes the match case-insensitive; backslashes are doubled inside the R string
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"
# str_match_all() returns one matrix per input string: full match, then groups 1 to 3
result <- str_match_all(str, patt)
result

Sample Output:

      [,1]              [,2]      [,3] [,4]  
 [1,] "Jan. 16"         "Jan."    "16" ""    
 [2,] "Jan 16 2017"     "Jan"     "16" "2017"
 [3,] "January 2, 2017" "January" "2"  "2017"
 [4,] "02/01/2017"      "02"      "01" "2017"
 [5,] "01/02/2017"      "01"      "02" "2017"
 [6,] "01-02-17"        "01"      "02" "17"  
 [7,] "jan. 16"         "jan."    "16" ""    
 [8,] "Jan 2, 2017"     "Jan"     "2"  "2017"
 [9,] "02/01/2017"      "02"      "01" "2017"
[10,] "01/02/17"        "01"      "02" "17"  
[11,] "Jan. 16"         "Jan."    "16" ""    
[12,] "01-02-2017"      "01"      "02" "2017"
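Once you have the three groups, a next step could be to normalise them. The base R sketch below uses a few rows copied from columns 2 to 4 of the output above. The helper `month_number` is hypothetical, and expanding two-digit years to the 2000s is an assumption that may not hold for your data.

```r
# A few rows taken from columns 2-4 of the match matrix above.
parts <- rbind(
  c("Jan.",    "16", ""),
  c("January", "2",  "2017"),
  c("01",      "02", "17")
)

# Hypothetical helper: map a month name (any case, short or full) to 1..12.
# Purely numeric input gives NA, meaning "not a month name".
month_number <- function(s) match(tolower(substr(s, 1, 3)), tolower(month.abb))

named <- month_number(parts[, 1])
named  # 1 1 NA

# Assumption: every two-digit year belongs to the 2000s.
yr <- parts[, 3]
yr <- ifelse(nchar(yr) == 2, paste0("20", yr), yr)
yr  # "" "2017" "2017"
```

Rows where `named` is `NA` are the numeric dates, for which the day and month order still has to be decided separately.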
