Issue extracting the date part from various formats in unstructured text
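I am trying to extract only the date part from a pile of unstructured text.
The problem is that the date can appear in any of the following formats:
Sample text: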
x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
What I have been trying is one of the other options (taken from an example in this answer):
gsub(".*[(]|[)].*", "", string)
Is there a more general approach?
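First, without knowing the date format, for an instance such as 02/03/2002 you cannot tell which part is the day and which is the month. If the year can also have two digits, as in dd/mm/yy, yy/mm/dd or mm/yy/dd, you cannot say which part is the day, which is the month and which is the year.
Given all this, and given that the strings may come from a third party whose format you cannot be sure of, no solution can be guaranteed to identify the day, month or year for you.
However, all the patterns you mentioned can be recognised. The solution below gives you three groups: for every format you mentioned, groups 1, 2 and 3 hold the three parts of the date. You will then have to analyse or guess which one is the day, which is the month and which is the year. A regular expression cannot cover that.
With all of that in mind, you can try the following regular expression: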
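As a small illustration of that ambiguity, here is a sketch in base R that parses the same string under two assumed formats. The variable names are only for illustration.

```r
# The same string yields two different dates depending on the assumed format.
s <- "02/03/2002"

day_first   <- as.Date(s, format = "%d/%m/%Y")  # read as 2 March 2002
month_first <- as.Date(s, format = "%m/%d/%Y")  # read as 3 February 2002

print(day_first)
print(month_first)

# Both parses succeed, so nothing in the string itself tells you which is right.
print(identical(day_first, month_first))  # FALSE
```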
((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\.?)|(?:\d{1,2}))[\/ ,-](\d{1,2})(?:[\/ ,-]\s*(\d{4}|\d{2}))?
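To make the pattern easier to read, here is a sketch that assembles the same expression from smaller pieces and runs it on the sample text from the question. It uses base R's `gregexpr()` and `regmatches()` with `perl = TRUE` rather than stringr, and the piece names (`month_part`, `sep` and so on) are illustrative only.

```r
# Sample text from the question.
x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"

# Group 1: a month name (abbreviated or spelled out, optional dot) or 1-2 digits.
month_part <- "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?"
first  <- paste0("((?:", month_part, ")|(?:\\d{1,2}))")

# Separator between the parts: slash, space, comma or hyphen.
sep    <- "[\\/ ,-]"

# Group 2: 1-2 digits. Group 3 (optional): a 4-digit or 2-digit year.
second <- "(\\d{1,2})"
third  <- paste0("(?:", sep, "\\s*(\\d{4}|\\d{2}))?")

# (?i) makes the month names case-insensitive.
patt <- paste0("(?i)", first, sep, second, third)

# Full matches only; the capture groups are not returned by regmatches() here.
found <- regmatches(x, gregexpr(patt, x, perl = TRUE))[[1]]
print(found)
```

Splitting the pattern this way does not change what it matches; it only makes each of the three groups visible.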
Sample source:
library(stringr)
str<-"Jan. 16 bla bla bla Jan 16 2017 bla bla bla January 2, 2017 bla bla bla 02/01/2017 bla bla bla 01/02/2017 bla bla bla 01-02-17 bla bla bla jan. 16 There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"
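# str_match_all() returns a list with one matrix per input string:
# column 1 is the full match, columns 2-4 are the three capture groups.
result <- str_match_all(str, patt)
result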
Sample output:
      [,1]              [,2]      [,3] [,4]
 [1,] "Jan. 16"         "Jan."    "16" ""
 [2,] "Jan 16 2017"     "Jan"     "16" "2017"
 [3,] "January 2, 2017" "January" "2"  "2017"
 [4,] "02/01/2017"      "02"      "01" "2017"
 [5,] "01/02/2017"      "01"      "02" "2017"
 [6,] "01-02-17"        "01"      "02" "17"
 [7,] "jan. 16"         "jan."    "16" ""
 [8,] "Jan 2, 2017"     "Jan"     "2"  "2017"
 [9,] "02/01/2017"      "02"      "01" "2017"
[10,] "01/02/17"        "01"      "02" "17"
[11,] "Jan. 16"         "Jan."    "16" ""
[12,] "01-02-2017"      "01"      "02" "2017"
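The guessing step still has to happen after the match. As a rough sketch of one way to do it, the hypothetical helper below (`parts_to_date()`, not part of stringr or of the answer above) turns the three groups into `Date` values under an assumed field order. It assumes that a month name always comes first, that two-digit years belong to 2000-2099, and that a missing year (empty string or `NA`) gives `NA`. None of these assumptions can be derived from the text itself.

```r
# Hypothetical helper: combine the three captured groups into Date values
# under an ASSUMED order (month first unless day_first = TRUE).
parts_to_date <- function(first, second, year, day_first = FALSE) {
  # Month names: compare the first three letters with month.abb.
  mon_idx <- match(tolower(substr(first, 1, 3)), tolower(month.abb))
  named   <- !is.na(mon_idx)

  a <- suppressWarnings(as.integer(first))  # NA where group 1 is a month name
  b <- as.integer(second)

  month <- ifelse(named, mon_idx, if (day_first) b else a)
  day   <- ifelse(named, b,       if (day_first) a else b)

  # Assumption: two-digit years mean 20xx; "" or NA stays NA.
  y <- suppressWarnings(as.integer(year))
  y <- ifelse(!is.na(y) & y < 100, y + 2000L, y)

  as.Date(ISOdate(y, month, day))
}

# A few rows copied by hand from the sample output (groups 1 to 3).
first  <- c("Jan.", "January", "02", "01")
second <- c("16",   "2",       "01", "02")
year   <- c("",     "2017",    "2017", "17")

# Month-first reading: NA, 2017-01-02, 2017-02-01, 2017-01-02
res_month_first <- parts_to_date(first, second, year)
print(res_month_first)

# Day-first reading: the two numeric rows swap day and month.
res_day_first <- parts_to_date(first, second, year, day_first = TRUE)
print(res_day_first)
```

With stringr the same vectors would come from columns 2 to 4 of `result[[1]]`. Impossible combinations, such as a month of 13, come back as `NA`, which is one signal that the assumed order may be wrong for that row.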