
Extract a pattern with dot substring from a string in R

I have a set of character strings like this:

data <- c("ABS Spring Meeting 5.14.15", "DEFG Sellors Tour 10.28.14", "DDCC Fun at the Museum 4.4.15", "GAME CS vs. Washington 11.01.14", "BSS Studio 54 5.13.15","Pas-12 3.5.15")

As you can see, the last group of digits in each string is the date of the event. I want to convert these into dates:

date <- c("2015-05-14","2014-10-28","2015-04-04","2014-11-01","2015-05-13","2015-03-05")

I feel like I have to extract this kind of pattern ("5.14.15", "10.28.14", "4.4.15", "11.01.14", "5.13.15", "3.5.15") as a substring, then do the date conversion.

Can anyone help me with this? Thank you!

In base R, provided the date is always at the end of the string, you can use

as.Date(sub(".*\\s", "", data), "%m.%d.%y")
# [1] "2015-05-14" "2014-10-28" "2015-04-04" "2014-11-01"
# [5] "2015-05-13" "2015-03-05"

Here, the regular expression is simply

  • .* matches everything
  • \\s matches a whitespace character

Because .* is greedy, this removes everything up to and including the final whitespace character.
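To see the two steps separately, here is a minimal base R sketch, assuming the `data` vector from the question. The names `date_text` and `dates` are introduced only for illustration.

```r
# Sample input copied from the question.
data <- c("ABS Spring Meeting 5.14.15", "DEFG Sellors Tour 10.28.14",
          "DDCC Fun at the Museum 4.4.15", "GAME CS vs. Washington 11.01.14",
          "BSS Studio 54 5.13.15", "Pas-12 3.5.15")

# Step 1: drop everything up to and including the last whitespace character.
date_text <- sub(".*\\s", "", data)

# Step 2: parse month.day.two-digit-year text into Date objects.
dates <- as.Date(date_text, format = "%m.%d.%y")

print(date_text)
print(format(dates))
```

If a string ends with trailing whitespace, step 1 would leave an empty string and `as.Date` would return `NA`, so it may be worth passing the input through `trimws()` first.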

The quickest way is with lubridate. If you supply the general order of the date components (here month, day, year), it will try to figure out the rest for you:

library(lubridate)
mdy(data)
[1] "2015-05-14 UTC" "2014-10-28 UTC" "2015-04-04 UTC" "2014-11-01 UTC"

If your data becomes more complicated, with other numbers in the strings, you can use a string extraction method, like so:

mdy(sub(".*?([0-9.]+)$","\\1", data))

In the pattern ".*?([0-9.]+)$":

  • .*? matches any characters, spaces included. The question mark makes the match lazy, so it takes as few characters as possible and lets the next part of the pattern match fully.

  • ([0-9.]+)$ matches the longest stretch of digits and dots that reaches the end of the string, which is marked by the dollar sign. The parentheses create a capture group from the tokens inside them. We will use that group in the next step.

  • "\\1" is the replacement: it returns the capture group from the pattern and discards the rest of the match.
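The effect of the lazy question mark is easiest to see by comparing it with a greedy prefix. This is only a sketch in base R: the variable names are made up for illustration, and `perl = TRUE` is my own addition to pin the regex engine (the answer above uses the default).

```r
# One of the strings from the question, with a stray number before the date.
x <- "BSS Studio 54 5.13.15"

# Greedy prefix: ".*" grabs as much as it can, so the capture group is
# left with only the final character of the string.
greedy <- sub(".*([0-9.]+)$", "\\1", x, perl = TRUE)

# Lazy prefix: ".*?" grabs as little as it can, so the capture group
# gets the whole trailing run of digits and dots.
lazy <- sub(".*?([0-9.]+)$", "\\1", x, perl = TRUE)

print(greedy)  # "5"
print(lazy)    # "5.13.15"
```

The stray "54" is skipped in both cases because it is followed by a space, so it can never be part of a run that reaches the end of the string.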

There are many websites that go much further into regular expressions than I can. Since they are used in nearly every programming language, it is well worth investing at least a few hours in studying them.

I learned a lot from this free Perl book online. Check out Ch. 5 here:

https://www.perl.org/books/beginning-perl/

This site has a sub-section focusing on R:

http://www.regular-expressions.info/rlanguage.html

data <- c("ABS Spring Meeting 5.14.15",
          "DEFG Sellors Tour 10.28.14", "DDCC Fun at the Museum 4.4.15",
          "GAME CS vs. Washington 11.01.14", "BSS Studio 54 5.13.15",
          "Pas-12 3.5.15")
library("lubridate")
library("stringr")

mdy(str_extract(data,"[0-9]+(\\.[0-9]+){2}$"))
## [1] "2015-05-14 UTC" "2014-10-28 UTC" "2015-04-04 UTC" "2014-11-01 UTC"
## [5] "2015-05-13 UTC" "2015-03-05 UTC"

The regular expression "[0-9]+(\\.[0-9]+){2}$" means "one or more numerals ( [0-9]+ ), followed by two ( {2} ) instances of (one dot ( \\. ) followed by one or more numerals [0-9]+ ), followed by the end of the string ( $ )".
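For readers who would rather not load stringr and lubridate, the same pattern can be sketched in base R. This is an assumed equivalent rather than part of the answer above: `regmatches()` with `regexpr()` stands in for `str_extract()`, and `as.Date()` stands in for `mdy()`.

```r
# Sample input copied from the question.
data <- c("ABS Spring Meeting 5.14.15", "DEFG Sellors Tour 10.28.14",
          "DDCC Fun at the Museum 4.4.15", "GAME CS vs. Washington 11.01.14",
          "BSS Studio 54 5.13.15", "Pas-12 3.5.15")

# Digits, then exactly two ".digits" groups, anchored at the end of the string.
pattern <- "[0-9]+(\\.[0-9]+){2}$"

# regexpr() locates the first match in each string; regmatches() extracts it.
date_text <- regmatches(data, regexpr(pattern, data))

# Parse month.day.two-digit-year text into Date objects.
dates <- as.Date(date_text, format = "%m.%d.%y")

print(format(dates))
```

One difference to keep in mind: `regmatches()` silently drops strings that have no match, so the result can be shorter than the input, whereas `str_extract()` returns `NA` in those positions.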
