Extracting dates from an HTML page using regular expressions in R

Question

How do I extract just the date from title="11:53 AM - 27 May 2018" using a regular expression?

FYI, this is from an HTML page. I want to extract all such matches into a list using R.

My output should be 27 May 2018.

Thanks in advance for your time :)

Answer 1

Figured it out:

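library(stringr)  # provides str_match_all
rawHTML <- paste(readLines("D:\\practicum\\CSK.html"), collapse = "\n")

# Two-digit day, a word for the month, and the literal year 2018
b <- unlist(str_match_all(rawHTML, '\\d{2} \\w+ 2018'))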
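The pattern above only matches dates in 2018, and it would also pick up any "NN word 2018" text that sits outside a title attribute. A minimal sketch of a more general variant follows, assuming stringr is installed and that every date of interest sits at the end of a title attribute, as in the question. The sample string `html` and the variable names are made up for illustration.

```r
library(stringr)

# Hypothetical sample input standing in for the page's HTML
html <- paste(
  '<a title="11:53 AM - 27 May 2018">tweet</a>',
  '<a title="9:05 PM - 3 June 2017">tweet</a>',
  sep = "\n"
)

# Capture group 1 is the date part: 1-2 digit day, month name, 4-digit year
pattern <- 'title="[^"]*?(\\d{1,2} [A-Za-z]+ \\d{4})"'

# str_match_all returns a list with one matrix per input string;
# column 1 is the full match, column 2 is capture group 1
dates <- str_match_all(html, pattern)[[1]][, 2]
print(dates)
```

Anchoring on `title="` and the closing quote keeps stray dates elsewhere in the page out of the result, at the cost of missing dates that appear in other attributes or in body text.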

Answer 2

Assuming you have the HTML code of the page as a string, the simplest way is to use a regex to find the parts of the code that look like title="11:53 AM - 27 May 2018", and then use a second regex to extract the date from the matched string. I have written some basic code; you can modify it to suit your needs.

# str holds the page's HTML as a single string
first_match <- regexpr(pattern='title\\s*=\\s*"\\d\\d:\\d\\d\\s*(AM|PM)\\s*-\\s*\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}"', str)
match_str <- regmatches(str, first_match)
date_exp <- regexpr(pattern='\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}', match_str)
date <- regmatches(match_str, date_exp)

date is your required output, and str is the HTML code as a string.
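Note that regexpr reports only the first match in each string, whereas the question asks for every date on the page. The same two-step idea can be sketched with base R's gregexpr, which returns all matches. The sample `html` string and the variable names below are invented for illustration, and the patterns are the ones from this answer, so they still assume two-digit hours and days and three-letter month abbreviations.

```r
# Hypothetical sample input standing in for the page's HTML
html <- '<a title="11:53 AM - 27 May 2018">x</a> <a title="08:10 PM - 02 Jun 2018">y</a>'

# Step 1: find every title attribute shaped like the one in the question
title_pattern <- 'title\\s*=\\s*"\\d\\d:\\d\\d\\s*(AM|PM)\\s*-\\s*\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}"'
titles <- regmatches(html, gregexpr(title_pattern, html))[[1]]

# Step 2: pull the date out of each matched attribute
date_pattern <- '\\d\\d\\s[a-zA-Z]{3}\\s\\d{4}'
dates <- regmatches(titles, regexpr(date_pattern, titles))
print(dates)
```

Here `titles` is a character vector with one element per matched attribute, so the second regexpr call runs once per element and `dates` ends up with one date per title.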

2 answers

Answer 1
0 2018-06-12 06:07:07

Answer 2
0 Accepted 2018-06-12 06:54:08
