R regular expression question

Question

I have a data frame column containing page paths:

pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html

What I want is to extract the first number that comes directly after a /, for example 123 from the first row.

To solve this, I tried the following:

 num = gsub("\\D"," ", mydata$pagePath) # replace every non-digit character with a space

 num1 = gsub("\\s+"," ",num) # collapse runs of spaces into a single space

 num2 = gsub("^\\s","",num1) # delete the leading space of the string

 my_number = gsub( " .*$", "", num2 ) # keep only the first number in the string

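I thought this was what I wanted, but I ran into trouble, especially with rows like the last one in the example: /text/other_text/text/text/some_other_txet-4157/text.html

So what I really want is to extract the first number that comes right after a /.

Any help would be very welcome.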

Answer 1

You can use the following regular expression with gsub:

"^(?:.*?/(\\d+))?.*$"

and replace with "\\1". See the regex demo.

Code:

> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123"     "15"      "25189"   "5418874" ""

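The regex optionally matches (via the (?:.*?/(\\d+))? subpattern) the part of the string from the start up to the first / that is followed by one or more digits (that is the .*?/ part), captures those digits into group 1 (with (\\d+)), and then matches the rest of the string to the end (with .*$). Because the whole string is always matched, replacing it with "\\1" leaves only the captured digits, or an empty string when no / is followed by a digit.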

Note that perl=T is required.
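To illustrate why the outer group is optional, here is a minimal sketch comparing the pattern above with a variant that drops the optional wrapper. The shortened vector s and the names with_optional and without_optional are only for illustration; sub is used instead of gsub as an assumption that one replacement per string is enough, since the pattern is anchored.

```r
# Two of the sample paths: one with a number right after "/", one without
s <- c(
  "/text/other_text/123-some_other_txet-4571/text.html",
  "/text/other_text/text/text/some_other_txet-4157/text.html"
)

# Optional group: the pattern always matches the whole string, so a row
# with no "/<digits>" is replaced by the (empty) group 1
with_optional <- sub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl = TRUE)

# No optional group: a row with no "/<digits>" does not match at all,
# so sub() returns it unchanged
without_optional <- sub("^.*?/(\\d+).*$", "\\1", s, perl = TRUE)

print(with_optional)
print(without_optional)
```

With the optional group the second element comes back as an empty string; without it the full path would be returned, which is easy to mistake for a result.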

With str_extract from stringr, your code and pattern can be shortened to:

> library(stringr)
> str_extract(s, "(?<=/)\\d+")
[1] "123"     "15"      "25189"   "5418874" NA

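str_extract extracts the first run of one or more digits that is immediately preceded by a / (the / itself is not returned as part of the match, because it sits inside a lookbehind, a zero-width assertion that does not add the text it checks to the result). Strings with no such match yield NA.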
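If stringr is not available, the same lookbehind can be used in base R. This is only a sketch, not part of the original answer: it relies on regexpr and regmatches, and the variable names m and first_num are made up for illustration.

```r
s <- c(
  "/text/other_text/123-some_other_txet-4571/text.html",
  "/text/other_text/another_txet/15-some_other_txet.html",
  "/text/other_text/25189-some_other_txet/45112-text.html",
  "/text/other_text/text/text/5418874-some_other_txet.html",
  "/text/other_text/text/text/some_other_txet-4157/text.html"
)

# regexpr() finds the first match per string; -1 marks "no match"
m <- regexpr("(?<=/)\\d+", s, perl = TRUE)

# regmatches() returns only the matched pieces, so fill them into a
# vector of NAs to keep one result per input row
first_num <- rep(NA_character_, length(s))
first_num[m != -1] <- regmatches(s, m)

print(first_num)
```

The lookbehind needs perl = TRUE here. The result can be wrapped in as.integer() if numeric values are wanted.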

Answer 2

Try this:

\/(\d+).*

Demo

Output:

MATCH 1
1.  [26-29] `123`
MATCH 2
1.  [91-93] `15`
MATCH 3
1.  [132-137]   `25189`
MATCH 4
1.  [197-204]   `5418874`
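The demo output above comes from a regex tester, not from R. One possible way to get capture group 1 of this pattern in R is sketched below, assuming base R's regexec and regmatches. The escaped \/ is written as a plain /, the backslash in \d is doubled for the R string literal, and the names hits and first_num are hypothetical.

```r
s <- c(
  "/text/other_text/123-some_other_txet-4571/text.html",
  "/text/other_text/another_txet/15-some_other_txet.html",
  "/text/other_text/25189-some_other_txet/45112-text.html",
  "/text/other_text/text/text/5418874-some_other_txet.html",
  "/text/other_text/text/text/some_other_txet-4157/text.html"
)

# regexec() records the full match and each capture group per string
hits <- regmatches(s, regexec("/(\\d+).*", s))

# Element 1 is the full match, element 2 is group 1; strings with no
# match give character(0), which is mapped to NA here
first_num <- vapply(
  hits,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)

print(first_num)
```

As in the demo, the last path produces no match because none of its slashes is followed by a digit.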

Answer 1
5 Accepted 2016-03-11 09:56:28

Answer 2
2 2016-03-11 09:50:06