
Extracting a substring in R when the pattern is not that clear

I started learning R a week ago, and to get going I have been working on extracting some information from HTML pages.

I know this is a frequent and basic question, because I have already asked it in a different context and I have read quite a few threads.

I also know which functions I could use: sub(), str_match(), etc.

I chose to use sub(), and this is what my code currently looks like:

#libraries
library('xml2')
library('rvest')
library('stringr')

#author page:
url <- paste('https://ideas.repec.org/e/',sample[4,3],'.html',sep="")
url <- gsub(" ", "", url, fixed = TRUE)
webpage <- read_html(url)

#get all published articles:
list_articles <- html_text(html_nodes(webpage,'#articles-body ol > li'))

#get titles:
titles <- html_text(html_nodes(webpage, '#articles-body b a'))

#get co-authors:
authors <- sub(".* ([A-Za-z_]+),([0-9]+).\n.*","\\1", list_articles)

Here is what one element of list_articles looks like:

" Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market Vulnerability among Young People,\"\nThe Pakistan Development Review,\nPakistan Institute of Development Economics, vol. 46(3), pages 193-213.\n"

When I try to get the co-authors, R returns the whole string instead of just the co-authors, so I am clearly specifying the pattern incorrectly, but I don't understand why.

If someone could help me out, that would be great.

Hope you have a good day, G. Gauthier

Is this helpful?

The pattern below extracts the string from the first uppercase letter up to, but not including, a comma that is followed by a space and a digit.

library(stringr)

#get co-authors:
authors <- str_extract(list_articles,"[[:upper:]].*(?=, [[:digit:]])")
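
To see why the original sub() call returned the whole string, and what the lookahead pattern picks out, here is a minimal sketch run on the sample element from the question. It uses base R (regexpr() and regmatches() with perl = TRUE) in place of str_extract(), which is a substitution on my part rather than the code above, and the final split on " & " assumes that co-authors are always separated that way.

```r
# One element of list_articles, copied from the question.
x <- " Theo Sparreboom & Lubna Shahnaz, 2007.\n\"Assessing Labour Market Vulnerability among Young People,\"\nThe Pakistan Development Review,\nPakistan Institute of Development Economics, vol. 46(3), pages 193-213.\n"

# The question's pattern requires a comma immediately followed by digits
# (",2007"), but the text has a space after the comma (", 2007").
# The pattern therefore never matches, and sub() returns its input unchanged.
original <- sub(".* ([A-Za-z_]+),([0-9]+).\n.*", "\\1", x)
identical(original, x)

# Base R counterpart of the lookahead pattern (perl = TRUE enables "(?=...)"):
# start at the first uppercase letter and stop just before ", <digit>".
m <- regexpr("[[:upper:]].*(?=, [[:digit:]])", x, perl = TRUE)
authors <- regmatches(x, m)
authors

# Assumption: individual co-authors are separated by " & ".
coauthors <- strsplit(authors, " & ", fixed = TRUE)[[1]]
coauthors
```

Note that regmatches() drops elements with no match, whereas str_extract() returns NA for them, so str_extract() is the safer choice when the result has to stay aligned with list_articles.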
