I'm trying to retrieve information from a text file that contains tags, eg:
<name> Joe </name>
The text file consists of multiple lines, some with more of these tags (e.g. for height and weight) and some with just other text. I refer to the text file as "sheet" (see code below).
I would like to retrieve the text between the tags. I have come up with the following solution to do so:
m1 <- regexpr("<name> [a-zA-Z]+ </name>", sheet)
m2 <- regmatches(sheet,m1)
m3 <- gsub("<name> ", "", gsub(" </name>", "", m2))
m3
I have not worked with regular expressions before, and I wonder whether I am taking a detour with 'regmatches'. Is there a more direct way to retrieve the text inside tags?
Thanks,
Richard
You could do this with a single gsub call. To do so, create a capture group by surrounding part of your pattern with ( and ). The group can then be referenced in the replacement by its number, \\1 (a backreference), e.g.:
sheet <- "<name>foobar</name>"
gsub(pattern="<name>([a-zA-Z]+)</name>", replacement="\\1", x=sheet)
# [1] "foobar"
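Note that gsub only substitutes where the pattern matches, so lines without a <name> tag come back unchanged, and any text around the tag on the same line is kept. If the sheet mixes tagged lines with other text, as in the question, a rough alternative is to extract the capture group with regexec and regmatches instead. The sketch below assumes a small made-up sheet vector (one element per line) and optional whitespace around the value; the sample lines and variable names are illustrative only.

```r
# Hypothetical sample input: one element per line of the text file
sheet <- c("<name> Joe </name>",
           "some other text",
           "<height> 180 </height>",
           "intro <name> Ann </name> outro")

# [[:space:]]* absorbs optional whitespace around the value;
# the parentheses capture the name itself
pattern <- "<name>[[:space:]]*([a-zA-Z]+)[[:space:]]*</name>"

# regexec() records the full match plus each capture group;
# regmatches() returns them per line (character(0) where nothing matched)
matches <- regmatches(sheet, regexec(pattern, sheet))

# Keep only lines that matched, then take element 2 (the first capture group)
hits <- matches[lengths(matches) > 0]
name_values <- vapply(hits, function(m) m[2], character(1))
name_values
# [1] "Joe" "Ann"
```

This only picks up the first <name> tag on each line; lines with several tags would need gregexpr or a proper parser.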
But as @DieterMenne suggests, you should try the XML package for HTML (it supports XPath):
library("XML")
doc <- xmlParse("<html><name>foobar</name></html>")
xpathSApply(doc, "//name", xmlValue)
# [1] "foobar"