I have the following string:
string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
I would like to extract the string between the two <body> tags. The result I am looking for is:
substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"
Note that the substring between the two <body> tags can contain letters, numbers, punctuation and special characters.
Is there an easy way of doing this? Thank you!
Here is the regular expression approach:
regmatches(string, regexpr('<body>.+<body>', string))
regex = '<body>.+?<body>'
You want the non-greedy quantifier (.+?), so that the match stops at the next <body> tag instead of spanning as many <body> tags as possible.
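The difference only shows up when the input has more than two <body> markers. Here is a minimal sketch, assuming a made-up string s with three markers (it is not the string from the question) and perl = TRUE so that the lazy quantifier is handled by PCRE:

```r
# Hypothetical input with three <body> markers, chosen to make the difference visible
s <- "aa<body>first<body>middle<body>zz"

# Greedy: .+ consumes as much as it can, so the match runs to the LAST <body>
greedy <- regmatches(s, regexpr("<body>.+<body>", s))

# Non-greedy: .+? consumes as little as it can, so the match stops at the NEXT <body>
lazy <- regmatches(s, regexpr("<body>.+?<body>", s, perl = TRUE))

print(greedy)
print(lazy)
```

With exactly two markers, as in the question, both patterns should return the same substring.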
If you're using only a regex with no auxiliary functions, you'll need a capturing group to extract what is required, i.e.:
regex = '(<body>.+?<body>)'
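In R, one way to read capturing groups back is regexec() combined with regmatches(). The sketch below is only an illustration: the second, inner group and the variable names m, with_tags and inner are additions that are not part of the answer above, and the backslash in the string is escaped as \\ so that it stays a literal backslash.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# Outer group: the text including both tags. Inner group: the text between them.
m <- regmatches(string, regexec("(<body>(.+?)<body>)", string, perl = TRUE))[[1]]

# m[1] is the full match, m[2] the outer group, m[3] the inner group
with_tags <- m[2]
inner     <- m[3]

cat(with_tags, "\n")
cat(inner, "\n")
```

Keeping the inner group makes it possible to choose between the tagged and untagged result without running a second pattern.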
strsplit() should help you:
> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk" "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"
> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"
Of course, this gives you all three parts of the string and does not include the tag.
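If the tags are needed in the result, they can be pasted back around the middle piece. This is a minimal sketch, assuming the input contains exactly two <body> markers; fixed = TRUE is used here so that the separator is treated as a literal string rather than a regex, and the names parts, inner and with_tags are illustrative only.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# Split on the literal marker; with two markers this yields three pieces
parts <- strsplit(string, "<body>", fixed = TRUE)[[1]]

# The second piece is the text between the two markers
inner <- parts[2]

# Re-attach the markers to get the form asked for in the question
with_tags <- paste0("<body>", inner, "<body>")

cat(with_tags, "\n")
```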
I believe that Matthew's and Steve's answers are both acceptable. Here is another solution:
string = "asflkjsdhlkjsdhglk<body>Iwant\\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
regmatches(string, regexpr('<body>.+<body>', string))
# "\\1" in the replacement refers to the first capturing group
output = sub(".*(<body>.+<body>).*", "\\1", string)
print(output)