My data frame looks like this:
df = data.frame(Entity=c('MM > OSS > EUROPE_lv3 > FRANCElv4 > FRANCElv5 > FRANCElv6 > FR_07_FRANCE > FR_08_FRANCE > FR_09_S50 > FR_10_DVPOL12 > FR_11_DRYPBA > FR_12_RYPOP9000 > FR_13_SX362707 > SO362707',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00556 > 60-59003 > 27747001lv13 > 27747003lv14',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00061 > 60-23617 > 76929001lv13 > 76929017lv14',
'MM > OSS > EUROPE_lv3 > UKIRELAND_13lv4 > UKIRELAND_13lv5 > UKIRELAND_13lv6 > UKIE160000 > UKIE160000_lv8 > UKIE160000_lv9 > UKIE262000 > UKIE362004 > UKIE462006 > UKIE562072 > GB344496',
'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00065 > 60-22505 > 94276001lv13 > 94276002lv14'))
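My objective is to extract two things from each Entity string:
1) the text between the second and third instances of >
2) everything after the last instance of >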
My Attempt
To extract everything after the last instance of >, I tried this:
sub("^.+< ", "", df$Entity)
However it does not work as expected.
Any help on tackling points 1) & 2) would be appreciated.
We can try using sub as follows for the last column:
df$last <- sub("^.*>\\s*", "", df$Entity)
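As for why the attempt in the question appears to do nothing: its pattern looks for <, a character that never occurs in the data, and sub returns its input unchanged when the pattern does not match. A minimal sketch on a shortened, made-up path (x is a hypothetical value for illustration, not a row of df):

```r
# Hypothetical shortened path, for illustration only
x <- "MM > OSS > EUROPE_lv3"

# Pattern from the question: "<" is not in x, so nothing matches
unchanged <- sub("^.+< ", "", x)

# Same pattern with ">": the greedy .+ runs up to the LAST "> "
fixed <- sub("^.+> ", "", x)

print(unchanged)
print(fixed)
```

Because .+ is greedy, swapping the character is enough to reach the last separator; the \\s* variant above additionally tolerates a missing or repeated space after >.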
For the column between the second and third instances of >:
df$between <- sub("^(?:[^>]+>){2}\\s*([^> ]+).*$", "\\1", df$Entity)
df[, c("last", "between")]
          last     between
1     SO362707  EUROPE_lv3
2 27747003lv14 AMERICA_lv3
3 76929017lv14 AMERICA_lv3
4     GB344496  EUROPE_lv3
5 94276002lv14 AMERICA_lv3
Here is an explanation of the second regex:
^                 from the start of the input
(?:[^>]+>){2}     match the first two components, each of the form 'COMPONENT >'
\s*               match optional whitespace
([^> ]+)          then match AND capture the third component
.*                consume the rest of the input until reaching
$                 the end of the input
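One behaviour worth keeping in mind with this approach: when a string has fewer than two > separators, the pattern cannot match and sub hands the original string back rather than NA. A small sketch with made-up inputs (paths is a hypothetical vector, not the question's data):

```r
# Hypothetical inputs: one path with four components, one with only two
paths <- c("A > B > C > D", "A > B")

# Same regex as above: skip two 'COMPONENT >' groups, capture the third
between <- sub("^(?:[^>]+>){2}\\s*([^> ]+).*$", "\\1", paths)

print(between)
```

The second element comes back as the full input, so short strings may need a separate check (for example with grepl) if they can occur in the real data.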
You can always strsplit on ' > ' and extract the elements you want to keep.
Limitations: it uses more memory, and the fixed index for the last element (14 here) assumes every string contains the same number of ' > ' separators.
data.table:
library(data.table)
setDT(df)
df[, tstrsplit(Entity, ' > ')][, .(two2three = V3, last = V14)]
#      two2three         last
# 1:  EUROPE_lv3     SO362707
# 2: AMERICA_lv3 27747003lv14
# 3: AMERICA_lv3 76929017lv14
# 4:  EUROPE_lv3     GB344496
# 5: AMERICA_lv3 94276002lv14
Base:
df$Entity <- as.character(df$Entity)
setNames(
as.data.frame(
do.call(rbind, lapply(strsplit(df$Entity, ' > '), '[', c(3, 14)))
), c('two2three', 'last'))
#     two2three         last
# 1  EUROPE_lv3     SO362707
# 2 AMERICA_lv3 27747003lv14
# 3 AMERICA_lv3 76929017lv14
# 4  EUROPE_lv3     GB344496
# 5 AMERICA_lv3 94276002lv14
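If the paths can have different depths, the hard-coded index 14 no longer points at the last element. A minimal base R sketch that takes the last element by length instead; the entity vector and the res name below are made up for illustration and are not the question's data:

```r
# Hypothetical paths of differing depth, for illustration only
entity <- c("MM > OSS > EUROPE_lv3 > FRANCElv4 > SO362707",
            "MM > OSS > AMERICA_lv3 > 27747003lv14",
            "MM > OSS")

# fixed = TRUE treats ' > ' as a literal string rather than a regex
parts <- strsplit(entity, " > ", fixed = TRUE)

res <- data.frame(
  # p[3] is NA when a path has fewer than three components
  two2three = vapply(parts, function(p) p[3], character(1)),
  # index by length so the last component is found at any depth
  last = vapply(parts, function(p) p[length(p)], character(1)),
  stringsAsFactors = FALSE
)

print(res)
```

Unlike the regex version, a path that is too short yields NA in two2three rather than the original string.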