
Extract substring based on position of character in string

My dataframe looks like this -

df = data.frame(Entity=c('MM > OSS > EUROPE_lv3 > FRANCElv4 > FRANCElv5 > FRANCElv6 > FR_07_FRANCE > FR_08_FRANCE > FR_09_S50 > FR_10_DVPOL12 > FR_11_DRYPBA > FR_12_RYPOP9000 > FR_13_SX362707 > SO362707',
                         'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00556 > 60-59003 > 27747001lv13 > 27747003lv14',
                         'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00061 > 60-23617 > 76929001lv13 > 76929017lv14',
                         'MM > OSS > EUROPE_lv3 > UKIRELAND_13lv4 > UKIRELAND_13lv5 > UKIRELAND_13lv6 > UKIE160000 > UKIE160000_lv8 > UKIE160000_lv9 > UKIE262000 > UKIE362004 > UKIE462006 > UKIE562072 > GB344496',
                         'MM > OSS > AMERICA_lv3 > AMERICA_11lv4 > AMERICA_11lv5 > INC_11 > INCEDUSCHOOLFACIL_11 > INCEDUC > 30-00002 > 40-00056 > 50-00065 > 60-22505 > 94276001lv13 > 94276002lv14'))

My Objective is -

  1. To extract everything after the last instance of > and store it in a separate column.
  2. To extract everything between the second and third instance of > and store it in a separate column.

My Attempt

To extract everything after the last instance of >, I tried this -

sub("^.+< ", "", df$Entity)

However it does not work as expected.

Any help on tackling points 1) & 2) would be appreciated.

For the last component, we can use sub with a greedy match that removes everything up to and including the final > :

df$last <- sub("^.*>\\s*", "", df$Entity)
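As a quick check of why the attempt in the question returns the strings unchanged: its pattern looks for <, a character that never occurs in the data, and sub returns its input untouched when nothing matches. The sketch below uses a shortened, made-up string x rather than the real data frame.

```r
# Shortened example string (illustrative, not taken from the data frame)
x <- "MM > OSS > EUROPE_lv3 > SO362707"

# Pattern from the question: x contains no "<", so nothing is replaced
attempt <- sub("^.+< ", "", x)

# Greedy ".*" runs up to the last ">", so only the final component remains
last <- sub("^.*>\\s*", "", x)

print(attempt)
print(last)
```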

For the column in between the second and third instance of > :

df$between <- sub("^(?:[^>]+>){2}\\s*([^> ]+).*$", "\\1", df$Entity)

df[, c("last", "between")]

          last     between
1     SO362707  EUROPE_lv3
2 27747003lv14 AMERICA_lv3
3 76929017lv14 AMERICA_lv3
4     GB344496  EUROPE_lv3
5 94276002lv14 AMERICA_lv3

Here is an explanation of the second regex:

^                  from the start of the input
    (?:[^>]+>){2}  match the first two components 'COMPONENT >'
    \s*            match optional whitespace
    ([^> ]+)       then match AND capture the third component
    .*             consume the rest of the input until reaching
$                  the end of the input
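The same pattern can be parameterised to pick out any component, not just the third. This is only a sketch: nth_component is a hypothetical helper name, the input strings are shortened for illustration, and perl = TRUE is a choice made here rather than something the answer above uses. Because the capture group is [^> ]+, a component containing spaces would be cut at the first space, and a string with too few components is returned unchanged since sub finds no match.

```r
# Hypothetical helper: return the n-th ' > '-separated component (n >= 2)
nth_component <- function(x, n) {
  stopifnot(n >= 2)
  # Skip n - 1 'COMPONENT >' chunks, then capture the next component
  pattern <- sprintf("^(?:[^>]+>){%d}\\s*([^> ]+).*$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

# Shortened example strings (illustrative only)
x <- c("MM > OSS > EUROPE_lv3 > FRANCElv4",
       "MM > OSS > AMERICA_lv3 > INC_11")
third <- nth_component(x, 3)
print(third)
```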

You can always strsplit on ' > ' and extract the elements you want to keep.

Limitations: this uses more memory, and the hard-coded index 14 for the last element assumes every string contains the same number of ' > ' separators.

data.table:

library(data.table)
setDT(df)

df[, tstrsplit(Entity, ' > ')][, .(two2three = V3, last = V14)]
#      two2three         last
# 1:  EUROPE_lv3     SO362707
# 2: AMERICA_lv3 27747003lv14
# 3: AMERICA_lv3 76929017lv14
# 4:  EUROPE_lv3     GB344496
# 5: AMERICA_lv3 94276002lv14

Base:

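# strsplit needs character input; Entity may be a factor in older R versions
df$Entity <- as.character(df$Entity)
# Split each string on ' > ', keep elements 3 and 14, and bind the rows together
setNames(
  as.data.frame(
    do.call(rbind, lapply(strsplit(df$Entity, ' > '), '[', c(3, 14)))
  ), c('two2three', 'last'))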

#     two2three         last
# 1  EUROPE_lv3     SO362707
# 2 AMERICA_lv3 27747003lv14
# 3 AMERICA_lv3 76929017lv14
# 4  EUROPE_lv3     GB344496
# 5 AMERICA_lv3 94276002lv14
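
If the number of ' > ' separators differs between rows, the hard-coded index 14 no longer points at the last element. One possible way around this, sketched here in base R on made-up strings (entity, parts and result are illustrative names), is to take the last element by its length instead of a fixed position:

```r
# Made-up strings with different numbers of components (illustrative only)
entity <- c("A > B > C > D", "A > B > C > D > E > F", "A > B")

# fixed = TRUE treats ' > ' as a literal separator rather than a regex
parts <- strsplit(entity, " > ", fixed = TRUE)

# Third component; indexing past the end of a vector yields NA
two2three <- vapply(parts, function(p) p[3], character(1))

# Last component, whatever the length of each split
last <- vapply(parts, function(p) p[length(p)], character(1))

result <- data.frame(two2three = two2three, last = last,
                     stringsAsFactors = FALSE)
print(result)
```

Rows with fewer than three components get NA in two2three rather than raising an error.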
