
Extracting unique partial elements from vector

I need to get a list of the unique subject IDs (the part before _ and after /) from the folder contents below.

[1] "."                      "./4101_0"               "./4101_0/4101 Baseline"
[4] "./4101_1"               "./4101_2"               "./4101_2_2"            
[7] "./4101_3"               "./4101_4"               "./4101_5"              
[10] "./4101_6"    

Right now I'm doing this (using the packages stringr and foreach).

library(stringr)
library(foreach)

# Create list of contents
Folder.list <- list.dirs()
# Split entries by the "/"
SubIDs <- str_split(Folder.list, "/")
# For each entry in the list, retrieve the second element (the folder name)
SubIDs <- unlist(foreach(i=1:length(SubIDs)) %do% SubIDs[[i]][2])
# Split entries by the "_"
SubIDs <- str_split(SubIDs, "_")
# Take the first element after splitting, unlist it, find the unique entries, remove the NA and coerce to numeric
SubIDs <- as.numeric(na.omit(unique(unlist(foreach(i=1:length(SubIDs)) %do% SubIDs[[i]][1]))))

This does the job but seems unnecessarily horrible. What's a cleaner way of getting from point A to point B?

Use a regular expression.

x <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_1", "./4101_2", "./4101_2_2", "./4101_3", "./4101_4", "./4101_5", "./4101_6")

One way of using a regular expression is to use gsub() to extract the subject code. Note that elements that don't match the pattern (here ".") are returned unchanged:

gsub(".*/(\\d+)_.*", "\\1", x)
[1] "."    "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101"
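To get from that output to the unique numeric IDs the question asks for, the non-matching entries still have to be dropped. A minimal base R sketch, assuming every subject folder sits directly under "./" and starts with digits followed by "_" (the names pattern, has_id and sub_ids are mine, not from the original posts):

```r
x <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_1", "./4101_2",
       "./4101_2_2", "./4101_3", "./4101_4", "./4101_5", "./4101_6")

# Assumed layout: "./<digits>_<anything>"
pattern <- "^\\./(\\d+)_.*"

# Keep only the entries that actually contain a subject ID
has_id <- grepl(pattern, x)

# Replace each matching entry by its captured digits, then deduplicate
sub_ids <- unique(as.numeric(sub(pattern, "\\1", x[has_id])))
sub_ids
# [1] 4101
```

Filtering with grepl() first avoids the warning that as.numeric(".") would otherwise raise.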

stringr also has the str_extract function, which can be used to extract substrings that match a regex pattern. With a positive lookbehind for / and a positive lookahead for _, you can achieve your aim.

Beginning with @Andrie's x:

str_extract(x, perl('(?<=/)\\d+(?=_)'))

# [1] NA     "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101" "4101"

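The pattern above matches one or more digits (i.e. \\d+) that are preceded by a forward slash and followed by an underscore. Wrapping the pattern in perl() was required for the lookarounds in the stringr version used here; as far as I know, stringr 1.0.0 and later use ICU regular expressions, which support lookarounds directly, and deprecate perl(), so check your installed version. Elements without a match come back as NA, so you still need to drop those and call unique() to finish the job.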
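If you'd rather not depend on stringr at all, the same lookaround pattern should work in base R with perl = TRUE. This is only a sketch: the folder vector below is a made-up example with a second subject (4102) added so that unique() has something to do, and the names folders, m and ids are hypothetical.

```r
# Hypothetical folder listing with two subjects
folders <- c(".", "./4101_0", "./4101_0/4101 Baseline", "./4101_2_2",
             "./4102_1", "./4102_1/4102 Baseline")

# Locate the first run of digits preceded by "/" and followed by "_"
m <- regexpr("(?<=/)\\d+(?=_)", folders, perl = TRUE)

# regmatches() returns only the elements that matched, so "." is dropped
ids <- unique(as.numeric(regmatches(folders, m)))
ids
# [1] 4101 4102
```

Because regmatches() silently skips non-matching elements, there is no NA to clean up afterwards.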

