
How to extract the last text after forward slash

I have a df that looks like this:

AF GT Sample_name
0.001 1/1 path/to/sample/name/ID0001.vcf.gz
0.005 0/1 path/to/sample/name/ID0002.vcf.gz

What I want is to only keep the ID name in the Sample_name column:

AF GT Sample_name
0.001 1/1 ID0001
0.005 0/1 ID0002

I would very much appreciate any help in achieving this.

There are some built-in file name helpers that you can use here:

  • basename()
  • tools::file_path_sans_ext()

So in this example simply do:

library(tools)

df$Sample_name <- file_path_sans_ext(basename(df$Sample_name), compression = TRUE)
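To see what each helper contributes, here is a minimal sketch on a single path taken from the question. The variable names `p`, `file_name` and `id` are only illustrative.

```r
# A single path from the question, used to inspect each step.
p <- "path/to/sample/name/ID0001.vcf.gz"

# basename() drops the directory part and keeps the file name.
file_name <- basename(p)
file_name
#> [1] "ID0001.vcf.gz"

# With compression = TRUE, file_path_sans_ext() first strips a trailing
# .gz, .bz2 or .xz, and then the remaining extension (.vcf here).
id <- tools::file_path_sans_ext(file_name, compression = TRUE)
id
#> [1] "ID0001"
```

Without `compression = TRUE` only the last extension is removed, so the result would be "ID0001.vcf".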

You can use a regex pattern with gsub():

gsub(".*(ID\\d*).*", replacement = "\\1", x = "path/to/sample/name/ID0001.vcf.gz")
#> "ID0001"

Across your dataframe:

df$sample_name2 <- gsub(".*(ID\\d*).*", replacement = "\\1", x = df$Sample_name)

Here is a tidyverse solution. Note that this only works if your ID string always has the form ID followed by 4 digits:

library(dplyr)
library(stringr)

df %>% 
  mutate(Sample_name = str_extract(Sample_name, 'ID\\d{4}'))

Output:

     AF  GT Sample_name
1 0.001 1/1      ID0001
2 0.005 0/1      ID0002
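If the number of digits is not fixed, one option is to match ID followed by one or more digits instead (with stringr that would be the pattern 'ID\\d+'). The sketch below shows the same idea in base R with `regexpr()` and `regmatches()`. The second path and the variable names are made up for illustration.

```r
# Two example paths; the second one has a five-digit ID (hypothetical).
paths <- c("path/to/sample/name/ID0001.vcf.gz",
           "path/to/sample/name/ID12345.vcf.gz")

# regexpr() locates the first match of "ID" followed by one or more digits
# in each string; regmatches() then extracts the matched text.
ids <- regmatches(paths, regexpr("ID[0-9]+", paths))

# ids now holds "ID0001" and "ID12345"
ids
```

Note that `regmatches()` drops elements that have no match, so the result only lines up with the rows of the data frame if every path contains an ID. `str_extract()` returns NA for those rows instead.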

Using sub() with basename() to keep the file name and strip everything from the first dot onward:

df$Sample_name <- sub('\\..*$', '', basename(df$Sample_name))
df

Output:

     AF  GT Sample_name
1 0.001 1/1      ID0001
2 0.005 0/1      ID0002
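The sub() pattern and file_path_sans_ext() give the same result for the paths in the question, but they can differ when the sample name itself contains a dot. The following sketch compares them on a hypothetical file name, ID0001.v2.vcf.gz, which does not appear in the question.

```r
# Hypothetical file name with an extra dot inside the sample name.
f <- basename("path/to/sample/name/ID0001.v2.vcf.gz")

# sub() with '\\..*$' removes everything from the FIRST dot onward.
by_sub <- sub('\\..*$', '', f)
by_sub
#> [1] "ID0001"

# file_path_sans_ext() removes the compression suffix and then only
# the LAST remaining extension, so ".v2" is kept.
by_tools <- tools::file_path_sans_ext(f, compression = TRUE)
by_tools
#> [1] "ID0001.v2"
```

Which behaviour is preferable depends on how the files are named, so it is worth checking a few real paths before choosing.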

Data

df <- data.frame(AF = c(0.001, 0.005),
                 GT = c("1/1", "0/1"),
                 Sample_name = c("path/to/sample/name/ID0001.vcf.gz", "path/to/sample/name/ID0002.vcf.gz"))

