
Looping through a column in R and extracting characters

I have a data frame in which one column holds the protein IDs along with a bunch of nonsensical stuff, like the example below. The ID that I want is always the 4th through 9th character, so I want to loop through the column, extract these characters, and export them to another CSV file. The column is also full of NAs, which I don't want. I'm struggling to come up with a loop in R that will slice out the exact characters I want every time, do nothing when the value is NA, and stop when it finds a blank, since that would be the end of the list.

mock example of column

Prot Id's
sp|IDIDID|PSKSJ_45HELI^sp|IDIDID|FRUEHFJ^HSLHFHG#%$^9y7hiuahl
sp|IDIDID|PSKSJ_45HELI^spuegfuehfw3|IDIDID|FRUEHFJ^HDGFLFHEHFN
NA
NA
sp|IDIDID|PSKSJ_45HELIWUEU^#H63hHU6e^sp|IDIDID|FRUEHFJ^HFGHG:WHFUWH^hfue
NA
sp|IDIDID|PSKSJ_45HELI^spJFBEFBUEBFE|IDIDID|FRUEHFJ^
NA
NA

The part that says IDIDID is what I want to get. Any help would be greatly appreciated.

Use the substr function to extract the range that you want:

x = c("sp|456879|sequence1","sp|121212|sequence2",NA)
d = data.frame(Prot_Id = x)
# Index the column itself so this also works when d has other columns
substr(d$Prot_Id[!is.na(d$Prot_Id)], 4, 9)

Output:

[1] "456879" "121212"
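
The question also asks about skipping blanks and exporting the result to a CSV file. A minimal sketch of that follows, assuming the column is called Prot_Id as above. The sample values, the out_file name, and the temporary output path are placeholders for illustration, not part of the original data. No explicit loop should be needed, because substr() is vectorised and returns NA for NA input.

```r
# Hypothetical data frame mirroring the mock column in the question.
d <- data.frame(
  Prot_Id = c("sp|IDIDID|PSKSJ_45HELI^sp|IDIDID|FRUEHFJ", NA,
              "sp|ABCDEF|PSKSJ_45HELIWUEU", NA, ""),
  stringsAsFactors = FALSE
)

# Take characters 4 to 9 of every entry in one call.
# NA stays NA and an empty string stays empty.
ids <- substr(d$Prot_Id, 4, 9)

# Keep only real ids: drop the NA entries and the blanks.
ids <- ids[!is.na(ids) & nzchar(ids)]

# Write the ids to a CSV file (assumed path; change it as needed).
out_file <- file.path(tempdir(), "protein_ids.csv")
write.csv(data.frame(Prot_Id = ids), out_file, row.names = FALSE)
```

Filtering after the extraction rather than before keeps the code to a single pass over the column, and the same filter handles both the NAs and the trailing blank rows.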
