I have a data frame in which one column contains protein IDs along with a lot of extraneous text, as in the mock example below. The ID I want is always the 4th through 9th character, so I want to loop through the column, extract these characters, and export them to another CSV file. The column also contains many NAs, which I don't want. I'm struggling to come up with a loop in R that slices out exactly the characters I want every time, does nothing when it hits an NA, and stops when it finds a blank, since that would be the end of the list.
Mock example of the column:
Prot Id's
sp|IDIDID|PSKSJ_45HELI^sp|IDIDID|FRUEHFJ^HSLHFHG#%$^9y7hiuahl
sp|IDIDID|PSKSJ_45HELI^spuegfuehfw3|IDIDID|FRUEHFJ^HDGFLFHEHFN
NA
NA
sp|IDIDID|PSKSJ_45HELIWUEU^#H63hHU6e^sp|IDIDID|FRUEHFJ^HFGHG:WHFUWH^hfue
NA
sp|IDIDID|PSKSJ_45HELI^spJFBEFBUEBFE|IDIDID|FRUEHFJ^
NA
NA
The part that says IDIDID is what I want to get. Any help would be greatly appreciated.
Use the substr function to extract the range that you want:
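# Example data: two IDs and one missing value
x <- c("sp|456879|sequence1", "sp|121212|sequence2", NA)
d <- data.frame(Prot_Id = x)
# Drop the NAs, then take characters 4 through 9 of each remaining string
substr(d$Prot_Id[!is.na(d$Prot_Id)], 4, 9)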
Output:
[1] "456879" "121212"
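The question also asks about blank entries and about exporting the result to a CSV file. A minimal sketch of one way to do that follows. It assumes that blanks are empty strings and that it is acceptable to drop them rather than stop at the first one. The example data and the output path out_file are placeholders, not part of the original question.

```r
# Hypothetical data in the same layout as the question's column,
# including an NA and a blank entry.
d <- data.frame(
  Prot_Id = c("sp|456879|sequence1", NA, "sp|121212|sequence2", ""),
  stringsAsFactors = FALSE
)

# Keep only entries that are neither NA nor empty strings.
keep <- !is.na(d$Prot_Id) & nzchar(d$Prot_Id)

# substr() is vectorised, so no explicit loop is needed.
ids <- substr(d$Prot_Id[keep], 4, 9)

# Write the extracted IDs to a CSV file (placeholder path in tempdir()).
out_file <- file.path(tempdir(), "protein_ids.csv")
write.csv(data.frame(Prot_Id = ids), out_file, row.names = FALSE)
```

Because substr works on the whole vector at once, filtering first and extracting second replaces the loop described in the question. If the IDs could contain leading zeros, reading the file back with colClasses = "character" keeps them as text.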