Extracting specific characters from a specific column in an R data frame
I am trying to extract gene names from a column named "Proteins" in a data frame in R. My data frame looks like this.
Scan Proteins
1 7:: [sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3 ||| sp|TRFE_HUMAN| ||| tr|B4DHZ6|B4DHZ6_HUMAN Transferrin, isoform CRA_c OS=Homo sapiens GN=TF PE=2 SV=1]
2 21:: [sp|P01876|IGHA1_HUMAN Ig alpha-1 chain C region OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| sp|P01877|IGHA2_HUMAN Ig alpha-2 chain C region OS=Homo sapiens GN=IGHA2 PE=1 SV=3]
3 2:: [sp|P14543|NID1_HUMAN Nidogen-1 OS=Homo sapiens GN=NID1 PE=1 SV=3 ||| tr|B4DM05|B4DM05_HUMAN cDNA FLJ51241, highly similar to Nidogen-1 OS=Homo sapiens PE=2 SV=1]
I want to get only the first gene name in each row (e.g., TF for Scan 1, IGHA1 for Scan 2). How can I do this in R?
Any comment is helpful. Thanks.
A straightforward regex does this:
dat <- data.frame(Scan=1:3, Proteins=c("7:: [sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3 ||| sp|TRFE_HUMAN| ||| tr|B4DHZ6|B4DHZ6_HUMAN Transferrin, isoform CRA_c OS=Homo sapiens GN=TF PE=2 SV=1]", "21:: [sp|P01876|IGHA1_HUMAN Ig alpha-1 chain C region OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| sp|P01877|IGHA2_HUMAN Ig alpha-2 chain C region OS=Homo sapiens GN=IGHA2 PE=1 SV=3]", "2:: [sp|P14543|NID1_HUMAN Nidogen-1 OS=Homo sapiens GN=NID1 PE=1 SV=3 ||| tr|B4DM05|B4DM05_HUMAN cDNA FLJ51241, highly similar to Nidogen-1 OS=Homo sapiens PE=2 SV=1]"))
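# The lazy .*? stops at the first GN=; a greedy .* would capture the last one (IGHA2 for row 2).
sub("^.*?GN=([^ ]+).*", "\\1", dat$Proteins, perl = TRUE)
# [1] "TF"    "IGHA1" "NID1"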
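One caveat with substitution-based extraction: sub() and gsub() return the input unchanged when the pattern does not match, so a row with no GN= field would come back as the full protein string. A minimal sketch of a guard follows; the helper name first_gene() and the shortened example strings are assumptions for illustration, not part of the answer above.

```r
# Hypothetical helper (name is illustrative): return the first GN= value
# in each string, or NA when the string has no GN= field.
first_gene <- function(x) {
  x <- as.character(x)                      # also handles factor columns
  out <- rep(NA_character_, length(x))
  hit <- grepl("GN=", x, fixed = TRUE)      # rows that contain a GN= field
  # Lazy .*? stops at the first GN=; [^ ]+ captures up to the next space.
  out[hit] <- sub("^.*?GN=([^ ]+).*$", "\\1", x[hit], perl = TRUE)
  out
}

# Shortened, made-up strings in the same layout as the Proteins column.
res <- first_gene(c("A OS=Homo sapiens GN=IGHA1 PE=1 ||| B GN=IGHA2 PE=1",
                    "cDNA FLJ51241 OS=Homo sapiens PE=2 SV=1"))
res
```

With the real data this could be used as dat$Gene <- first_gene(dat$Proteins).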
You can use string extraction powered by regular expressions.
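library(tidyverse)
# `data` stands for your data frame; the column is named Proteins in the question.
data %>%
  mutate(Gene = str_extract(Proteins, pattern = "GN=[a-zA-Z0-9]*")) %>%  # first "GN=..." token
  mutate(Gene = str_remove(Gene, pattern = "^GN="))                      # drop the "GN=" prefix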
Here is a breakdown of the regular expression:
GN=   # pattern starts with the literal text GN=
[     # begin a character class
a-z   # any lowercase letter
A-Z   # any uppercase letter
0-9   # any digit
]     # end the character class
*     # the class can repeat zero or more times
So this regular expression matches GN= followed by any number of alphanumeric characters. The second mutate() removes the GN= from the front of the string.
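The extract-then-strip steps can also be collapsed into a single call with a capture group. This is only a sketch, assuming stringr is installed and using shortened example strings rather than the full data:

```r
library(stringr)

# Shortened, made-up strings in the same layout as the Proteins column.
x <- c("sp|P01876|IGHA1_HUMAN OS=Homo sapiens GN=IGHA1 PE=1 ||| sp|P01877|IGHA2_HUMAN GN=IGHA2 PE=1",
       "tr|B4DM05|B4DM05_HUMAN OS=Homo sapiens PE=2 SV=1")

# str_match() returns a character matrix: column 1 is the whole match
# ("GN=IGHA1"), column 2 is the first capture group (the part after "GN=").
gene <- str_match(x, "GN=([a-zA-Z0-9]*)")[, 2]
gene
```

Only the first match in each string is used, and a string without GN= yields NA rather than an error.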
Using the data from @r2evans, you can try this approach:
library(dplyr)
library(stringr)
dat %>%
transmute(Scan,GENE = str_extract(Proteins, "(?<=GN=)\\w+(?=\\s)"))
# Scan GENE
# 1 1 TF
# 2 2 IGHA1
# 3 3 NID1
(?<=GN=) : lookbehind; the match must be preceded by GN=
\\w+ : match word characters at least one time
(?=\\s) : lookahead; the match must be followed by a whitespace character
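To see what each piece of the pattern contributes, it can help to build it up step by step on one string. The sketch below uses base R with perl = TRUE instead of stringr, and the helper name first_match() is an assumption for illustration:

```r
x <- "sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3"

# Illustrative helper: first match of a Perl-style pattern in x.
first_match <- function(pattern) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

with_prefix <- first_match("GN=\\w+")              # no lookbehind: "GN=" is part of the match
no_prefix   <- first_match("(?<=GN=)\\w+")         # lookbehind: GN= is required but not returned
full        <- first_match("(?<=GN=)\\w+(?=\\s)")  # also require whitespace right after the name
```

Here with_prefix is "GN=TF", while no_prefix and full are both "TF". The lookarounds constrain where the match may occur without becoming part of it.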