
Extracting specific characters from a specific column in an R data frame

I am trying to extract the gene name from a column named "Proteins" in a data frame in R. My data frame looks like this:


Scan    Proteins
1   7:: [sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3 ||| sp|TRFE_HUMAN| ||| tr|B4DHZ6|B4DHZ6_HUMAN Transferrin, isoform CRA_c OS=Homo sapiens GN=TF PE=2 SV=1]
2   21:: [sp|P01876|IGHA1_HUMAN Ig alpha-1 chain C region OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| sp|P01877|IGHA2_HUMAN Ig alpha-2 chain C region OS=Homo sapiens GN=IGHA2 PE=1 SV=3]
3   2:: [sp|P14543|NID1_HUMAN Nidogen-1 OS=Homo sapiens GN=NID1 PE=1 SV=3 ||| tr|B4DM05|B4DM05_HUMAN cDNA FLJ51241, highly similar to Nidogen-1 OS=Homo sapiens PE=2 SV=1]

I want to get only the first gene name in each row (e.g., TF for Scan 1, IGHA1 for Scan 2). How can I do this in R?

Any comments would be helpful. Thanks.

A straightforward regex does this:

dat <- data.frame(Scan=1:3, Proteins=c("7:: [sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3 ||| sp|TRFE_HUMAN| ||| tr|B4DHZ6|B4DHZ6_HUMAN Transferrin, isoform CRA_c OS=Homo sapiens GN=TF PE=2 SV=1]", "21:: [sp|P01876|IGHA1_HUMAN Ig alpha-1 chain C region OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| sp|P01877|IGHA2_HUMAN Ig alpha-2 chain C region OS=Homo sapiens GN=IGHA2 PE=1 SV=3]", "2:: [sp|P14543|NID1_HUMAN Nidogen-1 OS=Homo sapiens GN=NID1 PE=1 SV=3 ||| tr|B4DM05|B4DM05_HUMAN cDNA FLJ51241, highly similar to Nidogen-1 OS=Homo sapiens PE=2 SV=1]"))

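# Lazy ^.*? stops at the first GN=; a greedy ^.* would return the last one (IGHA2 for row 2).
sub("^.*?GN=([^ ]+).*", "\\1", dat$Proteins, perl = TRUE)
# [1] "TF"    "IGHA1" "NID1"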
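
Whether the quantifier in front of GN= is greedy or lazy decides which gene comes back when a row lists more than one GN= field. The sketch below illustrates this in base R on a shortened, made-up stand-in for row 2 (the string `x` is an assumption for illustration, not the full original value):

```r
# Shortened, hypothetical stand-in for row 2: it contains two GN= fields.
x <- "21:: [sp|P01876|IGHA1_HUMAN OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| sp|P01877|IGHA2_HUMAN OS=Homo sapiens GN=IGHA2 PE=1 SV=3]"

# Greedy: ^.* consumes as much as possible, so the capture lands on the last GN= field.
greedy <- sub("^.*GN=([^ ]+).*", "\\1", x)

# Lazy: ^.*? consumes as little as possible, so the capture lands on the first GN= field.
# perl = TRUE selects PCRE syntax, where lazy quantifiers are well defined.
lazy <- sub("^.*?GN=([^ ]+).*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy))
```

If the first gene name is what you need, the lazy form is the one to use.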

You can use string extraction powered by regular expressions.

library(tidyverse)
dat %>% 
  mutate(Gene = str_extract(Proteins, pattern = "GN=[a-zA-Z0-9]*")) %>%  # first "GN=..." token in each row
  mutate(Gene = str_remove(Gene, pattern = "^GN="))                     # drop the leading "GN="

Here is a breakdown of the regular expression:

GN=     # pattern starts with the literal text GN=
[       # begin a character class
a-z     # any lowercase letter
A-Z     # any uppercase letter
0-9     # any digit
]       # end of the character class
*       # the class can repeat zero or more times

So this regular expression matches GN= followed by any number of alphanumeric characters. The second mutate() removes the GN= from the front of the string.
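
The two steps can also be folded into one by putting the gene name in a capture group and reading that group back. The following is a minimal sketch using `str_match()` from stringr; the vector `proteins` is shortened, illustrative data (its second entry deliberately has no GN= field), not the original data frame:

```r
library(stringr)

# Shortened, hypothetical rows: the second one has no GN= field at all.
proteins <- c(
  "7:: [sp|P02787|TRFE_HUMAN Serotransferrin OS=Homo sapiens GN=TF PE=1 SV=3]",
  "2:: [tr|B4DM05|B4DM05_HUMAN cDNA FLJ51241 OS=Homo sapiens PE=2 SV=1]"
)

# str_match() returns a matrix: column 1 is the whole match,
# column 2 is the first capture group (the text after GN=).
gene <- str_match(proteins, "GN=([A-Za-z0-9]+)")[, 2]

print(gene)
```

Rows without a GN= field come back as NA, so the result keeps one entry per input row.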

Using the data from @r2evans, you can try this approach:

library(dplyr)
library(stringr) 
dat %>% 
  transmute(Scan,GENE = str_extract(Proteins, "(?<=GN=)\\w+(?=\\s)"))
#   Scan  GENE
# 1    1    TF
# 2    2 IGHA1
# 3    3  NID1
  • (?<=GN=) : lookbehind; the match must come right after GN=
  • \\w+ : match one or more word characters
  • (?=\\s) : lookahead; the match must be followed by whitespace
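
The same lookbehind idea can be written without extra packages. This is a rough base R sketch, assuming `perl = TRUE` so that lookbehind is available; the vector `x` is shortened, illustrative data rather than the original column. Note that `regmatches()` silently drops elements with no match, so the result is filled into an NA vector to keep it aligned with the input:

```r
# Shortened, hypothetical rows; the last one has no GN= field.
x <- c(
  "OS=Homo sapiens GN=TF PE=1 SV=3 ||| OS=Homo sapiens GN=TF PE=2 SV=1",
  "OS=Homo sapiens GN=IGHA1 PE=1 SV=2 ||| OS=Homo sapiens GN=IGHA2 PE=1 SV=3",
  "OS=Homo sapiens PE=2 SV=1"
)

# regexpr() locates only the first match in each element (-1 means no match).
m <- regexpr("(?<=GN=)\\w+", x, perl = TRUE)

# Start from NA and fill in only the elements that matched.
gene <- rep(NA_character_, length(x))
gene[m != -1] <- regmatches(x, m)

print(gene)
```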
