
Extract info from a column based on the value of another column in a data.frame in R

I have a big file (~100k rows and 100 columns) and I want to extract the information from four columns based on another column. There is a column named Caller, and that column tells you which of the .sample columns will contain something other than noSample.

I have tried if and else if statements, but sometimes two conditions are met, and writing out all the possible combinations would take a lot of effort. I am pretty sure there is a better way of doing it.

My real data.frame looks like this one:

EDIT

 Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
             B= c(10,12,13,14,15,16,17),
             Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
             A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
             dummy1 = 1:7,
             B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
             dummy2 = 1:7,
             C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
             dummy3 = 1:7,
             D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"), stringsAsFactors=FALSE)

For each row, I want to extract a vector of samples. This could be stored in a list or another R object. I will then match these samples against a data.frame in which each sample is associated with a process.

My desired output would be:

  >row1
  3xd|432 
  >row2
   456|789|asd
  >row3
  zxc|vbn|mn
  >row4
  poi|uyh|gfrt|562
  >row5
  [1]1234|567|87sd [2]gfd3|123|456|789
  >row6
  [1]234|456|897a [2]674e|7892|123|432  [3]674e|7892|123
  >row7
  [1]bgcf|12er|567|zxs3|12ple  [2]567|zxs3|12ple

My desired output wouldn't include the pipe | between samples, but I can get rid of it using strsplit.

Since the data.frame is big, speed is essential.

Here is a possible solution:

Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
                 B= c(10,12,13,14,15,16,17),
                 Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
                 A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
                 B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
                 C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
                 D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
                 stringsAsFactors=FALSE)

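#find the first letter of each column name
col_letters<-substr(names(Df), 1, 1)
#set unwanted columns (A, B, Caller) to NA so they can never match
col_letters[-c(4:ncol(Df))]<-NA

#create a regular expression by replacing the comma with the "or" operator |
reg<-gsub(",", "\\|", Df$Caller)

#find the column matches; lapply always returns a list,
#whereas sapply would simplify to a matrix if every row had the same number of callers
columns<-lapply(reg, function(x){grep(x, col_letters)})

#extract the desired columns out into a list
lapply(seq_along(columns), function(x){Df[x,columns[[x]]]})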

I added stringsAsFactors=FALSE to the data frame definition in order to remove the baggage related to the factor levels.
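As a variation on the idea above, here is a minimal sketch that matches columns by their full name (Caller letter plus ".sample") instead of by first letter, and also splits each entry on the pipe. It assumes every letter in Caller has a corresponding <letter>.sample column; the two-row toy data and the names cols, m and samples are only for illustration. The sample columns are converted to a character matrix first because indexing a matrix is generally cheaper than indexing data frame rows, though I have not benchmarked this on 100k rows.

```r
# Toy data in the same layout as the question (two rows only)
Df <- data.frame(
  Caller   = c("A", "A,C"),
  A.sample = c("3xd|432", "1234|567|87sd"),
  dummy1   = 1:2,
  C.sample = c("noSample", "gfd3|123|456|789"),
  stringsAsFactors = FALSE
)

# One vector of column names per row, e.g. c("A.sample", "C.sample")
cols <- lapply(strsplit(Df$Caller, ",", fixed = TRUE), paste0, ".sample")

# Character matrix holding only the .sample columns
m <- as.matrix(Df[grep("\\.sample$", names(Df))])

# For each row, pick the named columns and split on the literal "|"
# (fixed = TRUE is needed because "|" is a regex metacharacter)
samples <- lapply(seq_len(nrow(m)), function(i) {
  unlist(strsplit(m[i, cols[[i]]], "|", fixed = TRUE), use.names = FALSE)
})
```

Here samples is a list with one character vector per row, so samples[[2]] holds the individual samples from both A.sample and C.sample for the second row.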

This shows just one of many possible ways to achieve the desired result. Note that I use the same data frame as @Dave2e, i.e. I have added stringsAsFactors=F to the call to data.frame.

library(tidyverse)
out <- Df %>% rowid_to_column() %>% # adding explicit row IDs
       gather(key, value, -rowid, -A, -B, -Caller) %>% # reshaping the dataframe
       filter(value != "noSample")

The resulting data frame will look like this:

out
   rowid    A  B Caller      key                    value
1      1 chr1 10      A A.sample                  3xd|432
2      5 chr1 15    A,C A.sample            1234|567|87sd
3      6 chr1 16  A,B,C A.sample             234|456|897a
4      2 chr1 12      B B.sample              456|789|asd
5      6 chr1 16  A,B,C B.sample        674e|7892|123|432
6      7 chr1 17    B,D B.sample bgcf|12er|567|zxs3|12ple
7      3 chr1 13      C C.sample               zxc|vbn|mn
8      5 chr1 15    A,C C.sample         gfd3|123|456|789
9      6 chr1 16  A,B,C C.sample            674e|7892|123
10     4 chr1 14      D D.sample         poi|uyh|gfrt|562
11     7 chr1 17    B,D D.sample           567|zxs3|12ple

Now we can simply subset to retrieve the desired result:

out[out$rowid == 1,"value"]
[1] "3xd|432"
out[out$rowid == 5,"value"]
[1] "1234|567|87sd"    "gfd3|123|456|789"
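If a list with one element per row is preferred over subsetting row by row, base R's split() can build it in one call. The sketch below uses a small hand-written stand-in for out with only the rowid and value columns; by_row and by_row_split are illustrative names, not part of the answer above.

```r
# Stand-in for the long-format data frame produced above (rows 1 and 5 only)
out <- data.frame(
  rowid = c(1, 5, 5),
  value = c("3xd|432", "1234|567|87sd", "gfd3|123|456|789"),
  stringsAsFactors = FALSE
)

# Group the values by original row; the list names are the row IDs as strings
by_row <- split(out$value, out$rowid)

# Optionally break each entry into individual samples on the literal "|"
by_row_split <- lapply(by_row, function(v) {
  unlist(strsplit(v, "|", fixed = TRUE))
})
```

The elements are then accessed by row ID as a string, for example by_row[["5"]].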
