Find and replace conditional column value by row for a large data frame in R

I have a large genomic data set in which I want to find and replace values row by row, based on the "REF" and "ALT" columns of each row. The replacement rules are:

  • If the value = 0, replace it with "-/-"
  • If the value = 1, replace it with two REF alleles
  • If the value = 2, replace it with two ALT alleles
  • If the value = 3, replace it with one REF allele and one ALT allele
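
As a minimal sketch, the four rules above can be written as one vectorised base R function. The helper name `code_to_genotype` and the choice to return NA for any code outside 0 to 3 are assumptions made for illustration; neither comes from the original post.

```r
# Hypothetical helper: map genotype codes (0-3) to allele strings.
# code, ref and alt are vectors of equal length (one element per row).
code_to_genotype <- function(code, ref, alt) {
  out <- rep(NA_character_, length(code))  # codes outside 0-3 stay NA (assumption)
  i0 <- which(code == 0)
  i1 <- which(code == 1)
  i2 <- which(code == 2)
  i3 <- which(code == 3)
  out[i0] <- "-/-"                              # rule 0: missing genotype
  out[i1] <- paste0(ref[i1], "/", ref[i1])      # rule 1: two REF alleles
  out[i2] <- paste0(alt[i2], "/", alt[i2])      # rule 2: two ALT alleles
  out[i3] <- paste0(ref[i3], "/", alt[i3])      # rule 3: one REF, one ALT
  out
}

# Values taken from column E of the example data
ref <- c("A", "T", "C", "G", "C")
alt <- c("G", "G", "T", "A", "T")
e   <- c(1, 3, 1, 1, 2)
code_to_genotype(e, ref, alt)
# "A/A" "T/G" "C/C" "G/G" "T/T"
```

Because the function works on whole vectors, it can be applied to one sample column at a time instead of one row at a time.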

Here is a small subset of example data:

Chr Pos REF ALT  A  B  C  D  E
  1  70   A   G  1  0  1  1  1
  1  80   T   G  1  0  3  3  3
  1 100   C   T  1  0  1  1  1
  2  20   G   A  1  0  0  0  1
  2  80   C   T  1  0  0  0  2

The desired output is:

Chr Pos REF ALT    A    B    C    D    E
  1  70   A   G  A/A  -/-  A/A  A/A  A/A
  1  80   T   G  T/T  -/-  T/G  T/G  T/G
  1 100   C   T  C/C  -/-  C/C  C/C  C/C
  2  20   G   A  G/G  -/-  -/-  -/-  G/G
  2  80   C   T  C/C  -/-  -/-  -/-  T/T

Reproducible data frame:

df=data.frame(
  Chr=c(1,1,1,2,2),
  Pos=c(70,80,100,20,80),
  REF=c("A","T","C","G","C"),
  ALT=c("G","G","T","A","T"),
  A=c(1,1,1,1,1),
  B=c(0,0,0,0,0),
  C=c(1,3,1,0,0),
  D=c(1,3,1,0,0),
  E=c(1,3,1,1,2)
)

I wrote a for loop for the task:

K=data.frame()
for (r in 1:nrow(df)){
  k=df[r,]
  ks=df[r,-1:-4]
  ks[ks==0]="-/-"
  ks[ks==1]=paste0(k$REF,"/",k$REF)
  ks[ks==2]=paste0(k$ALT,"/",k$ALT)
  ks[ks==3]=paste0(k$REF,"/",k$ALT)
  ks=cbind(k[,1:4],ks)
  K=rbind(K,ks)
}

That works, but I have about 150,000 rows and the row-by-row operation takes a very long time. Is there a faster way to do this?

Thank you very much for the help!

We can use case_when: loop across columns 'A' to 'E' with across(), and build each genotype string with sprintf according to the conditions in case_when.

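library(dplyr)

df %>%
  mutate(across(A:E, ~ case_when(
    . == 0 ~ '-/-',                       # missing genotype
    . == 1 ~ sprintf('%1$s/%1$s', REF),   # two REF alleles
    . == 2 ~ sprintf('%1$s/%1$s', ALT),   # two ALT alleles
    TRUE   ~ sprintf('%s/%s', REF, ALT)   # any other value (3 in this data): REF/ALT
  )))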

Output:

  Chr Pos REF ALT   A   B   C   D   E
1   1  70   A   G A/A -/- A/A A/A A/A
2   1  80   T   G T/T -/- T/G T/G T/G
3   1 100   C   T C/C -/- C/C C/C C/C
4   2  20   G   A G/G -/- -/- -/- G/G
5   2  80   C   T C/C -/- -/- -/- T/T

NOTE: As this is done by column and not by row, it should be much faster than the OP's approach.
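
For comparison, the same column-wise idea can be sketched in base R without dplyr. This is an alternative and not the method shown above: it builds a lookup matrix with one row per data row and one column per code (0 to 3), then fills each sample column with a single matrix-indexing call. The names `lookup`, `sample_cols` and `result` are illustrative, and the sketch assumes every value in the sample columns is 0, 1, 2 or 3 (any other value would raise an out-of-bounds error). It has not been timed against the 150,000-row data.

```r
df <- data.frame(
  Chr = c(1, 1, 1, 2, 2),
  Pos = c(70, 80, 100, 20, 80),
  REF = c("A", "T", "C", "G", "C"),
  ALT = c("G", "G", "T", "A", "T"),
  A = c(1, 1, 1, 1, 1),
  B = c(0, 0, 0, 0, 0),
  C = c(1, 3, 1, 0, 0),
  D = c(1, 3, 1, 0, 0),
  E = c(1, 3, 1, 1, 2)
)

# Candidate strings for each row: column 1 = code 0, ..., column 4 = code 3
lookup <- cbind(
  rep("-/-", nrow(df)),
  paste0(df$REF, "/", df$REF),
  paste0(df$ALT, "/", df$ALT),
  paste0(df$REF, "/", df$ALT)
)

# Every column except the four annotation columns is treated as a sample
sample_cols <- setdiff(names(df), c("Chr", "Pos", "REF", "ALT"))
rows <- seq_len(nrow(df))

result <- df
result[sample_cols] <- lapply(df[sample_cols], function(code) {
  # Pick element (row i, column code[i] + 1) for every row at once
  lookup[cbind(rows, code + 1)]
})

result
```

The loop here runs once per sample column, so its cost grows with the number of columns, not with the 150,000 rows.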


