提取行名的一部分以在 R 的數據框中創建一個新列

Question

我在 R 中有一個名為 cryptdeltact 的數據框，其中包含如下示例信息

# A tibble: 2,293 x 7
# Groups:   Name [72]
   Name          Detector      N   Value    sd    se    ci
   <fct>         <fct>     <dbl>   <dbl> <dbl> <dbl> <dbl>
 1 VG 2H 1 SB1 C ATM           6 11.4    0.653 0.267 0.686
 2 VG 2H 1 SB1 C BetaActin     6  0.0199 0.588 0.240 0.617
 3 VG 2H 1 SB1 C BMPR1a        6  6.49   0.591 0.241 0.620
 4 VG 2H 1 SB1 C BMPR2         6  7.19   0.614 0.251 0.645
 5 VG 2H 1 SB1 C Brca1         6 11.5    0.640 0.261 0.672
 6 VG 2H 1 SB1 C Brca2         6 11.9    0.840 0.343 0.882
 7 VG 2H 1 SB1 C cmyc          6  8.20   0.580 0.237 0.608
 8 VG 2H 1 SB1 C DNAPKCs       6 11.5    0.651 0.266 0.683
 9 VG 2H 1 SB1 C Ercc1         6 11.4    0.783 0.320 0.822
10 VG 2H 1 SB1 C Fen1          6  9.05   0.629 0.257 0.660
# … with 2,283 more rows

我想在此數據框中添加三個新列：Model、時間和區域。 這些新列的所有信息都包含在現有的“名稱”列中。 時間是“名稱”中的第二條信息，即。 “0h”、“2h”或“5h”。 區域是倒數第二個，即“SB1”、“SB2”、“SB3”或“SB4”。 但是 Model 是前兩個字母和最后一個字母 ie 的組合。 “VG C”或“VG V”或“WT C”或“WT V”。 我知道答案在於從 Name 字符串中提取適當的信息並將其放入一個新列中，但我正在努力解決語法問題。

最終表格列理想情況下看起來像這樣（一旦提取，我可以將“VG V”更改為“VG Villus”並完全刪除名稱列）

   Model      Time Region Detector     N     sd     se    ci
   <chr>     <dbl> <chr>  <chr>    <dbl>  <dbl>  <dbl> <dbl>
 1 VG Villus     0 SB1    Fen1         1 NA     NA     NA   
 2 VG Villus     0 SB1    Lig3         1 NA     NA     NA   
 3 VG Villus     0 SB1    PARP1        1 NA     NA     NA   
 4 VG Villus     0 SB1    PolTheta     1 NA     NA     NA   
 5 VG Villus     0 SB1    WRN          1 NA     NA     NA   
 6 VG Villus     2 SB1    Fen1         3  1.22   0.706  3.04
 7 VG Villus     2 SB1    Lig3         3  2.11   1.22   5.25
 8 VG Villus     2 SB1    Mre11a       3  0.601  0.347  1.49
 9 VG Villus     2 SB1    PARP1        3  1.94   1.12   4.82
10 VG Villus     2 SB1    PolTheta     3  2.74   1.58   6.82

為基本問題道歉，但我相信這可能會比目前花費的時間少得多！

Answer 1

我們可以使用tidyr extract和適當的regex ，然后unite列

library(tidyr)

extract(df, Name, into = c("Model", "Time", "Region", "temp"), 
           regex = "(.*)(\\d)H.*(SB\\d).*([A-Z])$") %>%
unite(Model, Model, temp, sep = "")

#   Model Time Region  Detector N   Value    sd    se    ci
#1   VG C    2    SB1       ATM 6 11.4000 0.653 0.267 0.686
#2   VG C    2    SB1 BetaActin 6  0.0199 0.588 0.240 0.617
#3   VG C    2    SB1    BMPR1a 6  6.4900 0.591 0.241 0.620
#4   VG C    2    SB1     BMPR2 6  7.1900 0.614 0.251 0.645
#5   VG C    2    SB1     Brca1 6 11.5000 0.640 0.261 0.672
#6   VG C    2    SB1     Brca2 6 11.9000 0.840 0.343 0.882
#7   VG C    2    SB1      cmyc 6  8.2000 0.580 0.237 0.608
#8   VG C    2    SB1   DNAPKCs 6 11.5000 0.651 0.266 0.683
#9   VG C    2    SB1     Ercc1 6 11.4000 0.783 0.320 0.822
#10  VG C    2    SB1      Fen1 6  9.0500 0.629 0.257 0.660

數據

df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "VG 2H 1 SB1 C", class = "factor"), Detector = 
structure(1:10, .Label = c("ATM", "BetaActin", "BMPR1a", "BMPR2", "Brca1", "Brca2", 
"cmyc", "DNAPKCs", "Ercc1", "Fen1"), class = "factor"), N = c(6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L), Value = c(11.4, 0.0199, 6.49, 7.19, 11.5, 
11.9, 8.2, 11.5, 11.4, 9.05), sd = c(0.653, 0.588, 0.591, 0.614, 
0.64, 0.84, 0.58, 0.651, 0.783, 0.629), se = c(0.267, 0.24, 0.241, 
0.251, 0.261, 0.343, 0.237, 0.266, 0.32, 0.257), ci = c(0.686, 
0.617, 0.62, 0.645, 0.672, 0.882, 0.608, 0.683, 0.822, 0.66)), 
class = "data.frame", row.names = c(NA, -10L))

Answer 2

這應該可以工作，它只使用基礎 R。 另外，我會給你一個獎金，並為你提供一個數字時間變量（我認為這就是你想要的？）。

（假設您的數據框稱為data ）

#string split to create a list of all names
split_col = strsplit(as.character(data$Name), " ")

#create the lists for each new variable
time_var = c()
region_var = c()
model_var = c()

#create a counter for the for loop
i = 1

#go through all the name strings
for (s in split_col){

  #add to the lists
  time_var[[i]] = s[2]
  region_var[[i]] = s[4]
  model_var[[i]] = paste(s[1], s[5])

  #add to the counter
  i = i + 1
}

#add these lists to the dataset
data$model = model_var
data$region = region_var
data$time = time_var

#make the time variable numeric
data$time_numeric = ifelse(data$time == '2H', 2, ifelse(data$time == '5H', 5, ifelse(data$time == '0H', 0, NA)))

希望有效！

提取行名的一部分以在 R 的數據框中創建一個新列

問題描述

2 個解決方案

解決方案1
1 2019-11-11 13:47:40

解決方案2
1 已采納 2019-11-12 18:34:04

提取行名的一部分以在 R 的數據框中創建一個新列

問題描述

2 個解決方案

解決方案1 1 2019-11-11 13:47:40

解決方案2 1 已采納 2019-11-12 18:34:04

解決方案1
1 2019-11-11 13:47:40

解決方案2
1 已采納 2019-11-12 18:34:04