Extracting parts of a row name to make a new column in a data frame in R
I have a data frame in R called cryptdeltact that contains information like the sample below
# A tibble: 2,293 x 7
# Groups: Name [72]
Name Detector N Value sd se ci
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 VG 2H 1 SB1 C ATM 6 11.4 0.653 0.267 0.686
2 VG 2H 1 SB1 C BetaActin 6 0.0199 0.588 0.240 0.617
3 VG 2H 1 SB1 C BMPR1a 6 6.49 0.591 0.241 0.620
4 VG 2H 1 SB1 C BMPR2 6 7.19 0.614 0.251 0.645
5 VG 2H 1 SB1 C Brca1 6 11.5 0.640 0.261 0.672
6 VG 2H 1 SB1 C Brca2 6 11.9 0.840 0.343 0.882
7 VG 2H 1 SB1 C cmyc 6 8.20 0.580 0.237 0.608
8 VG 2H 1 SB1 C DNAPKCs 6 11.5 0.651 0.266 0.683
9 VG 2H 1 SB1 C Ercc1 6 11.4 0.783 0.320 0.822
10 VG 2H 1 SB1 C Fen1 6 9.05 0.629 0.257 0.660
# … with 2,283 more rows
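I want to add three new columns to this data frame: Model, Time and Region. All the information for these new columns is contained in the existing Name column. Time is the second piece of information in Name, i.e. "0h", "2h" or "5h". Region is the second to last, i.e. "SB1", "SB2", "SB3" or "SB4". Model, however, is a combination of the first two letters and the last letter, i.e. "VG C", "VG V", "WT C" or "WT V". I know the answer lies in extracting the appropriate information from the Name string and putting it into a new column, but I am struggling with the syntax.
The final table columns would ideally look like this (once the values are extracted, I can change "VG V" to "VG Villus" and drop the Name column entirely)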
Model Time Region Detector N sd se ci
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 VG Villus 0 SB1 Fen1 1 NA NA NA
2 VG Villus 0 SB1 Lig3 1 NA NA NA
3 VG Villus 0 SB1 PARP1 1 NA NA NA
4 VG Villus 0 SB1 PolTheta 1 NA NA NA
5 VG Villus 0 SB1 WRN 1 NA NA NA
6 VG Villus 2 SB1 Fen1 3 1.22 0.706 3.04
7 VG Villus 2 SB1 Lig3 3 2.11 1.22 5.25
8 VG Villus 2 SB1 Mre11a 3 0.601 0.347 1.49
9 VG Villus 2 SB1 PARP1 3 1.94 1.12 4.82
10 VG Villus 2 SB1 PolTheta 3 2.74 1.58 6.82
Apologies for the basic question, but I am sure this will take far less time than I have spent on it so far!
We can use tidyr's extract with an appropriate regex, and then unite the columns
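library(tidyr)
# Capture the model prefix, the hour digit, the region and the final letter
extract(df, Name, into = c("Model", "Time", "Region", "temp"),
        regex = "(.*)(\\d)H.*(SB\\d).*([A-Z])$") %>%
  # Join the prefix and the final letter, e.g. "VG " and "C" become "VG C"
  unite(Model, Model, temp, sep = "")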
# Model Time Region Detector N Value sd se ci
#1 VG C 2 SB1 ATM 6 11.4000 0.653 0.267 0.686
#2 VG C 2 SB1 BetaActin 6 0.0199 0.588 0.240 0.617
#3 VG C 2 SB1 BMPR1a 6 6.4900 0.591 0.241 0.620
#4 VG C 2 SB1 BMPR2 6 7.1900 0.614 0.251 0.645
#5 VG C 2 SB1 Brca1 6 11.5000 0.640 0.261 0.672
#6 VG C 2 SB1 Brca2 6 11.9000 0.840 0.343 0.882
#7 VG C 2 SB1 cmyc 6 8.2000 0.580 0.237 0.608
#8 VG C 2 SB1 DNAPKCs 6 11.5000 0.651 0.266 0.683
#9 VG C 2 SB1 Ercc1 6 11.4000 0.783 0.320 0.822
#10 VG C 2 SB1 Fen1 6 9.0500 0.629 0.257 0.660
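To see which part of a name each capture group picks up, the same pattern can be tried on a single string with base R. This is only an illustrative sketch: the variable names `name`, `pattern`, `parts`, `groups` and `model` are made up for the example, and the sample value is the one shown in the question.

```r
# Sample value taken from the Name column in the question
name <- "VG 2H 1 SB1 C"
pattern <- "(.*)(\\d)H.*(SB\\d).*([A-Z])$"

# regexec() returns the full match followed by one entry per capture group
parts <- regmatches(name, regexec(pattern, name))[[1]]

# Drop the full match and label the groups like the `into` argument above
groups <- parts[-1]
names(groups) <- c("Model", "Time", "Region", "temp")
print(groups)

# Pasting the first and last groups with no separator mirrors unite(sep = "")
model <- paste0(groups[["Model"]], groups[["temp"]])
print(model)
```

The first group keeps its trailing space ("VG "), which is why joining it to the last letter with an empty separator still gives a space between the two.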
Data
df <- structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "VG 2H 1 SB1 C", class = "factor"), Detector =
structure(1:10, .Label = c("ATM", "BetaActin", "BMPR1a", "BMPR2", "Brca1", "Brca2",
"cmyc", "DNAPKCs", "Ercc1", "Fen1"), class = "factor"), N = c(6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L), Value = c(11.4, 0.0199, 6.49, 7.19, 11.5,
11.9, 8.2, 11.5, 11.4, 9.05), sd = c(0.653, 0.588, 0.591, 0.614,
0.64, 0.84, 0.58, 0.651, 0.783, 0.629), se = c(0.267, 0.24, 0.241,
0.251, 0.261, 0.343, 0.237, 0.266, 0.32, 0.257), ci = c(0.686,
0.617, 0.62, 0.645, 0.672, 0.882, 0.608, 0.683, 0.822, 0.66)),
class = "data.frame", row.names = c(NA, -10L))
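The question also mentions relabelling "VG V" as "VG Villus" once the new columns exist. A minimal base R sketch of that step, assuming the Model values look like the ones listed in the question (the vectors `model` and `model_long` are hypothetical names used only here):

```r
# Hypothetical Model values of the kind produced by the extraction step
model <- c("VG C", "VG V", "WT C", "WT V")

# Replace a trailing " V" with " Villus"; values that do not end in " V"
# are returned unchanged
model_long <- sub(" V$", " Villus", model)
print(model_long)
```

Anchoring the pattern with `$` keeps the replacement from touching the "V" in "VG".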
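This should work, and it uses only base R. As a bonus, it also gives you a numeric time variable (I think that is what you want?).
(This assumes your data frame is called data.)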
# Split each Name on spaces; the result is a list with one character vector per row
split_col <- strsplit(as.character(data$Name), " ")
# Preallocate a character vector for each new variable
n <- length(split_col)
time_var <- character(n)
region_var <- character(n)
model_var <- character(n)
# Go through all the name strings
for (i in seq_along(split_col)) {
  s <- split_col[[i]]
  # Fill in the i-th element of each vector
  time_var[i] <- s[2]
  region_var[i] <- s[4]
  model_var[i] <- paste(s[1], s[5])
}
# Add these vectors to the data frame
data$model <- model_var
data$region <- region_var
data$time <- time_var
# Make the time variable numeric
data$time_numeric <- ifelse(data$time == "2H", 2,
                     ifelse(data$time == "5H", 5,
                     ifelse(data$time == "0H", 0, NA)))
Hope that works!
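The same base R idea can also be written without an explicit loop. The following is a sketch rather than a drop-in replacement: it builds a small made-up data frame, and it assumes every name has exactly five space-separated parts, as in the sample shown in the question.

```r
# Hypothetical sample data with the same Name layout as in the question
data <- data.frame(
  Name = c("VG 2H 1 SB1 C", "WT 5H 2 SB3 V", "VG 0H 1 SB2 V"),
  stringsAsFactors = TRUE
)

# Split every name on spaces and stack the pieces into a character matrix
# (one row per name, one column per part)
parts <- do.call(rbind, strsplit(as.character(data$Name), " ", fixed = TRUE))

# Model is the first and last part; Region is the fourth part
data$model <- paste(parts[, 1], parts[, 5])
data$region <- parts[, 4]

# Strip the trailing "H" (or "h") and convert what remains to a number
data$time_numeric <- as.numeric(sub("[Hh]$", "", parts[, 2]))
print(data)
```

Converting with `as.numeric()` after removing the letter avoids listing each time point by hand, at the cost of returning NA (with a warning) for any value that is not a number followed by "H".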