How to search column names of a data frame by a character string and replace the entire column name with a new one (for downstream PCA)
I am trying to create a PCA plot, so I want to regroup my columns by batch (so that I can use my column names as factors). I have read these two questions (one, two) and tried what they suggested, but it has not worked correctly (or I'm doing something wrong).
What I have is a data frame with a few thousand columns with sample names like:
Measure Br_LV_05_BC1_1_POS Br_Lv_05_BC1_2_POS Br_Lv_05_BC1_3_POS Br_Lv_05_LR_1_POS Br_Lv_05_LR_2_POS
500 3000 8000 5000 1000 2000
600 4000 4000 4000 8000 8000
700 5000 6000 4000 9000 8000
800 6000 7000 8000 2000 1000
What I would like to do is search for all columns whose names contain the string "BC1" and rename each of them to just "BC1", and do the same for "LR". This way R can use these names as factors for the PCA, instead of the PCA treating each column as an individual sample.
Measure BC1 BC1 BC1 LR LR
500 3000 8000 5000 1000 2000
600 4000 4000 4000 8000 8000
700 5000 6000 4000 9000 8000
800 6000 7000 8000 2000 1000
That way I can transpose the data (if needed) and cluster my PCA with the samples as factors. I hope I am correct in my thinking. Thank you kindly for your help.
Here is a base R option with sub, where we extract the 4th word from the column names and use it as the new name:
names(df1)[-1] <- sub("^([^_]+_){3}([^_]+)_.*", "\\2", names(df1)[-1])
names(df1)[-1]
#[1] "BC1" "BC1" "BC1" "LR" "LR"
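If the batch label is not always the 4th field, the names can instead be matched by substring, which is closer to the "search and replace" wording of the question. The following is only a sketch: the `batches` vector and the `_<label>_` matching rule are assumptions for illustration, and the toy data frame is a shortened copy of the example data.

```r
# Shortened copy of the example data (three sample columns only)
df1 <- data.frame(
  Measure            = c(500L, 600L, 700L, 800L),
  Br_LV_05_BC1_1_POS = c(3000L, 4000L, 5000L, 6000L),
  Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L, 7000L),
  Br_Lv_05_LR_1_POS  = c(1000L, 8000L, 9000L, 2000L)
)

batches   <- c("BC1", "LR")   # hypothetical list of batch labels to look for
old_names <- names(df1)
new_names <- old_names

for (b in batches) {
  # Match the label as a whole "_"-delimited field, wherever it occurs
  hit <- grepl(paste0("_", b, "_"), old_names, fixed = TRUE)
  new_names[hit] <- b
}

names(df1) <- new_names
names(df1)
#[1] "Measure" "BC1"     "BC1"     "LR"
```

Matching on `_BC1_` rather than plain `BC1` avoids accidentally hitting longer labels that merely contain the same characters.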
Another option is to strsplit at _ and extract the 4th element:
names(df1)[-1] <- sapply(strsplit(names(df1)[-1], "_"), `[`, 4)
We can also use word from stringr:
library(stringr)
names(df1)[-1] <- word(names(df1)[-1], 4, sep="_")
NOTE: It is better not to have duplicate column names. data.frame would change them anyway (via make.unique) when called with its default check.names = TRUE.
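To illustrate the note about duplicate names, here is a small sketch of how base R treats them. The `batch`, `checked`, and `unchecked` objects are made up for this illustration and are not part of the original answer.

```r
batch <- c("BC1", "BC1", "BC1", "LR", "LR")

# make.unique() appends a numeric suffix to repeated values
make.unique(batch)
#[1] "BC1"   "BC1.1" "BC1.2" "LR"    "LR.1"

# data.frame() de-duplicates names by default (check.names = TRUE) ...
checked <- data.frame(BC1 = 1:2, BC1 = 3:4, LR = 5:6)
names(checked)
#[1] "BC1"   "BC1.1" "LR"

# ... and keeps them as given only when check.names = FALSE
unchecked <- data.frame(BC1 = 1:2, BC1 = 3:4, LR = 5:6, check.names = FALSE)
names(unchecked)
#[1] "BC1" "BC1" "LR"
```

Assigning with names(df1) <- ... does not run this check, which is why the duplicated names survive in the examples above.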
data
df1 <- structure(list(Measure = c(500L, 600L, 700L, 800L), Br_LV_05_BC1_1_POS = c(3000L,
4000L, 5000L, 6000L), Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L,
7000L), Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L), Br_Lv_05_LR_1_POS = c(1000L,
8000L, 9000L, 2000L), Br_Lv_05_LR_2_POS = c(2000L, 8000L, 8000L,
1000L)), class = "data.frame", row.names = c(NA, -4L))
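Since the stated goal is a PCA grouped by batch, one way to avoid duplicate column names altogether is to keep the original names and store the batch label in a separate factor. This is only a sketch under assumptions not stated in the question: that each column is one sample, that each row is one measured variable, and that centering and scaling are appropriate for the data. The `batch`, `mat`, and `scores` names are illustrative.

```r
df1 <- structure(list(Measure = c(500L, 600L, 700L, 800L),
  Br_LV_05_BC1_1_POS = c(3000L, 4000L, 5000L, 6000L),
  Br_Lv_05_BC1_2_POS = c(8000L, 4000L, 6000L, 7000L),
  Br_Lv_05_BC1_3_POS = c(5000L, 4000L, 4000L, 8000L),
  Br_Lv_05_LR_1_POS  = c(1000L, 8000L, 9000L, 2000L),
  Br_Lv_05_LR_2_POS  = c(2000L, 8000L, 8000L, 1000L)),
  class = "data.frame", row.names = c(NA, -4L))

# Batch label per sample column: the 4th "_"-separated field of each name
batch <- factor(sapply(strsplit(names(df1)[-1], "_"), `[`, 4))

# prcomp() expects samples in rows, so transpose the measurement columns
mat <- t(as.matrix(df1[-1]))
pca <- prcomp(mat, center = TRUE, scale. = TRUE)

# Scores for the first two components, alongside the grouping factor
scores <- data.frame(sample = rownames(mat), batch = batch, pca$x[, 1:2])

# Colour the points by batch (only drawn in an interactive session)
if (interactive()) {
  plot(scores$PC1, scores$PC2, col = as.integer(scores$batch), pch = 19,
       xlab = "PC1", ylab = "PC2")
  legend("topright", legend = levels(scores$batch),
         col = seq_along(levels(scores$batch)), pch = 19)
}
```

Here the sample names stay unique, and `batch` carries the grouping that the renamed columns were meant to provide.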