How can I do a regular expression loop?
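My situation is that I have a list of files in a physical chemistry data set that was created from many calculations, and I would like to run a foreach or while loop once through the Files column of my data frame, which is called CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.
My file names look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc...
I want to run a while or foreach loop through the "Files" column and, if the word "GLU" or "ASP" is present in a file name, print whichever one was found to a list.
So for the files above, the printed sequence would be "GLU", "ASP", "GLU", "ASP". My files are not sorted in any particular way, and this continues all the way down through my 1273 file entries. I could then save this list, put it in a column of the data frame titled "Residues", and do some useful exploratory data analysis.
Note: ASP stands for the amino acid aspartate, and GLU stands for the amino acid glutamate.
I know I can grep for a regular expression in the "Files" column like this.
Searching for "ASP":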
> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-198-A_ASP-197-A.log:"
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"
[4] "1EU1A_TRP-32-A_ASP-33-A.log:"
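As you can see, I get a bunch of matches. In fact, I get 683 matches. But this isn't good enough. I need to know which residue matched in each row, not just which rows matched.
Of course, I can grep for "GLU" as well: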
> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)
[1] "1AH7A_TRP-16-A_GLU-9-A.log:"
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"
And I get a bunch of matches!
I tried a for loop. Of course it failed!!!
> for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("ASP")}
else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))
{print("GLU")}}
All it did was print:
[1] "ASP"
[1] "ASP"
[1] "ASP"
...
Even when there was a "GLU"!
I mean, I can do basic algebra loops that don't matter to anyone:
> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16
Anyway, I checked the warnings to see what went wrong:
> warnings()
Warning messages:
1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
the condition has length > 1 and only the first element will be used
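As you can see, I get the same warning over and over again. I guess that makes sense, since this is a loop. But why does it happen, and why can't I grep inside a loop?
The data frame I want to parse looks like this: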
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437
The columns are comma-separated.
This is what I would like the result to look like:
"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms", "Residue",
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896, "GLU",
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204, "ASP",
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897, "GLU",
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956, "ASP",
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145, "GLU",
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058, "GLU",
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436, "GLU",
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437, "GLU",
...
Any help is appreciated! Thanks!
We can split the dataset into a list of data.frames, using a substring derived with sub as the grouping key:
lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))
df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:",
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:",
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:",
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197,
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055,
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1",
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896,
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145,
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X",
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
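The same sub() pattern can also fill a Residue column directly, which is the layout the question asks for. The following is only a minimal sketch: the data frame name toy and its four rows are illustrative stand-ins built from file names quoted in the question, not the full data set.

```r
# Illustrative file names copied from the question.
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")
toy <- data.frame(Files = files, stringsAsFactors = FALSE)  # hypothetical name

# The greedy ".*_" consumes everything up to the last underscore, so the
# captured three capital letters are the second residue (GLU or ASP), not TRP.
toy$Residue <- sub(".*_([A-Z]{3})-.*", "\\1", toy$Files)
print(toy$Residue)

# split() on the same key gives a named list with one data.frame per residue.
lst <- split(toy, toy$Residue)
print(names(lst))
```

Keeping the key as a column means the residue can be used for grouping or plotting later without re-parsing the file names.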
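I'm not sure I fully understood your question, but suppose your data is in a data frame called "dat" (containing the rows with GLU and ASP). The code below builds a field that holds the "ASP" or "GLU" value for each row.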
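library(stringr)
library(tidyr)
# Extract each residue name into its own field (NA where it is absent).
newvar <- NULL
newvar$GLU <- str_extract(dat$Files,"(GLU)")
newvar$ASP <- str_extract(dat$Files,"(ASP)")
newvar1 <- data.frame(newvar, stringsAsFactors = FALSE)
newvar1
# Replace NA with empty strings, then paste the two fields together.
newvar1[is.na(newvar1)] = ""
new <- unite(newvar1, new, GLU:ASP, sep='')
# unite() returns a data frame, so assign its single column, not the frame itself.
dat$new <- new$new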
Here, the field named new will contain your GLU and ASP values.
Output:
dat
X Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1 1AH7A_TRP-16-A_GLU-9-A.log: -8.497878 CD1 4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log: -7.926482 CD1 3.543075 ASP
3 3 1BGFA_TRP-43-A_GLU-44-A.log: -6.735078 CD1 4.171795 GLU
4 4 1CXQA_TRP-61-A_ASP-82-A.log: -9.398872 CD1 5.298973 ASP
5 5 1D8WA_TRP-17-A_GLU-14-A.log: -9.747203 CD1 3.693986 GLU
6 6 1D8WA_TRP-17-A_GLU-18-A.log: -11.323520 CD1 3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log: -7.468913 CD1 5.411084 GLU
8 8 1E58A_TRP-15-A_GLU-18-A.log: -6.598308 CD1 4.797902 GLU
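If loading extra packages is not desirable, the same extraction can be sketched in base R with regexpr() and regmatches(). This is an alternative to the stringr code above, not a restatement of it; the third file name is a hypothetical example added only to show what happens when neither residue is present.

```r
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "no_residue_here.log:")   # hypothetical name with neither residue

# regexpr() reports the position of the first match, or -1 where there is none.
m <- regexpr("ASP|GLU", files)
found <- as.vector(m) > 0

# regmatches() returns only the matched substrings, so fill them into a
# pre-allocated vector and leave NA for the names that did not match.
residue <- rep(NA_character_, length(files))
residue[found] <- regmatches(files, m)
print(residue)
```

A single alternation pattern avoids building two separate fields and merging them afterwards.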
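After a long time, I came up with a solution to my problem:
library(stringr)  # provides str_split_fixed()
# Save my column as a plain vector, because factors were setting the world on fire:
Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)
# Split each file name at the underscores into three pieces and keep only the third piece:
Files <- str_split_fixed(Files, "_", 3)[,3]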
Result:
[1] "GLU-9-A.log:" "ASP-197-A.log:" etc...
# Split these results at the hyphens and keep the first piece, i.e. whatever comes before the first hyphen:
Residues <- str_split_fixed(Files, "-", 3)[,1]
> Residues
[1] "GLU" "ASP" "GLU", ...
Add the "Residue" column to my data.frame:
CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residues
I guess the grep function is overrated. I had to work hard for this one.
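For completeness, the loop from the question can be made to work. The original version never used the loop index i and passed the whole column to grep(), which returns a vector of matching row indices; if() then looked only at the first index, a non-zero number that counts as TRUE, so "ASP" was printed on every pass. (In R 4.2.0 and later, a condition of length greater than one is an error rather than a warning.) A minimal sketch, assuming a small illustrative vector named files, tests one file name at a time with grepl():

```r
# Illustrative file names copied from the question.
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")

residues <- character(length(files))   # pre-allocate the result
for (i in seq_along(files)) {
  # grepl() on a single element returns one TRUE/FALSE, which is what if() needs.
  if (grepl("ASP", files[i], fixed = TRUE)) {
    residues[i] <- "ASP"
  } else if (grepl("GLU", files[i], fixed = TRUE)) {
    residues[i] <- "GLU"
  } else {
    residues[i] <- NA_character_       # neither residue found
  }
}
print(residues)
```

The loop is shown only to explain the warning; the vectorized approaches in this thread do the same job without an explicit loop.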
Suppose the data you want to analyze is saved in the file glu_vs_asp.csv.
Here is an example of how to create two data frames, one for GLU and one for ASP:
# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)
# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]
dt_asp <- dt[grep("ASP", dt$Files),]
To create a data frame that contains both the GLU and the ASP rows, you can try the following:
dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]
The commands
grep("ASP", dt$Files)
grep("GLU", dt$Files)
give you the indices of the rows whose "Files" column contains "ASP" and "GLU", respectively.
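Because grep() returns row indices, it is convenient for subsetting but awkward for labelling each row. Its sibling grepl() returns one TRUE/FALSE per row instead, which can be combined with ifelse() to build the Residue column the question asks for. The sketch below assumes a small stand-in data frame dt built from file names in the question, rather than the real glu_vs_asp.csv.

```r
# Stand-in for the data read from glu_vs_asp.csv (illustrative rows only).
dt <- data.frame(Files = c("1AH7A_TRP-16-A_GLU-9-A.log:",
                           "1AH7A_TRP-198-A_ASP-197-A.log:",
                           "1BGFA_TRP-43-A_GLU-44-A.log:",
                           "1CXQA_TRP-61-A_ASP-82-A.log:"),
                 stringsAsFactors = FALSE)

asp_rows <- grep("ASP", dt$Files)    # integer indices of the matching rows
asp_flag <- grepl("ASP", dt$Files)   # one logical value per row

# Label every row in one vectorized step; rows with neither residue get NA.
dt$Residue <- ifelse(grepl("ASP", dt$Files), "ASP",
                     ifelse(grepl("GLU", dt$Files), "GLU", NA))

# Count how many rows carry each residue.
print(table(dt$Residue))
```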