
How can I do a regular expression loop?

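I have a list of files in a physical chemistry dataset that was created through many calculations, and I want to run a foreach or while loop over the Files column of my data frame, which is called CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES.

My file names look like this: "1AH7A_TRP-16-A_GLU-9-A.log:", "1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", "1CXQA_TRP-61-A_ASP-82-A.log:", etc.

I want to loop over the Files column and, whenever a file name contains "GLU" or "ASP", append whichever of the two was found to a list.

For the files above, the printed sequence would be "GLU", "ASP", "GLU", "ASP". The files are not sorted in any particular way, and this continues through all 1273 file entries. I could then save this list, put it into a column titled "Residues" in my data frame, and do some useful exploratory data analysis.

Note: ASP stands for the amino acid aspartic acid, and GLU for the amino acid glutamic acid.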


I know I can run a regular expression search over the Files column with grep, like this.

Searching for "ASP":

> grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)

[1] "1AH7A_TRP-198-A_ASP-197-A.log:"  
[2] "1CXQA_TRP-61-A_ASP-82-A.log:"    
[3] "1EJDA_TRP-279-A_ASP-278-A.log:"  
[4] "1EU1A_TRP-32-A_ASP-33-A.log:" 

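As you can see, I get some matches. In fact, I get 683 matches. But that is not enough. I need to know which residue matched in each row, not just which rows matched.

Of course, I can also grep for "GLU":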

> grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files, value = TRUE)

[1] "1AH7A_TRP-16-A_GLU-9-A.log:"     
[2] "1BGFA_TRP-43-A_GLU-44-A.log:"    
[3] "1D8WA_TRP-17-A_GLU-14-A.log:"

I get a bunch of matches!

I tried a for loop. Of course it failed!!!

  > for(i in 1:length(CD1_and_CH2_Distances$Distance_Files))
{if(grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))

{print("ASP")} 

else if(grep("GLU", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files))

{print("GLU")}}

All it does is print:

[1] "ASP"

[1] "ASP"

[1] "ASP"

...

Even though there are "GLU" entries!

I mean, I can write a basic arithmetic loop that is trivial to anyone:

> for(i in 1:10){print(i^2)}
[1] 1
[1] 4
[1] 9
[1] 16

Anyway, I checked the warnings to see what went wrong:

> warnings() 
Warning messages: 

1: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
  the condition has length > 1 and only the first element will be used
2: In if (grep("ASP", CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)) { ... :
  the condition has length > 1 and only the first element will be used

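As you can see, I get the same warning over and over. I suppose that makes sense, since this is a loop. But why does it happen, and why can't I grep inside a loop?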


The data frame I want to parse looks like this:

"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437

The columns are comma-separated.

This is what I would like the result to look like:

"","Files","Interaction_Energy_kcal_per_Mole","atom","Distance_Angstroms","Residue"
"1","1AH7A_TRP-16-A_GLU-9-A.log:",-8.49787784468197,"CD1",4.03269909613896,"GLU"
"2","1AH7A_TRP-198-A_ASP-197-A.log:",-7.92648167142146,"CD1",3.54307493570204,"ASP"
"3","1BGFA_TRP-43-A_GLU-44-A.log:",-6.73507800775909,"CD1",4.17179517713897,"GLU"
"4","1CXQA_TRP-61-A_ASP-82-A.log:",-9.39887176290279,"CD1",5.29897291934956,"ASP"
"5","1D8WA_TRP-17-A_GLU-14-A.log:",-9.74720319145055,"CD1",3.69398565238145,"GLU"
"6","1D8WA_TRP-17-A_GLU-18-A.log:",-11.3235196065977,"CD1",3.52345441293058,"GLU"
"7","1DJ0A_TRP-223-A_GLU-226-A.log:",-7.46891330209553,"CD1",5.41108436452436,"GLU"
"8","1E58A_TRP-15-A_GLU-18-A.log:",-6.59830781067777,"CD1",4.79790235415437,"GLU"

...

Any help is appreciated! Thanks!

We can split the dataset into a list of data.frames, using a substring derived with sub as the grouping key:

lst <- split(df1, sub(".*_([A-Z]{3})-.*", "\\1", df1$Files))

Data

  df1 <- structure(list(X = 1:8, Files = c("1AH7A_TRP-16-A_GLU-9-A.log:", 
"1AH7A_TRP-198-A_ASP-197-A.log:", "1BGFA_TRP-43-A_GLU-44-A.log:", 
"1CXQA_TRP-61-A_ASP-82-A.log:", "1D8WA_TRP-17-A_GLU-14-A.log:", 
"1D8WA_TRP-17-A_GLU-18-A.log:", "1DJ0A_TRP-223-A_GLU-226-A.log:", 
"1E58A_TRP-15-A_GLU-18-A.log:"), Interaction_Energy_kcal_per_Mole = c(-8.49787784468197, 
-7.92648167142146, -6.73507800775909, -9.39887176290279, -9.74720319145055, 
-11.3235196065977, -7.46891330209553, -6.59830781067777), atom = c("CD1", 
"CD1", "CD1", "CD1", "CD1", "CD1", "CD1", "CD1"), Distance_Angstroms = c(4.03269909613896, 
3.54307493570204, 4.17179517713897, 5.29897291934956, 3.69398565238145, 
3.52345441293058, 5.41108436452436, 4.79790235415437)), .Names = c("X", 
"Files", "Interaction_Energy_kcal_per_Mole", "atom", "Distance_Angstroms"
), class = "data.frame", row.names = c(NA, -8L))
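
If the goal is the Residue column from the question rather than separate data frames, the same sub() call can probably be assigned directly as a new column. This is a minimal sketch, assuming every file name follows the ID_TRP-n-chain_RES-n-chain.log: pattern shown above. The names files, df_demo and lst_demo are illustrative stand-ins, not part of the original answer.

```r
# Stand-in for df1$Files; the real column has 1273 entries.
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")
df_demo <- data.frame(Files = files, stringsAsFactors = FALSE)

# The greedy ".*_" consumes everything up to the last underscore, so the
# three captured capital letters are the residue that follows TRP.
df_demo$Residue <- sub(".*_([A-Z]{3})-.*", "\\1", df_demo$Files)

print(df_demo$Residue)
# [1] "GLU" "ASP" "GLU" "ASP"

# Splitting on the same key gives the list of per-residue data frames.
lst_demo <- split(df_demo, df_demo$Residue)
print(names(lst_demo))
# [1] "ASP" "GLU"
```

Note that sub() returns the input unchanged when the pattern does not match, so a file name with a different layout would show up in the Residue column as the whole name rather than as NA.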

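I'm not sure I fully understood your question, but assuming your data is in a data frame called "dat" (containing the GLU and ASP rows), the code below builds a field that holds the "ASP" or "GLU" value for each row.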

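library(stringr)
library(tidyr)

# Extract each residue name separately; rows without a match get NA.
newvar <- data.frame(GLU = str_extract(dat$Files, "GLU"),
                     ASP = str_extract(dat$Files, "ASP"),
                     stringsAsFactors = FALSE)
# Replace NA with "" so the two columns can be pasted together.
newvar[is.na(newvar)] <- ""
# unite() returns a one-column data frame; take that column as a vector.
new <- unite(newvar, new, GLU:ASP, sep = "")
dat$new <- new$new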

Here the field named new will contain your GLU and ASP values.

Output:

    dat
  X                          Files Interaction_Energy_kcal_per_Mole atom Distance_Angstroms new
1 1    1AH7A_TRP-16-A_GLU-9-A.log:                        -8.497878  CD1           4.032699 GLU
2 2 1AH7A_TRP-198-A_ASP-197-A.log:                        -7.926482  CD1           3.543075 ASP
3 3   1BGFA_TRP-43-A_GLU-44-A.log:                        -6.735078  CD1           4.171795 GLU
4 4   1CXQA_TRP-61-A_ASP-82-A.log:                        -9.398872  CD1           5.298973 ASP
5 5   1D8WA_TRP-17-A_GLU-14-A.log:                        -9.747203  CD1           3.693986 GLU
6 6   1D8WA_TRP-17-A_GLU-18-A.log:                       -11.323520  CD1           3.523454 GLU
7 7 1DJ0A_TRP-223-A_GLU-226-A.log:                        -7.468913  CD1           5.411084 GLU
8 8   1E58A_TRP-15-A_GLU-18-A.log:                        -6.598308  CD1           4.797902 GLU

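After a long time, I figured out a solution to my problem:

# str_split_fixed() comes from the stringr package.
library(stringr)

# Save my column as a plain vector, because factors were setting the world on fire:

Files <- as.vector(CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Files)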

# Split each file name at its two underscores into three pieces and save the third piece back into my vector.

Files <- str_split_fixed(Files, "_", 3)[,3]

Result:

[1] "GLU-9-A.log:"
"ASP-197-A.log:" etc...

# Split these results at the hyphens and keep the first piece, i.e. whatever comes before the first hyphen:

Residues <- str_split_fixed(Files, "-", 3)[,1]

> Residues
   [1] "GLU" "ASP" "GLU", ... 

Add the "Residue" column to my data.frame.

CD1_and_CH2_INTERACTION_ENERGIES_and_DISTANCES$Residue <- Residues

I guess the grep function is overrated. I had to fight hard to get this to work.
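
For what it is worth, the grep family can return the matched text itself and not only the matching rows: regexpr() locates the first match in each string and regmatches() pulls it out. This is a sketch in base R, assuming each file name contains exactly one of _ASP- or _GLU-. The lookarounds (which need perl = TRUE) pin the match to the residue position, so an identifier that happened to contain the letters ASP would not be picked up. The vector files is an illustrative stand-in for the real column.

```r
# Stand-in for the Files column, as a plain character vector.
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")

# Match ASP or GLU only when preceded by "_" and followed by "-".
m <- regexpr("(?<=_)(ASP|GLU)(?=-)", files, perl = TRUE)

# regmatches() returns the matched substrings, one per matching element.
residues <- regmatches(files, m)
print(residues)
# [1] "GLU" "ASP" "GLU" "ASP"
```

One caveat: regmatches() silently drops elements with no match, so if some file names contained neither residue the result would be shorter than the column and could not be assigned back directly.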

Suppose the data you want to analyze is saved in the file glu_vs_asp.csv.

Here is an example of how to create two data frames, one for GLU and one for ASP:

# Read .csv file.
dt <- read.table(file = "glu_vs_asp.csv", sep = ",", header = TRUE)

# Create two data frames, one for GLU and one for ASP.
dt_glu <- dt[grep("GLU", dt$Files),]

dt_asp <- dt[grep("ASP", dt$Files),]

To create a data frame that contains both the GLU and the ASP rows, you could try the following:

dt_glu_asp <- dt[grep("(ASP|GLU)", dt$Files),]

The commands

grep("ASP", dt$Files)
grep("GLU", dt$Files)

give you the indices of the rows whose Files column contains "ASP" and "GLU", respectively.
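
This also explains the warning in the question. grep() applied to the whole column returns a vector of matching row indices, and if() only looks at the first element of that vector. That first index is a non-zero number, so it counts as TRUE on every pass, which is why the loop printed "ASP" each time. In R 4.2.0 and later, a condition of length greater than one is an error rather than a warning. Below is a sketch of a loop that tests one file name per iteration with grepl(), followed by a vectorized equivalent with ifelse(). The vector files is an illustrative stand-in for the real column.

```r
# Stand-in for the Files column, as a plain character vector.
files <- c("1AH7A_TRP-16-A_GLU-9-A.log:",
           "1AH7A_TRP-198-A_ASP-197-A.log:",
           "1BGFA_TRP-43-A_GLU-44-A.log:",
           "1CXQA_TRP-61-A_ASP-82-A.log:")

# Loop version: grepl() on a single element returns a single TRUE/FALSE,
# which is what if() expects.
residues <- character(length(files))
for (i in seq_along(files)) {
  if (grepl("ASP", files[i], fixed = TRUE)) {
    residues[i] <- "ASP"
  } else if (grepl("GLU", files[i], fixed = TRUE)) {
    residues[i] <- "GLU"
  } else {
    residues[i] <- NA_character_  # neither residue found
  }
}
print(residues)
# [1] "GLU" "ASP" "GLU" "ASP"

# Vectorized version: grepl() on the whole vector, combined with ifelse().
residues_vec <- ifelse(grepl("ASP", files, fixed = TRUE), "ASP",
                ifelse(grepl("GLU", files, fixed = TRUE), "GLU",
                       NA_character_))
print(identical(residues, residues_vec))
# [1] TRUE
```

Both versions match "ASP" or "GLU" anywhere in the name, so they rely on those letters not appearing elsewhere in a file name.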
