
Correlation matrix from a text file

I am trying to build a correlation matrix from some text files that I have. I want to extract the correlation values from these files.

The text file I have:

[56] "[1] \”values “”of the                                                                                                          
[57] "[1] \”e”xamples                                                                                                              
[58] "[1] \”dummy “”lines                                                                                            
[59] "[1] \”testing”                                                                                                                     
[60] "[1] \"Correlation Values\””                                                                                                         
[61] "[1] \"Correlation between XXX and YYY: 0.7054 (0.0429)\""                                                                            
[62] "[1] \"Correlation between XXX and ZZZ: 0.601 (0.0289)\""                                                                             
[63] "[1] \"Correlation between YYY and ZZZ: 0.6434 (0.0306)\""                                                                            
[64] "[1] \”Finished\””                                                                                        
[65] "[1] \”testing “”linne                                                                            
[66] “test”                                                                                                                                          
[67] “test “again   

The matrix should look like this:

      XXX       YYY      ZZZ
XXX   1        0.7054    0.601
YYY   0.7054   1         0.6434
ZZZ   0.601    0.6434    1

I understand that some regex technique is involved, but I think it's too advanced for a novice like me. I can get the lines I want from the file using the following, but I still can't work out how to extract those numbers and put them in a matrix.

mm[grep("Correlation Values", mm, value = FALSE) + c(1:3)] ## mm is the above file that I loaded.

To add to the complexity, the variables and numbers change from file to file. For example, this is the case of a 4*4 matrix:

[95] "[1] \"Correlation Values\””                                                                                                                                 
 [96] "[1] \"Correlation between XXX and YYY: 0.7054 (0.0429)\""                                                                                                    
 [97] "[1] \"Correlation between XXX and ZZZ: 0.601 (0.0289)\""                                                                                                     
 [98] "[1] \"Correlation between XXX and CCC: 0.0178 (0.0281)\""                                                                                                    
 [99] "[1] \"Correlation between YYY and ZZZ: 0.6434 (0.0306)\""                                                                                                    
[100] "[1] \"Correlation between YYY and CCC: 0.0103 (0.0286)\""                                                                                                    
[101] "[1] \"Correlation between ZZZ and CCC: 0.0174 (0.0202)\""                                                                                                    
[102] "[1] \”Finished\””    

Well, this is a start anyway. It's not elegant, but step by step it gets you to having just the relevant information in a list. I put your data in a file called sofile.txt.

# read the messy file
filedata <- readLines("../bugs/sofile.txt", warn = FALSE)
# get rid of lines you don't need.
preline<- grep("Correlation Values", filedata, fixed = TRUE)
postline<- grep("Finished", filedata, fixed = TRUE)
filedata <- filedata[(preline+1):(postline-1)]
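# just keep the important parts of the strings
# (these fixed offsets depend on the exact line layout, e.g. the width of the
# "[61]" prefix and the trailing characters, so adjust them for your file)
filedata <- substr(filedata, 33, nchar(filedata)-13)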
filedata <- sub( ":", "", filedata, fixed = TRUE)
filedata <- sub( " and", "", filedata, fixed = TRUE)
# split them up and make a list
filedata_list<- strsplit(filedata, split = " ")
# put it into a matrix 
new <- Reduce(rbind, filedata_list)
# extract the variable names
names <- unique(c(new[,1], new[,2]))
#create a matrix of NAs with the right dimensions and names.
corrmat <- matrix(nrow =length(names),  ncol = (length(names)), dimnames = list(names, names))
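
If the fixed character positions in the substr() step turn out to be fragile (for example when the index prefix gets wider), a regex with capture groups is one possible alternative. The following is only a sketch: the sample lines are typed in by hand to mirror the question, and the names `sample_lines`, `pattern` and `pairs` are made up for illustration.

```r
# Hand-typed sample lines that mimic what readLines() would return for the question's file.
sample_lines <- c(
  "[60] \"[1] \\\"Correlation Values\\\"\"",
  "[61] \"[1] \\\"Correlation between XXX and YYY: 0.7054 (0.0429)\\\"\"",
  "[62] \"[1] \\\"Correlation between XXX and ZZZ: 0.601 (0.0289)\\\"\"",
  "[63] \"[1] \\\"Correlation between YYY and ZZZ: 0.6434 (0.0306)\\\"\"",
  "[64] \"[1] \\\"Finished\\\"\""
)

# Three capture groups: first variable, second variable, correlation value.
pattern <- "Correlation between ([^ :]+) and ([^ :]+): (-?[0-9.]+)"

# regexec() + regmatches() return the full match plus the groups for each line;
# lines that do not match yield character(0) and are dropped here.
hits <- regmatches(sample_lines, regexec(pattern, sample_lines))
hits <- hits[lengths(hits) == 4]

# Bind the matches into a matrix, then keep the groups as a data frame.
m <- do.call(rbind, hits)
pairs <- data.frame(
  var1  = m[, 2],
  var2  = m[, 3],
  value = as.numeric(m[, 4]),
  stringsAsFactors = FALSE
)
print(pairs)
```

Because the pattern only looks for the "Correlation between ... and ...: value" text, it should not matter how many characters come before or after it on the line, and the header and "Finished" lines are skipped automatically.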

Then you would work on replacing the NAs, which you could do by working your way through the list and assigning the values.

Again, ugly as anything, but it will get you started.

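# loop over every pair, not every name: a 4-variable file has 6 pairs but only 4 names
for (i in seq_along(filedata_list)){
 corrmat[filedata_list[[i]][1], filedata_list[[i]][2]] <- as.numeric(filedata_list[[i]][3])
 corrmat[filedata_list[[i]][2], filedata_list[[i]][1]] <- as.numeric(filedata_list[[i]][3])
}
# the diagonal is always 1
diag(corrmat) <- 1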
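
For what it's worth, the fill step can also be done without a loop by indexing the matrix with a two-column matrix of row and column names. This is a sketch under the assumption that the pairs have already been parsed; the `pairs` data frame below is typed in by hand from the 4*4 example in the question, and `vars` is just an illustrative name.

```r
# Pairs and values copied by hand from the 4*4 example in the question.
pairs <- data.frame(
  var1  = c("XXX", "XXX", "XXX", "YYY", "YYY", "ZZZ"),
  var2  = c("YYY", "ZZZ", "CCC", "ZZZ", "CCC", "CCC"),
  value = c(0.7054, 0.601, 0.0178, 0.6434, 0.0103, 0.0174),
  stringsAsFactors = FALSE
)

# Variable names in order of first appearance.
vars <- unique(c(pairs$var1, pairs$var2))

# Start from an identity matrix so the diagonal is already 1.
corrmat <- diag(length(vars))
dimnames(corrmat) <- list(vars, vars)

# A two-column character matrix indexes by (row name, column name);
# filling both orientations keeps the matrix symmetric.
corrmat[cbind(pairs$var1, pairs$var2)] <- pairs$value
corrmat[cbind(pairs$var2, pairs$var1)] <- pairs$value

print(corrmat)
```

Starting from diag() means the matrix is numeric from the outset, so the values do not end up stored as strings.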
