
Extracting columns from text file

I load a text file (tree.txt) into R with the content below (copied and pasted from the JWEKA J48 command). I use the following command to load the text file:

data3 <- read.table(file.choose(), header = FALSE, sep = ",")

I would like to put each column into a separate variable named COL1, COL2 ... COL8 (this example has 8 columns). If you load the file into Excel as delimited text, each row is split into separate columns (this is the required result). Each COLn will contain the relevant characters of the tree in this example. How can I separate the text file and insert it into these columns automatically while ignoring the header and footer content of the file?

Here is the text file content:

[[1]]                                                               
J48 pruned  tree                                                        
------------------                                                              

MSTV    <=  0.4                                                     
|   MLTV    <=  4.1:    3   -2                                          
|   MLTV    >   4.1                                                 
|   |   ASTV    <=  79                                              
|   |   |   b   <=  1383:00:00  2   -18                                 
|   |   |   b   >   1383                                            
|   |   |   |   UC  <=  05:00   1   -2                              
|   |   |   |   UC  >   05:00   2   -2                              
|   |   ASTV    >   79:00:00    3   -2                                      
MSTV    >   0.4                                                     
|   DP  <=  0                                                   
|   |   ALTV    <=  09:00   1   (170.0/2.0)                                     
|   |   ALTV    >   9                                               
|   |   |   FM  <=  7                                           
|   |   |   |   LBE <=  142:00:00   1   (27.0/1.0)                              
|   |   |   |   LBE >   142                                     
|   |   |   |   |   AC  <=  2                                   
|   |   |   |   |   |   e   <=  1058:00:00  1   -5                      
|   |   |   |   |   |   e   >   1058                                
|   |   |   |   |   |   |   DL  <=  04:00   2   (9.0/1.0)                   
|   |   |   |   |   |   |   DL  >   04:00   1   -2                  
|   |   |   |   |   AC  >   02:00   1   -3                          
|   |   |   FM  >   07:00   2   -2                                  
|   DP  >   0                                                   
|   |   DP  <=  1                                               
|   |   |   UC  <=  03:00   2   (4.0/1.0)                                   
|   |   |   UC  >   3                                           
|   |   |   |   MLTV    <=  0.4:    3   -2                              
|   |   |   |   MLTV    >   0.4:    1   -8                              
|   |   DP  >   01:00   3   -8                                      

Number  of  Leaves  :   16                                              

Size    of  the tree    :   31

An example of the COL1 content will be: MSTV | | | | | | | | MSTV | | | | | | | | | | | | | | | | | | | |

COL2 content will be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |

Try this:

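# Read the file and drop the first 4 and last 4 lines (header and footer).
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
# Parse the remaining lines as fixed-width text, 4 characters per field.
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
                   header = FALSE,
                   widths = rep.int(4, max(nchar(cleaned.txt)/4)),
                   strip.white = TRUE
                   )
# Keep only the columns that are not entirely NA.
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]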

For the cleaning step, I ended up using a combination of head() and tail() to remove the 4 lines at the top and the 4 lines at the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Essentially, I'm just making the file readable to R.
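As a small illustration of the head/tail trimming, here is a sketch on a made-up miniature of the file. The vector `raw` and its contents are hypothetical stand-ins for what readLines() would return; only the 4-line header and 4-line footer layout is taken from the question.

```r
# Hypothetical miniature of the file: 4 header lines, 3 body lines, 4 footer lines.
raw <- c("[[1]]", "J48 pruned  tree", "------------------", "",
         "MSTV    <=  0.4",
         "|   MLTV    <=  4.1:    3   -2",
         "|   MLTV    >   4.1",
         "", "Number  of  Leaves  :   16", "", "Size    of  the tree    :   31")

# head(x, -4) drops the last 4 elements; tail(x, -4) drops the first 4.
body <- tail(head(raw, -4), -4)

print(body)
```

Negative counts in head() and tail() mean "everything except that many elements", which is why no line counting is needed.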

Your file looks like a fixed-width file, so I use read.fwf() and textConnection() to point the function at the cleaned output.
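To see what read.fwf() does with this kind of input, here is a minimal sketch on three lines taken from the tree. The names `tree_lines`, `con`, and `tree_df` are made up for the example, and the fixed field width of 4 characters is an assumption carried over from the code above.

```r
# Hypothetical three-line excerpt of the tree body (header and footer already removed).
tree_lines <- c("MSTV    <=  0.4",
                "|   MLTV    <=  4.1:    3   -2",
                "|   MLTV    >   4.1")

# Every field is assumed to be 4 characters wide; the longest line here has
# 30 characters, so 8 fields cover it.
con <- textConnection(tree_lines)
tree_df <- read.fwf(con, widths = rep(4, 8), header = FALSE, strip.white = TRUE)
close(con)

print(tree_df)
```

Lines shorter than the total width simply produce missing values in the trailing fields, which is why the later clean-up of empty columns is needed.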

Finally, I'm not sure how your data is actually structured, but when I copied it from Stack Overflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess how wide the lines are and to remove the extraneous columns here:

widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
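The two tricks can be tried in isolation. The sketch below uses made-up inputs (`tree_lines` and `parsed` are hypothetical), and rounding the field count up with ceiling() is a variation that is not in the code above; it is only there so that a partial last field is not cut off when lines have no trailing whitespace.

```r
# Hypothetical lines with uneven lengths (30 and 15 characters).
tree_lines <- c("|   MLTV    <=  4.1:    3   -2", "MSTV    <=  0.4")

# Number of 4-character fields needed to cover the longest line.
# ceiling() is an assumption of this sketch, not part of the original answer.
n_fields <- ceiling(max(nchar(tree_lines)) / 4)
widths <- rep.int(4, n_fields)

# Hypothetical parsed result in which V2 came back entirely NA.
parsed <- data.frame(V1 = c("MSTV", "|"), V2 = c(NA, NA), V3 = c("", "MLTV"),
                     stringsAsFactors = FALSE)

# A column is kept only if its NA count is smaller than the number of rows.
keep <- colSums(is.na(parsed)) < nrow(parsed)
parsed <- parsed[, keep, drop = FALSE]

print(names(parsed))
```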

Next, I'm putting the data into the structure you asked for.

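for (i in colnames(cleaned.df)) {
  # Create a variable named after the column (V1, V2, ...) holding that column.
  assign(i, subset(cleaned.df, select=i))
  # Replace it with one space-separated string of its non-blank entries.
  assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}

# Remove the helper objects that are no longer needed.
rm(i)
rm(cleaned.df)
rm(cleaned.txt)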

What this does is loop over each column name in your data frame.

From there, it uses assign() to put the data from each column into its own data frame. In your case, they are named V1 through V15.

Next, it uses cat() and paste0() together with unlist() and capture.output() to collapse each of those data frames into a single character vector, so they are now character vectors instead of data frames.
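If you would rather not create free-standing variables with assign(), the same collapsing can be sketched with a named list. This is an alternative, not the method above: it swaps in paste(..., collapse = " ") for the cat()/capture.output() combination, and `cleaned` and `cols` are hypothetical names with made-up data.

```r
# Hypothetical stand-in for cleaned.df.
cleaned <- data.frame(V1 = c("MSTV", "|", "|"),
                      V2 = c("", "MLTV", "MLTV"),
                      stringsAsFactors = FALSE)

# Collapse each column into one space-separated string, skipping blank or NA entries.
cols <- lapply(cleaned, function(v) {
  v <- as.character(v)
  paste(v[!is.na(v) & v != ""], collapse = " ")
})

# Rename V1, V2, ... to COL1, COL2, ... as the question asks.
names(cols) <- paste0("COL", seq_along(cols))

print(cols$COL1)
print(cols$COL2)
```

Keeping the results in one list means they can be looped over later, and list2env(cols, envir = globalenv()) would turn them into separate variables if that is really needed.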

Keep in mind that because you wanted a space between entries, I'm using a space as the separator. But because this is a fixed-width file, some entries are completely blank, which I'm removing using:

get(i)[get(i)!=""]

(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |.)

If we just use get(i), there will be leading whitespace in the output.
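The effect of the blank filter is easy to check on a small vector. In this sketch `col` is a made-up column, and paste(..., collapse = " ") is used as a simpler stand-in for the cat()/capture.output() call above.

```r
# Hypothetical column as it might come back from read.fwf: blanks where the tree is indented.
col <- c("", "MLTV", "MLTV", "|", "|", "", "DP")

# Without the filter, every blank entry still contributes a separator.
with_blanks <- paste(col, collapse = " ")

# With the filter, only the non-blank entries are joined.
without_blanks <- paste(col[col != ""], collapse = " ")

print(with_blanks)
print(without_blanks)
```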
