Reading a .dat file with two delimiters using R

Question

I am trying to read a data sample such as the one below:

1344428:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-881,-3164,0.00;-1,4,-1,-624,1529,0.00;-1,5,-1,-849,-2875,0.00;-1,6,-1,856,-2341,0.00;-1,7,-1,758,-2408,0.00;-1,8,-1,-201,-2307,0.00;-1,9,-1,-963,-2807,0.00;-1,10,-1,-460,-3309,0.00;-1,11,-1,-1645,-1773,0.00;-1,12,-1,1487,-518,0.00;-1,13,-1,685,-3113,0.00;-1,14,-1,-935,-3217,0.00;-1,15,-1,-1101,-2430,0.00;-1,16,-1,754,-2946,0.00;-1,17,-1,823,-2497,0.00;-1,18,-1,-948,-2431,0.00;-1,19,-1,774,-2242,0.00;-1,20,-1,861,-2192,0.00;-1,21,-1,433,-3391,0.00;-1,22,-1,133,-2190,0.00;-1,23,-1,-977,-2585,0.00;-1,24,-1,-968,-2107,0.00;-1,25,-1,175,-3062,0.00;-1,26,-1,265,-2736,0.00;-1,27,-1,67,-2735,0.00;-1,28,-1,-281,-2752,0.00;4,29,-1,5550,4400,0.00;:174,-2563,11,28.67,A,Dead,SetAway;: 1344429:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-885,-3169,0.00;-1,4,-1,-626,1527,0.00;-1,5,-1,-852,-2887,0.00;-1,6,-1,854,-2340,0.00;-1,7,-1,761,-2411,0.00;-1,8,-1,-201,-2307,0.00;-1,9,-1,-967,-2808,0.00;-1,10,-1,-460,-3309,0.00;-1,11,-1,-1647,-1777,0.00;-1,12,-1,1485,-518,0.00;-1,13,-1,687,-3118,0.00;-1,14,-1,-938,-3222,0.00;-1,15,-1,-1100,-2430,0.00;-1,16,-1,744,-2946,0.00;-1,17,-1,815,-2505,0.00;-1,18,-1,-950,-2429,0.00;-1,19,-1,773,-2237,0.00;-1,20,-1,861,-2190,0.00;-1,21,-1,433,-3392,0.00;-1,22,-1,133,-2189,0.00;-1,23,-1,-980,-2593,0.00;-1,24,-1,-961,-2109,0.00;-1,25,-1,176,-3056,0.00;-1,26,-1,265,-2731,0.00;-1,27,-1,67,-2736,0.00;-1,28,-1,-283,-2746,0.00;4,29,-1,5550,4400,0.00;:174,-2563,11,28.67,A,Dead,SetAway;:

The data is separated into 3 chunks:

The first is a timestamp ending with ":"; we can keep this as numeric
then multiple sets of numbers (a multiple of six) ending with ";:"
finally a third chunk (7 elements, a mix of strings and numbers) ending with ";:"

Is there an elegant way to read this data into an R data frame? I tried

read.table("855360.dat",
           header = FALSE,
           sep = ";")

but it requires a lot of manipulation to arrange the elements into the 3 chunks, which I can then join and manipulate.

Answer 1

If a single data frame result is OK, then just replace the colons and semicolons with commas and read it in:

L <- readLines("myfile")
read.table(text =  gsub("[:;]+", ",", L), sep = ",", as.is = TRUE)

or, if you want to generate a nested list structure, then using L from above:

lapply(lapply(strsplit(L, ":"), strsplit, ";"), lapply, strsplit, ",")
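To make the two results concrete, here is a small sketch on a shortened, made-up record (two groups of six instead of 29; `L_demo`, `clean`, `flat`, and `nested` are illustrative names, not from the answer). One thing worth noting: each record ends in ";:", so after the substitution the text ends in a comma, which `read.table` treats as one more (empty) field. The sketch strips the trailing delimiters first, which is an addition to the answer's one-liner rather than part of it.

```r
# Made-up, shortened record: 2 groups of six instead of 29
L_demo <- "1344428:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;:174,-2563,11,28.67,A,Dead,SetAway;:"

# Flat: drop the trailing ";:" first, then turn every delimiter run into one comma
clean <- gsub("[:;]+", ",", sub("[:;]+$", "", L_demo))
flat  <- read.table(text = clean, sep = ",", as.is = TRUE)
dim(flat)   # 1 row; 1 timestamp + 2 * 6 + 7 = 20 columns

# Nested: split on ":" first, then ";", then ","
nested <- lapply(lapply(strsplit(L_demo, ":"), strsplit, ";"), lapply, strsplit, ",")
nested[[1]][[3]][[1]]   # the seven fields of the third chunk
```

The flat form is convenient when every record has the same number of groups; the nested form keeps the three chunks apart at the cost of leaving everything as character strings.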

Answer 2

If this is turned into a multi-pass process, it may take a little longer to run but may be simpler to modify and maintain in the long run.

If you start by splitting the string on the ";:" characters, you'll notice that the odd indices are the main data (including the timestamp), and the even indices are your "third chunk" with mixed num/char entries. Once you break this down, you may realize that we still have a parsing problem, but it's a little simpler.

txt <- "1344428:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-881,-3164,0.00;-1,4,-1,-624,1529,0.00;-1,5,-1,-849,-2875,0.00;-1,6,-1,856,-2341,0.00;-1,7,-1,758,-2408,0.00;-1,8,-1,-201,-2307,0.00;-1,9,-1,-963,-2807,0.00;-1,10,-1,-460,-3309,0.00;-1,11,-1,-1645,-1773,0.00;-1,12,-1,1487,-518,0.00;-1,13,-1,685,-3113,0.00;-1,14,-1,-935,-3217,0.00;-1,15,-1,-1101,-2430,0.00;-1,16,-1,754,-2946,0.00;-1,17,-1,823,-2497,0.00;-1,18,-1,-948,-2431,0.00;-1,19,-1,774,-2242,0.00;-1,20,-1,861,-2192,0.00;-1,21,-1,433,-3391,0.00;-1,22,-1,133,-2190,0.00;-1,23,-1,-977,-2585,0.00;-1,24,-1,-968,-2107,0.00;-1,25,-1,175,-3062,0.00;-1,26,-1,265,-2736,0.00;-1,27,-1,67,-2735,0.00;-1,28,-1,-281,-2752,0.00;4,29,-1,5550,4400,0.00;:174,-2563,11,28.67,A,Dead,SetAway;: 1344429:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-885,-3169,0.00;-1,4,-1,-626,1527,0.00;-1,5,-1,-852,-2887,0.00;-1,6,-1,854,-2340,0.00;-1,7,-1,761,-2411,0.00;-1,8,-1,-201,-2307,0.00;-1,9,-1,-967,-2808,0.00;-1,10,-1,-460,-3309,0.00;-1,11,-1,-1647,-1777,0.00;-1,12,-1,1485,-518,0.00;-1,13,-1,687,-3118,0.00;-1,14,-1,-938,-3222,0.00;-1,15,-1,-1100,-2430,0.00;-1,16,-1,744,-2946,0.00;-1,17,-1,815,-2505,0.00;-1,18,-1,-950,-2429,0.00;-1,19,-1,773,-2237,0.00;-1,20,-1,861,-2190,0.00;-1,21,-1,433,-3392,0.00;-1,22,-1,133,-2189,0.00;-1,23,-1,-980,-2593,0.00;-1,24,-1,-961,-2109,0.00;-1,25,-1,176,-3056,0.00;-1,26,-1,265,-2731,0.00;-1,27,-1,67,-2736,0.00;-1,28,-1,-283,-2746,0.00;4,29,-1,5550,4400,0.00;:174,-2563,11,28.67,A,Dead,SetAway;:"

x <- strsplit(txt, ";:")[[1]]
x <- sapply(x, trimws, USE.NAMES = FALSE)
x[1]
# [1] "1344428:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-881,-3164,0.00;-1,4,-1,-624,1529,0.00;-1,5,-1,-849,-2875,0.00;-1,6,-1,856,-2341,0.00;-1,7,-1,758,-2408,0.00;-1,8,-1,-201,-2307,0.00;-1,9,-1,-963,-2807,0.00;-1,10,-1,-460,-3309,0.00;-1,11,-1,-1645,-1773,0.00;-1,12,-1,1487,-518,0.00;-1,13,-1,685,-3113,0.00;-1,14,-1,-935,-3217,0.00;-1,15,-1,-1101,-2430,0.00;-1,16,-1,754,-2946,0.00;-1,17,-1,823,-2497,0.00;-1,18,-1,-948,-2431,0.00;-1,19,-1,774,-2242,0.00;-1,20,-1,861,-2192,0.00;-1,21,-1,433,-3391,0.00;-1,22,-1,133,-2190,0.00;-1,23,-1,-977,-2585,0.00;-1,24,-1,-968,-2107,0.00;-1,25,-1,175,-3062,0.00;-1,26,-1,265,-2736,0.00;-1,27,-1,67,-2735,0.00;-1,28,-1,-281,-2752,0.00;4,29,-1,5550,4400,0.00"
x[2]
# [1] "174,-2563,11,28.67,A,Dead,SetAway"

An important assumption here is that we will always have pairs of timestamp/data and follow-on chunks:

if (length(x) %% 2 != 0) stop("oops, uneven pairs")
odds <- seq(1, length(x), by = 2)
str(x[odds])
#  chr [1:2] "1344428:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;-1,3,-1,-881,-3164,0.00;-1,4,-1,-624,1529,0.00;-1,5,-1,-849,-2875,0.00;-1"| __truncated__ ...
x[-odds]
# [1] "174,-2563,11,28.67,A,Dead,SetAway" "174,-2563,11,28.67,A,Dead,SetAway"

From here, note that we can easily extract the timestamp with another strsplit, and the rest can be converted into something read.csv likes by replacing ";" with newlines (same with your third chunk):

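# firstsplit/secondsplit are not defined in the original answer; inferred from how they are used below
firstsplit <- strsplit(x[odds], ":")   # per record: c(timestamp, "set;set;...")
secondsplit <- x[-odds]                # the "third chunk" strings
timestamps <- lapply(firstsplit, function(z) data.frame(timestamp = as.numeric(z[1])))
data1 <- lapply(firstsplit, function(lst) read.csv(textConnection(gsub(";", "\n", lst[[2]])), header = FALSE))
data2 <- lapply(secondsplit, function(z) read.csv(textConnection(z), header = FALSE))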
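To see the whole split in one place, here is a self-contained sketch on a shortened, made-up input (two records with two groups of six each). All names ending in `_small`, plus `parts`, `odd`, `main_split`, and `extra`, are illustrative; it uses `read.csv(text = ...)` in place of `textConnection`, which should behave the same way here.

```r
# Made-up input: two records, two groups of six each (the real file has 29)
txt_small <- paste0(
  "1:-1,1,-1,415,-649,0.00;-1,2,-1,1090,-2167,0.00;:174,-2563,11,28.67,A,Dead,SetAway;: ",
  "2:-1,1,-1,416,-650,0.00;-1,2,-1,1091,-2168,0.00;:174,-2563,11,28.67,A,Dead,SetAway;:"
)

# Pass 1: split on the two-character delimiter and drop stray whitespace
parts <- trimws(strsplit(txt_small, ";:", fixed = TRUE)[[1]])
stopifnot(length(parts) %% 2 == 0)   # expect main/extra pairs
odd <- seq(1, length(parts), by = 2)

# Pass 2: odd pieces hold "timestamp:groups", even pieces hold the third chunk
main_split <- strsplit(parts[odd], ":", fixed = TRUE)
extra      <- parts[-odd]

# Pass 3: parse each piece
ts_small     <- as.numeric(vapply(main_split, `[`, character(1), 1))
groups_small <- lapply(main_split, function(p)
  read.csv(text = gsub(";", "\n", p[2], fixed = TRUE), header = FALSE))
extra_small  <- lapply(extra, function(z) read.csv(text = z, header = FALSE))
```

Each element of `groups_small` is a data frame with one row per group of six, and each element of `extra_small` is a one-row data frame with seven columns.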

Taking a peek at one of the pairs of data:

bothlst <- mapply(list, timestamps, data1, data2, SIMPLIFY = FALSE)
str(bothlst[[1]])
# List of 3
#  $ :'data.frame': 1 obs. of  1 variable:
#   ..$ timestamp: num 1344428
#  $ :'data.frame': 29 obs. of  6 variables:
#   ..$ V1: int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V2: int [1:29] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ V3: int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V4: int [1:29] 415 1090 -881 -624 -849 856 758 -201 -963 -460 ...
#   ..$ V5: int [1:29] -649 -2167 -3164 1529 -2875 -2341 -2408 -2307 -2807 -3309 ...
#   ..$ V6: num [1:29] 0 0 0 0 0 0 0 0 0 0 ...
#  $ :'data.frame': 1 obs. of  7 variables:
#   ..$ V1: int 174
#   ..$ V2: int -2563
#   ..$ V3: int 11
#   ..$ V4: num 28.7
#   ..$ V5: Factor w/ 1 level "A": 1
#   ..$ V6: Factor w/ 1 level "Dead": 1
#   ..$ V7: Factor w/ 1 level "SetAway": 1

This is a nice nested-list depiction of your data. I intentionally made the timestamp a data.frame to simplify a later step, though this is certainly not a requirement. (The Factor columns above reflect R versions before 4.0.0; from 4.0.0 onward, read.csv returns character columns unless stringsAsFactors = TRUE.)

If you want this depicted in a single data.frame with all data, there are two things to keep in mind:

1. Your timestamp and "third-chunk data" will be repeated across all rows within the data. This may not be a problem, depending on how you intend to use the data. This method breaks if the assumption of a single row of data in the "third chunk" is invalid.
2. We have the same column names in both of the two data elements. This is a problem that is easily avoided if you have pre-defined columns (always 6 and 7 columns) or if the columns are defined in the data (they are not in your example). If neither of those applies, then you need to decide on a naming convention that works for you. For the sake of this example, I will change the second data.frame from V1 naming to X1 naming.
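The single-row assumption in the first point can be checked before combining. A minimal sketch with small made-up stand-ins for one element of data1 and data2 (`groups_demo`, `extra_demo`, and `combined` are illustrative names):

```r
# Made-up stand-ins: three groups, and a one-row "third chunk"
groups_demo <- data.frame(V1 = c(-1, -1, 4), V2 = 1:3)
extra_demo  <- data.frame(X1 = 174, X2 = "Dead")

# Guard the assumption: recycling is only meaningful for exactly one row
if (nrow(extra_demo) != 1L) stop("third chunk has more than one row")

# Column-binding repeats the single row alongside every group row
combined <- cbind(groups_demo, extra_demo)
```

Here `combined` has three rows, with the `X1` and `X2` values repeated on each.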

With number 2 in mind:

data2mod <- lapply(data2, function(df) setNames(df, paste("X", seq_along(df), sep = "")))
bothlst2 <- mapply(list, timestamps, data1, data2mod, SIMPLIFY = FALSE)

Now, for each element, we can "column-bind" the elements into a single data.frame:

bothdf <- lapply(bothlst2, cbind.data.frame)
str(bothdf)
# List of 2
#  $ :'data.frame': 29 obs. of  14 variables:
#   ..$ timestamp: num [1:29] 1344428 1344428 1344428 1344428 1344428 ...
#   ..$ V1       : int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V2       : int [1:29] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ V3       : int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V4       : int [1:29] 415 1090 -881 -624 -849 856 758 -201 -963 -460 ...
#   ..$ V5       : int [1:29] -649 -2167 -3164 1529 -2875 -2341 -2408 -2307 -2807 -3309 ...
#   ..$ V6       : num [1:29] 0 0 0 0 0 0 0 0 0 0 ...
#   ..$ X1       : int [1:29] 174 174 174 174 174 174 174 174 174 174 ...
#   ..$ X2       : int [1:29] -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 ...
#   ..$ X3       : int [1:29] 11 11 11 11 11 11 11 11 11 11 ...
#   ..$ X4       : num [1:29] 28.7 28.7 28.7 28.7 28.7 ...
#   ..$ X5       : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ X6       : Factor w/ 1 level "Dead": 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ X7       : Factor w/ 1 level "SetAway": 1 1 1 1 1 1 1 1 1 1 ...
#  $ :'data.frame': 29 obs. of  14 variables:
#   ..$ timestamp: num [1:29] 1344429 1344429 1344429 1344429 1344429 ...
#   ..$ V1       : int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V2       : int [1:29] 1 2 3 4 5 6 7 8 9 10 ...
#   ..$ V3       : int [1:29] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#   ..$ V4       : int [1:29] 415 1090 -885 -626 -852 854 761 -201 -967 -460 ...
#   ..$ V5       : int [1:29] -649 -2167 -3169 1527 -2887 -2340 -2411 -2307 -2808 -3309 ...
#   ..$ V6       : num [1:29] 0 0 0 0 0 0 0 0 0 0 ...
#   ..$ X1       : int [1:29] 174 174 174 174 174 174 174 174 174 174 ...
#   ..$ X2       : int [1:29] -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 -2563 ...
#   ..$ X3       : int [1:29] 11 11 11 11 11 11 11 11 11 11 ...
#   ..$ X4       : num [1:29] 28.7 28.7 28.7 28.7 28.7 ...
#   ..$ X5       : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ X6       : Factor w/ 1 level "Dead": 1 1 1 1 1 1 1 1 1 1 ...
#   ..$ X7       : Factor w/ 1 level "SetAway": 1 1 1 1 1 1 1 1 1 1 ...

From here it's rather straightforward to deal with them independently, or to combine them in a similar fashion:

head(do.call("rbind", bothdf))
#   timestamp V1 V2 V3   V4    V5 V6  X1    X2 X3    X4 X5   X6      X7
# 1   1344428 -1  1 -1  415  -649  0 174 -2563 11 28.67  A Dead SetAway
# 2   1344428 -1  2 -1 1090 -2167  0 174 -2563 11 28.67  A Dead SetAway
# 3   1344428 -1  3 -1 -881 -3164  0 174 -2563 11 28.67  A Dead SetAway
# 4   1344428 -1  4 -1 -624  1529  0 174 -2563 11 28.67  A Dead SetAway
# 5   1344428 -1  5 -1 -849 -2875  0 174 -2563 11 28.67  A Dead SetAway
# 6   1344428 -1  6 -1  856 -2341  0 174 -2563 11 28.67  A Dead SetAway

Based on my first point above, you'll notice that the timestamp column and all X* columns are redundant, similar to a join of tables.
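If that redundancy is unwanted, one option is to keep two tables keyed by timestamp and join them only when needed. A minimal sketch with small made-up frames (`positions`, `events`, and `joined` are illustrative names, and the columns are a subset chosen for brevity):

```r
# Made-up stand-in: one row per group of six, tagged with its timestamp
positions <- data.frame(
  timestamp = c(1344428, 1344428, 1344429),
  V2 = c(1, 2, 1),
  V4 = c(415, 1090, 415)
)

# Made-up stand-in: one row per timestamp for the third chunk
events <- data.frame(
  timestamp = c(1344428, 1344429),
  X1 = c(174, 174),
  X6 = c("Dead", "Dead")
)

# Join on demand; merge() matches rows on the shared key column
joined <- merge(positions, events, by = "timestamp")
```

This stores the third-chunk values once per timestamp, and `merge` reproduces the wide layout only when it is actually needed.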

2 answers

Answer 1
2, accepted, 2017-03-14 00:03:37

Answer 2
0, 2017-03-13 21:28:15
