Read csv with two headers into a data.frame

Apologies for the seemingly simple question, but I can't find a solution to the following rearrangement problem.

I'm used to using read.csv to read in files with a single header row, but I have an Excel spreadsheet with two 'header' rows: a cell identifier (a, b, c ... g) and, for each cell, three sets of measurements (x, y and z; 1000s each):

a           b       
x    y  z   x   y   z
10   1  5   22  1   6
12   2  6   21  3   5
12   2  7   11  3   7
13   1  4   33  2   8
12   2  5   44  1   9

The CSV file is below:

a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9

How can I get to a data.frame in R as shown below?

cell x  y   z
a    10 1   5
a    12 2   6
a    12 2   7
a    13 1   4
a    12 2   5
b    22 1   6
b    21 3   5
b    11 3   7
b    33 2   8
b    44 1   9

Use base R reshape():

temp = read.delim(text="a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9", header=TRUE, skip=1, sep=",")
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
OUT
#     time  x y z id
# 1.0    0 10 1 5  1
# 2.0    0 12 2 6  2
# 3.0    0 12 2 7  3
# 4.0    0 13 1 4  4
# 5.0    0 12 2 5  5
# 1.1    1 22 1 6  1
# 2.1    1 21 3 5  2
# 3.1    1 11 3 7  3
# 4.1    1 33 2 8  4
# 5.1    1 44 1 9  5

Basically, you should just skip the first row, which holds the letters a-g, one every third column. Since the sub-column names are all the same, R will automatically append a grouping number to every column after the third (x.1, y.1, z.1, and so on), so we need to add a grouping number to the first three columns ourselves.
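
To see the renaming described above, here is a minimal sketch that reads a shortened copy of the question's data. The variable name csv_text is only an illustration and is not part of the original answer.

```r
# Shortened copy of the question's CSV, kept in a string so the
# example runs without a file on disk (csv_text is a made-up name).
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5"

# skip = 1 drops the row of cell identifiers; the second row becomes the header.
temp <- read.csv(text = csv_text, skip = 1)

# Duplicate header names are made unique by appending a number.
names(temp)
# "x" "y" "z" "x.1" "y.1" "z.1"

# Give the first group a matching suffix so every column is <name>.<group>.
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")
names(temp)
# "x.0" "y.0" "z.0" "x.1" "y.1" "z.1"
```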

You can then either create an "id" variable or, as I've done here, just use the row names for the IDs.
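
If you prefer an explicit "id" variable over row names, a sketch along these lines should work. It assumes the same sample data; csv_text and the choice of seq_len() for the IDs are illustrative, not something the original answer prescribes.

```r
# Same sample layout as in the question (shortened; csv_text is a made-up name).
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5"

temp <- read.csv(text = csv_text, skip = 1)
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")

# Explicit ID column: one ID per original row.
temp$id <- seq_len(nrow(temp))

# Name the ID column with idvar and reshape every other column.
OUT <- reshape(temp, direction = "long", idvar = "id",
               varying = setdiff(names(temp), "id"))
OUT
```

Either way, each original row keeps the same ID in both groups, so you can still tell which measurements sat side by side in the spreadsheet.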

You can change the "time" variable to your "cell" variable as follows:

# Change the following to the number of levels you actually have
OUT$cell = factor(OUT$time, labels=letters[1:2])

Then, drop the "time" column:

OUT$time = NULL
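
Putting the steps so far together, the following sketch goes from the CSV text to the layout asked for in the question (columns cell, x, y, z). The names csv_text and RESULT are illustrative, and dropping the "id" column and the row names at the end is an extra tidy-up step that the answer itself does not include.

```r
# The question's CSV as a string (csv_text is a made-up name).
csv_text <- "a,,,b,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5
12,2,7,11,3,7
13,1,4,33,2,8
12,2,5,44,1,9"

temp <- read.csv(text = csv_text, skip = 1)
names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")
OUT <- reshape(temp, direction = "long", ids = rownames(temp),
               varying = 1:ncol(temp))

# Map group numbers 0, 1 to the cell labels a, b, then drop "time".
OUT$cell <- factor(OUT$time, labels = letters[1:2])
OUT$time <- NULL

# Optional tidy-up: keep only the requested columns, in the requested order.
RESULT <- OUT[, c("cell", "x", "y", "z")]
rownames(RESULT) <- NULL
RESULT
```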

Update 更新

To answer a question in the comments below: if the first label were something other than a letter, this should still pose no problem. The sequence I would take is as follows:

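# Read the measurements, skipping the row of group labels
temp = read.csv("path/to/file.csv", skip=1, stringsAsFactors = FALSE)
# Read only the first row to recover the group labels
GROUPS = read.csv("path/to/file.csv", header=FALSE, 
                  nrows=1, stringsAsFactors = FALSE)
# Drop the empty cells between labels
GROUPS = GROUPS[!is.na(GROUPS)]
names(temp)[1:3] = paste0(names(temp[1:3]), ".0")
OUT = reshape(temp, direction="long", ids=rownames(temp), varying=1:ncol(temp))
# The "time" column lives in OUT (the reshaped data), not in temp
OUT$cell = factor(OUT$time, labels=GROUPS)
OUT$time = NULL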
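
To try the sequence above without a file on disk, here is a sketch that runs it on an inline string. The labels "cell 1" and "cell 2" and the name csv_text are invented for illustration; the point is only that the labels need not be single letters.

```r
# Hypothetical data with non-letter group labels in the first row.
csv_text <- "cell 1,,,cell 2,,
x,y,z,x,y,z
10,1,5,22,1,6
12,2,6,21,3,5"

temp <- read.csv(text = csv_text, skip = 1, stringsAsFactors = FALSE)

# First row only: the labels, with NA in the empty cells between them.
GROUPS <- read.csv(text = csv_text, header = FALSE,
                   nrows = 1, stringsAsFactors = FALSE)
GROUPS <- GROUPS[!is.na(GROUPS)]

names(temp)[1:3] <- paste0(names(temp)[1:3], ".0")
OUT <- reshape(temp, direction = "long", ids = rownames(temp),
               varying = 1:ncol(temp))

# Group numbers 0, 1, ... are replaced by the labels in reading order.
OUT$cell <- factor(OUT$time, labels = GROUPS)
OUT$time <- NULL
OUT
```

This relies on the labels appearing in the first row in the same left-to-right order as the column groups, which is the case for the layout in the question.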
