Split a single column into multiple columns based on Rows
I have a dataset in R that consists of a single column containing values that I would ideally like spread across multiple columns. The structure of the single-column data frame is this:
A1
200
250
Brand x
A2
400
300
Brand x
A4
100
320
Brand x2
I would like to split this column in such a way that it ends up in a multi-column data frame like this ("|" is purely to denote a column separator):
A1 | 200 | 250 | Brand x
A2 | 400 | 300 | Brand x1
A4 | 100 | 320 | Brand x2
How could I do this? The data almost always follows a fixed sequence horizontally, for example 4 variables: A1, 200, 250, Brand x. A naive equivalent would be copying and transpose-pasting in Excel, but for a predefined sequence of 4 values. Could anyone please help me with this?
Here's how I would do it:
df2 <- as.data.frame(matrix(df1[,1], byrow=TRUE, ncol = 4))
or, equivalently:
df2 <- as.data.frame(t(matrix(df1[,1],nrow = 4)))
In both cases this yields the desired result:
#> df2
# V1 V2 V3 V4
#1 A1 200 250 Brand x
#2 A2 400 300 Brand x
#3 A4 100 320 Brand x2
data
df1 <-read.table(text="A1
200
250
'Brand x'
A2
400
300
'Brand x'
A4
100
320
'Brand x2'", header=FALSE)
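Since the question says the sequence of 4 values holds only most of the time, it may be worth checking the length before reshaping: matrix() only warns and recycles values when the input does not fill the last row. The following is a minimal sketch of that check; the helper name reshape_by_row and its width argument are hypothetical and not part of the answer above.

```r
# Hypothetical helper (not from the original answer): reshape a vector into
# rows of `width` values, failing early if the data does not divide evenly.
reshape_by_row <- function(x, width = 4) {
  x <- as.character(x)
  if (length(x) %% width != 0) {
    stop("length of input (", length(x), ") is not a multiple of ", width)
  }
  # byrow = TRUE places each run of `width` consecutive values in one row
  as.data.frame(matrix(x, ncol = width, byrow = TRUE), stringsAsFactors = FALSE)
}

values <- c("A1", "200", "250", "Brand x",
            "A2", "400", "300", "Brand x",
            "A4", "100", "320", "Brand x2")
df2 <- reshape_by_row(values)
df2
```

With a data frame such as df1, the call would presumably be reshape_by_row(df1[, 1]).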
This is not an elegant solution, but it should work.
Some explanations:
The first two lines only create the data frame, which you would usually obtain by reading in your data.
If a column contains character strings, R (by default in versions before 4.0.0) converts that column into a factor. For this reason I converted it back to a character vector in line 3.
With matrix you can rearrange this vector into the shape you want, and then you can transform it back to a data frame (setting stringsAsFactors=FALSE to prevent everything from being converted into factors, which would otherwise be the default).
However, now all variables are character variables. For this reason you need to convert the variables to appropriate types.
dat<-c("A1",200,250,"Brand x" ,"A2",400,300, "Brand x", "A4",100, 320,"Brand x2")
dat<-data.frame(dat)
dat<-as.character(dat[,1])
dat<-matrix(dat, ncol = 4, byrow=TRUE)
dat<-data.frame(dat, stringsAsFactors = FALSE)
dat[] <- lapply(dat, type.convert)
> str(dat)
'data.frame': 3 obs. of 4 variables:
$ X1: Factor w/ 3 levels "A1","A2","A4": 1 2 3
$ X2: int 200 400 100
$ X3: int 250 300 320
$ X4: Factor w/ 2 levels "Brand x","Brand x2": 1 1 2
> dat
X1 X2 X3 X4
1 A1 200 250 Brand x
2 A2 400 300 Brand x
3 A4 100 320 Brand x2
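If the text columns should stay character rather than become factors, a variant along these lines might be preferable. It is only a sketch: it passes as.is = TRUE to type.convert, and the column names are illustrative assumptions that do not appear in the question.

```r
dat <- c("A1", "200", "250", "Brand x",
         "A2", "400", "300", "Brand x",
         "A4", "100", "320", "Brand x2")
dat <- data.frame(matrix(dat, ncol = 4, byrow = TRUE), stringsAsFactors = FALSE)

# as.is = TRUE keeps non-numeric columns as character instead of factor
dat[] <- lapply(dat, type.convert, as.is = TRUE)

# Column names are assumptions made for illustration only
names(dat) <- c("id", "value1", "value2", "brand")
str(dat)
```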
Just a hint here: if the sequence always repeats (i.e. is deterministic), you could read the data as a vector and change its dimensions, something like:
data <- c("A1","200","250","Brand x","A2","400","300","Brand x","A4","100","320","Brand x2")
dim(data) <- c(4,3)
data <- t(data) # transpose
class(data)
data.df <- as.data.frame(data)
class (data.df)
This changes the dimensions of the data, turning the vector into a matrix (internally a vector and a matrix are stored the same way; only the dimensions differ).
When executed, it will print
> class(data)
[1] "matrix"
> class (data.df)
[1] "data.frame"
and data.df is then a data.frame object, so you can do whatever you need with the data (e.g. change columns to numeric/character/etc.) before processing it.
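The reason for the transpose is that assigning dim fills the matrix column by column, so each record first ends up as a column. A small sketch on toy data (the vector below is an assumption for illustration, with records of two values instead of four):

```r
# Toy data (an assumption for illustration): three records of two values each
x <- c("a", "1", "b", "2", "c", "3")

m <- x
dim(m) <- c(2, 3)   # filled column by column: each column is one record
rec <- t(m)         # transpose so that each row is one record
rec
```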
If it's always 4 values, the loop below did the work for me:
df <- read.csv("df.csv", sep = ";", header = FALSE)
new.df <- data.frame()
j <- 1
i <- 1
while(i < length(df[,1])-1){
temp.df <- data.frame()
temp.df[j,1] <- df[i,1]
temp.df[j,2] <- df[i + 1, 1]
temp.df[j,3] <- df[i + 2, 1]
temp.df[j,4] <- df[i + 3, 1]
new.df <- rbind(new.df, temp.df)
j <- j + 1
i <- i + 4
}
na.omit(new.df)
It's not entirely optimized, but it does the job! Hope it works for you.
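The same idea can probably be written without a loop by indexing every fourth element directly. This is a sketch under the assumption that the values are already in a character vector x and that its length is a multiple of 4; the names x and starts are illustrative.

```r
x <- c("A1", "200", "250", "Brand x",
       "A2", "400", "300", "Brand x",
       "A4", "100", "320", "Brand x2")

# Positions where each record of 4 values begins: 1, 5, 9
starts <- seq(1, length(x), by = 4)

new.df <- data.frame(
  V1 = x[starts],
  V2 = x[starts + 1],
  V3 = x[starts + 2],
  V4 = x[starts + 3],
  stringsAsFactors = FALSE
)
new.df
```

This avoids growing the data frame with rbind on every iteration and needs no na.omit afterwards.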