Split a single column into multiple columns based on Rows

I have a dataset in R that consists of a single column containing values that I would ideally like spread across multiple columns. The structure of the single-column data frame is this:

A1
200
250
Brand x 
A2
400
300
Brand x
A4
100
320
Brand x2

I would like to split this column so that it ends up in a multi-column data frame like this ("|" is purely to denote a column separator):

A1 | 200 | 250 | Brand x  
A2 | 400 | 300 | Brand x
A4 | 100 | 320 | Brand x2

How could I do this? Most of the time there is a fixed sequence in the data, for example 4 variables: A1, 200, 250, Brand x. A naive equivalent would be copying and transpose-pasting in Excel, but for a predefined sequence of 4 values. Could anyone please help me with this?

Here's how I would do it:

df2 <- as.data.frame(matrix(df1[,1], byrow=TRUE, ncol = 4))

or, equivalently:

df2 <- as.data.frame(t(matrix(df1[,1],nrow = 4)))

In both cases this yields the desired result:

#> df2
#  V1  V2  V3       V4
#1 A1 200 250  Brand x
#2 A2 400 300  Brand x
#3 A4 100 320 Brand x2
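
To see why the two calls agree, here is a minimal sketch on a toy vector (the names `x`, `by_row`, and `via_transpose` are illustrative only and not part of the answer). `matrix()` fills column by column by default, so you either ask for row-wise filling with `byrow = TRUE`, or fill four rows column-wise and transpose afterwards:

```r
# Toy input: two records of four values each (illustrative data)
x <- c("A1", "200", "250", "Brand x",
       "A2", "400", "300", "Brand x")

# Fill row by row: every four consecutive values become one row
by_row <- matrix(x, ncol = 4, byrow = TRUE)

# Default column-wise fill into four rows, then transpose
via_transpose <- t(matrix(x, nrow = 4))

# Both routes should give the same 2 x 4 character matrix
identical(by_row, via_transpose)
```

Either form should work as long as the length of the vector is a multiple of 4.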

Data

df1 <-read.table(text="A1
                       200
                       250
                       'Brand x' 
                       A2
                       400
                       300
                      'Brand x'
                       A4
                       100
                       320
                       'Brand x2'", header=FALSE)

This is not an elegant solution, but it should work.

Some explanations:

The first two lines only create the data frame, which you would usually obtain by reading in your data.

If a column contains character strings, R converts that column to a factor (this was the default behaviour before R 4.0.0). For this reason I convert it back to a character vector in line 3.

With matrix you can rearrange this vector into the shape you want, and then you can turn it back into a data frame (setting stringsAsFactors = FALSE to prevent everything from being converted to factors).

However, now all variables are character variables. For this reason you need to convert the variables to appropriate types.

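dat <- c("A1", 200, 250, "Brand x", "A2", 400, 300, "Brand x", "A4", 100, 320, "Brand x2")
dat <- data.frame(dat)
dat <- as.character(dat[, 1])               # back to a plain character vector
dat <- matrix(dat, ncol = 4, byrow = TRUE)  # one row per group of 4 values
dat <- data.frame(dat, stringsAsFactors = FALSE)

# Guess a suitable type per column; as.is = FALSE turns strings into factors
dat[] <- lapply(dat, type.convert, as.is = FALSE)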

> str(dat)
'data.frame':   3 obs. of  4 variables:
 $ X1: Factor w/ 3 levels "A1","A2","A4": 1 2 3
 $ X2: int  200 400 100
 $ X3: int  250 300 320
 $ X4: Factor w/ 2 levels "Brand x","Brand x2": 1 1 2

> dat
  X1  X2  X3       X4
1 A1 200 250  Brand x
2 A2 400 300  Brand x
3 A4 100 320 Brand x2
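
If you would rather control the column types yourself instead of relying on `type.convert`, something along these lines might do. This is only a sketch: the column names `id`, `val1`, `val2`, and `brand` are made up for illustration, and it assumes the second and third values of every group are whole numbers.

```r
raw <- c("A1", "200", "250", "Brand x",
         "A2", "400", "300", "Brand x",
         "A4", "100", "320", "Brand x2")

# One row per group of four values
m <- matrix(raw, ncol = 4, byrow = TRUE)

# Build the data frame column by column, converting the numeric ones by hand
res <- data.frame(id    = m[, 1],
                  val1  = as.integer(m[, 2]),
                  val2  = as.integer(m[, 3]),
                  brand = m[, 4],
                  stringsAsFactors = FALSE)

str(res)
```

This keeps `id` and `brand` as plain character columns; wrap them in `factor()` if you need factors.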

Just a hint here: if the sequence always repeats (i.e. is deterministic), you could read the values into a vector and change its dimensions, something like:

data <- c("A1","200","250","Brand x","A2","400","300","Brand x","A4","100","320","Brand x2")
dim(data) <- c(4,3)
data <- t(data) # transpose
class(data)
data.df <- as.data.frame(data)
class (data.df)

This turns the vector into a matrix by changing its dimensions (internally a vector and a matrix are stored the same way; only the dimensions differ).

When executed, it will print (on R 4.0.0 or later, the first result reads "matrix" "array"):

> class(data)
[1] "matrix"
> class (data.df)
[1] "data.frame"

data.df is then a data.frame object, so you can do whatever you need with the data (e.g. change a column to numeric, character, etc.) before processing it.
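
A slightly more defensive sketch of the same idea, assuming the record length is stored in a variable (here called `n_fields`, a name chosen for illustration) and that columns 2 and 3 hold numbers:

```r
vals <- c("A1", "200", "250", "Brand x",
          "A2", "400", "300", "Brand x",
          "A4", "100", "320", "Brand x2")
n_fields <- 4  # assumed number of values per record

# Assigning dim() fails unless the length is a multiple of the record size,
# so check up front and stop with a clear error otherwise
stopifnot(length(vals) %% n_fields == 0)

dim(vals) <- c(n_fields, length(vals) / n_fields)
vals.df <- as.data.frame(t(vals), stringsAsFactors = FALSE)

# Columns are named V1..V4 by default; convert the numeric ones
vals.df$V2 <- as.numeric(vals.df$V2)
vals.df$V3 <- as.numeric(vals.df$V3)
```

The up-front check is there because a trailing incomplete record would otherwise surface only as an error from `dim<-`.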

If it's always 4 values, the loop below did the work for me:

df <- read.csv("df.csv", sep = ";", header = FALSE)

new.df <- data.frame()
i <- 1
# Walk through the column four rows at a time
while (i + 3 <= nrow(df)) {

    # Collect four consecutive values into a single-row data frame
    temp.df <- data.frame(V1 = df[i, 1],
                          V2 = df[i + 1, 1],
                          V3 = df[i + 2, 1],
                          V4 = df[i + 3, 1])

    new.df <- rbind(new.df, temp.df)

    i <- i + 4
}
new.df

It's not entirely optimized, but it does the job! Hope it works for you.
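
The same stride-of-four idea can probably be written without a loop by indexing with a sequence of start positions. A minimal sketch, using an in-memory data frame in place of the CSV file (the name `starts` is illustrative, and the row count is assumed to be a multiple of 4):

```r
df <- data.frame(V1 = c("A1", "200", "250", "Brand x",
                        "A2", "400", "300", "Brand x",
                        "A4", "100", "320", "Brand x2"),
                 stringsAsFactors = FALSE)

# Row numbers where each record begins: 1, 5, 9, ...
starts <- seq(1, nrow(df), by = 4)

# Pick the 1st, 2nd, 3rd and 4th value of every record in one go
new.df <- data.frame(V1 = df[starts, 1],
                     V2 = df[starts + 1, 1],
                     V3 = df[starts + 2, 1],
                     V4 = df[starts + 3, 1],
                     stringsAsFactors = FALSE)
new.df
```

This avoids growing a data frame with `rbind` inside a loop, which tends to get slow for long inputs.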
