Stacking columns with similar names in R
I have a CSV file whose awful format I cannot change (simplified here):
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
My desired output is a new CSV containing:
inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"
Basically:
Stack every column that shares the latter part of its name (e.g. a_One and b_One values should be merged into the same column).
In the process, preserve the Inc value from the original row (there may be more than one row like this in various places).
With caveats:
There may be one or more columns like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.
So far, I've accomplished this:
> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
4 Inc a_One a_Two a_Three b_One b_Two b_Three
5 3 9 9.5 15 Things 10 10.5 30 Things
> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4
> filwip <- rawwip[-skips,]
> filwip
V1 V2 V3 V4 V5 V6 V7
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
5 3 9 9.5 15 Things 10 10.5 30 Things
> rawwip[1,]
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
But then when I try to apply tolower() to these strings, I get:
> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"
And this is quite unexpected.
So my questions are:
1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?
2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?
Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.
You can use the base reshape() function. For example, with the input
dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')
you can do
dx <- reshape(subset(dd, Inc!="inc"),
varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
v.names=c("One","Two","Three"),
idvar="Inc",
timevar="label",
times = c("a","b"),
direction="long")
dx
to get
Inc label One Two Three
1.a 1 a 1 1.5 5 Things
2.a 2 a 5 5.5 10 Things
3.a 3 a 9 9.5 15 Things
1.b 1 b 2 2.5 10 Things
2.b 2 b 6 6.5 20 Things
3.b 3 b 10 10.5 30 Things
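Because your input data is messy (embedded headers), read.csv() cannot infer numeric types, so every column comes in as a factor (or as character in R 4.0.0 and later, where stringsAsFactors defaults to FALSE). You could try to convert to proper data types with
# as.is = TRUE keeps text columns as character instead of turning them back into factors
dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)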
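This factor behaviour is also the likely cause of the "4"s in the question: on R versions before 4.0.0, read.csv() stores strings as factors, and tolower() on a one-row data frame ends up with each factor's integer code instead of its label. A minimal sketch of one way around it, assuming the file can be read with stringsAsFactors = FALSE (csv_text, hdr and dat are names made up for this example, and the inline text stands in for the real file):

```r
# Inline copy of the question's sample file (stand-in for horrible.csv)
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

# Keep strings as character so the header row holds text, not factor codes
rawwip <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Flatten the one-row data frame to a plain character vector before tolower()
hdr <- tolower(unlist(rawwip[1, ], use.names = FALSE))

# Drop every repeated header row, apply the cleaned names, then re-infer types
dat <- rawwip[rawwip$V1 != rawwip$V1[1], ]
names(dat) <- hdr
dat[] <- lapply(dat, type.convert, as.is = TRUE)
str(dat)
```

Once hdr is a plain character vector, gsub() and the other string functions work on it as usual.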
I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.
Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.
The packages we need:
library(splitstackshape) # for merged.stack
library(SOfun) # for read.mtable
First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a"; it's a much more convenient format for both merged.stack and reshape.
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
tolower(scan(what = "", sep = ",",
text = readLines("somefile.txt", n = 1))))
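To see what the substitution does in isolation, here is a small sketch that runs the same pattern on a hand-typed vector instead of the file (raw_names and swapped are illustrative names, not part of the original code):

```r
# Header names as they look after tolower(); a stand-in for the scan() result
raw_names <- tolower(c("Inc", "a_One", "a_Two", "a_Three",
                       "b_One", "b_Two", "b_Three"))

# Swap "prefix_stub" to "stub_prefix"; names without "_" are left alone
swapped <- gsub("(.*)_(.*)", "\\2_\\1", raw_names)
print(swapped)

# The first group is greedy, so when a name has several underscores
# only the last one acts as the pivot
multi <- gsub("(.*)_(.*)", "\\2_\\1", "a_b_one")
print(multi)
```

If the real prefixes or stubs contain underscores themselves, the pattern would probably need to be made more specific.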
Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.
This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:
theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")
theData <- setNames(do.call(rbind, theData), theNames)
This is what the data now look like:
theData
# inc one_a two_a three_a one_b two_b three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1 1 1.5 5 Things 2 2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2 5 5.5 10 Things 6 6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three 3 9 9.5 15 Things 10 10.5 30 Things
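If installing "SOfun" is not an option, a rough base R stand-in for this reading step might look like the sketch below. It is not how read.mtable works internally; it simply assumes that every line starting with a letter is a repeated header and can be discarded. The names raw_lines, is_header, col_names and base_data are made up for the example, and the character vector stands in for readLines("somefile.txt").

```r
# Stand-in for readLines("somefile.txt")
raw_lines <- c(
  'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
  '1,1,1.5,"5 Things",2,2.5,"10 Things"',
  '2,5,5.5,"10 Things",6,6.5,"20 Things"',
  'inc,a_one,a_two,a_three,b_one,b_two,b_three',
  '3,9,9.5,"15 Things",10,10.5,"30 Things"'
)

# Assumption: header lines are exactly the lines that start with a letter
is_header <- grepl("^[A-Za-z]", raw_lines)

# Build the column names from the first line, swapped to "stub_prefix" form
col_names <- gsub("(.*)_(.*)", "\\2_\\1",
                  tolower(strsplit(raw_lines[1], ",", fixed = TRUE)[[1]]))

# Parse only the data lines, so read.csv() can infer proper column types
base_data <- read.csv(text = raw_lines[!is_header], header = FALSE,
                      col.names = col_names, stringsAsFactors = FALSE)
str(base_data)
```

Because the header lines never reach read.csv(), the numeric columns come back as numbers without a separate type.convert step.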
From here, you can use merged.stack from "splitstackshape"...
merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
# inc .time_1 one two three
# 1: 1 a 1 1.5 5 Things
# 2: 1 b 2 2.5 10 Things
# 3: 2 a 5 5.5 10 Things
# 4: 2 b 6 6.5 20 Things
# 5: 3 a 9 9.5 15 Things
# 6: 3 b 10 10.5 30 Things
... or reshape from base R:
reshape(theData, direction = "long", idvar = "inc",
varying = 2:ncol(theData), sep = "_")
# inc time one two three
# 1.a 1 a 1 1.5 5 Things
# 2.a 2 a 5 5.5 10 Things
# 3.a 3 a 9 9.5 15 Things
# 1.b 1 b 2 2.5 10 Things
# 2.b 2 b 6 6.5 20 Things
# 3.b 3 b 10 10.5 30 Things
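With well over a thousand columns in unknown positions, the id columns and the stackable columns can be picked out by name rather than by position. The following is a sketch under two assumptions that the original data may not satisfy: every stackable name is already in "stub_prefix" form with exactly one underscore, and id columns such as inc contain no underscore. The names id_cols, stack_cols and stacked are illustrative, and the data frame is typed in by hand to keep the example self-contained.

```r
# Hand-built copy of the cleaned data from the steps above
theData <- data.frame(
  inc     = 1:3,
  one_a   = c(1, 5, 9),
  two_a   = c(1.5, 5.5, 9.5),
  three_a = c("5 Things", "10 Things", "15 Things"),
  one_b   = c(2, 6, 10),
  two_b   = c(2.5, 6.5, 10.5),
  three_b = c("10 Things", "20 Things", "30 Things"),
  stringsAsFactors = FALSE
)

# Assumption: a column is an id column exactly when its name has no underscore
id_cols    <- names(theData)[!grepl("_", names(theData), fixed = TRUE)]
stack_cols <- setdiff(names(theData), id_cols)

# reshape() splits each varying name on sep to guess the stubs and the labels
stacked <- reshape(theData, direction = "long", idvar = id_cols,
                   varying = stack_cols, sep = "_", timevar = "label")
rownames(stacked) <- NULL
print(stacked)
```

Selecting by name means the script does not depend on where the prefixed columns sit in the file, and timevar = "label" gives the column name asked for in the desired output.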