I have a CSV file whose awful format I cannot change (simplified here):
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
My desired output is a new CSV containing:
inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"
Basically:
Columns that share a name after the prefix need to be stacked (e.g. a_One and b_One values should be merged into the same column).
Each stacked row needs to keep the Inc value from the original row (there may be more than one row like this in various places).
With caveats:
There may be other columns like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.
So far, I've accomplished this:
> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
4 Inc a_One a_Two a_Three b_One b_Two b_Three
5 3 9 9.5 15 Things 10 10.5 30 Things
> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4
> filwip <- rawwip[-skips,]
> filwip
V1 V2 V3 V4 V5 V6 V7
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
5 3 9 9.5 15 Things 10 10.5 30 Things
> rawwip[1,]
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
But then when I try to apply a tolower() to these strings, I get:
> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"
And this is quite unexpected.
So my questions are:
1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?
2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?
Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.
You can use the base reshape() function. For example, with the input
dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')
you can do
dx <- reshape(subset(dd, Inc != "inc"),
    varying = Map(function(x) paste(c("a", "b"), x, sep = "_"), c("One", "Two", "Three")),
    v.names = c("One", "Two", "Three"),
    idvar = "Inc",
    timevar = "label",
    times = c("a", "b"),
    direction = "long")
dx
to get
Inc label One Two Three
1.a 1 a 1 1.5 5 Things
2.a 2 a 5 5.5 10 Things
3.a 3 a 9 9.5 15 Things
1.b 1 b 2 2.5 10 Things
2.b 2 b 6 6.5 20 Things
3.b 3 b 10 10.5 30 Things
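Because your input data is messy (embedded headers), every column is read as text, which in R versions before 4.0.0 means everything comes back as factors. You could try to convert to proper data types with the following (as.is = TRUE keeps text columns as character and avoids the warning newer R versions give when it is omitted):
dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)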
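On the first question (why tolower(rawwip[1,]) returns "4"): the most likely cause is that the columns were read as factors, so coercing the one-row data frame to character yields each factor's internal level code rather than its label. A minimal sketch of one way around it, assuming the sample data from the question and reading every cell as character first; the names csv_text, raw, hdr and body are only illustrative:

```r
# Sample data from the question, inlined so the sketch runs on its own.
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

# Read every cell as plain character so no factor codes are involved.
raw <- read.csv(text = csv_text, header = FALSE, colClasses = "character")

# A one-row data frame is a list; unlist() turns it into a character vector.
hdr <- tolower(unlist(raw[1, ], use.names = FALSE))

# Drop every repeated header row, then apply the cleaned names.
body <- raw[raw$V1 != raw$V1[1], ]
names(body) <- hdr
rownames(body) <- NULL

# Let R pick column types now that only data rows remain.
body[] <- lapply(body, type.convert, as.is = TRUE)
str(body)
```

With the header held in an ordinary character vector, any string function (tolower(), gsub(), and so on) can be applied to it before the names are assigned.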
I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.
Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.
The packages we need:
library(splitstackshape) # for merged.stack
library(SOfun) # for read.mtable
First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a"; it's a much more convenient format for both merged.stack and reshape.
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
                 tolower(scan(what = "", sep = ",",
                              text = readLines("somefile.txt", n = 1))))
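To see what the renaming step does without touching a file, here is a small sketch that applies the same gsub() call to the header names from the question. The vectors original and swapped are illustrative names, and the pattern assumes each prefixed name contains exactly one underscore:

```r
# Header names as they appear in the question's file.
original <- c("Inc", "a_One", "a_Two", "a_Three", "b_One", "b_Two", "b_Three")

# Lowercase, then swap "prefix_stub" to "stub_prefix".
# Names without an underscore (such as "inc") do not match and pass through unchanged.
swapped <- gsub("(.*)_(.*)", "\\2_\\1", tolower(original))
print(swapped)
# [1] "inc"     "one_a"   "two_a"   "three_a" "one_b"   "two_b"   "three_b"
```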
Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.
This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:
theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")
theData <- setNames(do.call(rbind, theData), theNames)
This is what the data now look like:
theData
# inc one_a two_a three_a one_b two_b three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1 1 1.5 5 Things 2 2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2 5 5.5 10 Things 6 6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three 3 9 9.5 15 Things 10 10.5 30 Things
From here, you can use merged.stack from "splitstackshape"....
merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
# inc .time_1 one two three
# 1: 1 a 1 1.5 5 Things
# 2: 1 b 2 2.5 10 Things
# 3: 2 a 5 5.5 10 Things
# 4: 2 b 6 6.5 20 Things
# 5: 3 a 9 9.5 15 Things
# 6: 3 b 10 10.5 30 Things
... or reshape from base R:
reshape(theData, direction = "long", idvar = "inc",
varying = 2:ncol(theData), sep = "_")
# inc time one two three
# 1.a 1 a 1 1.5 5 Things
# 2.a 2 a 5 5.5 10 Things
# 3.a 3 a 9 9.5 15 Things
# 1.b 1 b 2 2.5 10 Things
# 2.b 2 b 6 6.5 20 Things
# 3.b 3 b 10 10.5 30 Things
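If installing extra packages is not an option, the same steps can be approximated in base R alone. The following is a rough sketch rather than a drop-in replacement for read.mtable: it assumes header lines are exactly those starting with a letter, that every prefixed name contains a single underscore, and that any column without an underscore is an identifier to keep. The inlined csv_text stands in for readLines("somefile.txt"), and all variable names are illustrative:

```r
# Sample data from the question; in practice use lines <- readLines("somefile.txt").
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'
lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]]

# Treat every line that starts with a letter as a (repeated) header line.
is_header <- grepl("^[A-Za-z]", lines)

# Build lowercase "stub_prefix" names from the first header line.
first_header <- lines[which(is_header)[1]]
col_names <- gsub("(.*)_(.*)", "\\2_\\1",
                  tolower(scan(text = first_header, what = "", sep = ",", quiet = TRUE)))

# Parse only the data lines, using the cleaned names.
dat <- read.csv(text = paste(lines[!is_header], collapse = "\n"),
                header = FALSE, col.names = col_names, stringsAsFactors = FALSE)

# Columns without "_" are identifiers; every other column gets stacked.
id_cols <- col_names[!grepl("_", col_names)]
stack_cols <- setdiff(col_names, id_cols)

# reshape() splits each name at "_" into the stub and the label.
long <- reshape(dat, direction = "long", idvar = id_cols,
                varying = stack_cols, sep = "_", timevar = "label")
rownames(long) <- NULL
print(long)
```

Because the identifier and stackable columns are derived from the names themselves, nothing about their positions has to be known in advance. The result could then be written out with write.csv(long, ..., row.names = FALSE).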