Stacking columns with similar names in R

I have a CSV file whose awful format I cannot change (simplified here):

Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"

My desired output is a new CSV containing:

inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"

Basically:

  • lowercase the headers
  • strip off header prefixes and preserve them by adding them to a new column
  • remove header repetitions in later rows
  • stack all columns that share the latter part of their names (e.g. a_One and b_One values should be merged into the same column).
  • During this process, preserve the Inc value from the original row (there may be more than one column like this in various places).

With caveats:

  • I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.
  • There may or may not be more than one column like Inc that needs to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.

So far, I've accomplished this:

> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
   V1    V2    V3        V4    V5    V6        V7
1 Inc a_One a_Two   a_Three b_One b_Two   b_Three
2   1     1   1.5  5 Things     2   2.5 10 Things
3   2     5   5.5 10 Things     6   6.5 20 Things
4 Inc a_One a_Two   a_Three b_One b_Two   b_Three
5   3     9   9.5 15 Things    10  10.5 30 Things

> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4

> filwip <- rawwip[-skips,]
> filwip
  V1 V2  V3        V4 V5   V6        V7
2  1  1 1.5  5 Things  2  2.5 10 Things
3  2  5 5.5 10 Things  6  6.5 20 Things
5  3  9 9.5 15 Things 10 10.5 30 Things

> rawwip[1,]
   V1    V2    V3      V4    V5    V6      V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three

But then when I try to apply tolower() to these strings, I get:

> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"

And this is quite unexpected.

So my questions are:

1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?

2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?

Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.

You can use the base reshape() function. For example, with the input

dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')

you can do

dx <- reshape(subset(dd, Inc!="inc"), 
    varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
    v.names=c("One","Two","Three"),
    idvar="Inc",    
    timevar="label",
    times = c("a","b"),
    direction="long")
dx

to get

    Inc label One  Two     Three
1.a   1     a   1  1.5  5 Things
2.a   2     a   5  5.5 10 Things
3.a   3     a   9  9.5 15 Things
1.b   1     b   2  2.5 10 Things
2.b   2     b   6  6.5 20 Things
3.b   3     b  10 10.5 30 Things
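
The call above hard-codes the prefixes ("a", "b") and the shared names. Since the question says the column names are not known ahead of time, the same arguments can probably be derived from names(dd). A minimal sketch, assuming every stackable column is named <prefix>_<stub> with a single underscore, every prefix carries every stub, and the id columns uniquely identify each row; stack_cols, id_cols, prefixes and stubs are illustrative names, not part of the original answer:

```r
# Sample input with the repeated header row already removed.
dd <- read.csv(text = 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
3,9,9.5,"15 Things",10,10.5,"30 Things"', stringsAsFactors = FALSE)

# Assumption: stackable columns look like <prefix>_<stub>;
# anything else (here only Inc) is treated as an id column.
stack_cols <- grep("^[^_]+_[^_]+$", names(dd), value = TRUE)
id_cols    <- setdiff(names(dd), stack_cols)
prefixes   <- unique(sub("_.*$", "", stack_cols))    # "a" "b"
stubs      <- unique(sub("^[^_]*_", "", stack_cols)) # "One" "Two" "Three"

# One list element per stub, listing that stub's columns in prefix order.
varying <- lapply(stubs, function(s) paste(prefixes, s, sep = "_"))
stopifnot(all(unlist(varying) %in% names(dd)))  # every prefix must carry every stub

dx <- reshape(dd,
    varying   = varying,
    v.names   = stubs,
    idvar     = id_cols,
    timevar   = "label",
    times     = prefixes,
    direction = "long")

# Match the desired output: lowercase headers, plain row numbers.
names(dx) <- tolower(names(dx))
rownames(dx) <- NULL
```

If some prefixes lack some stubs, the stopifnot() check fails early instead of letting reshape() produce a confusing error.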

Because your input data is messy (embedded headers), every column is read in as a factor (or, in R 4.0.0 and later, where stringsAsFactors defaults to FALSE, as character). You could try to convert to proper data types with

dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)
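
On the first question: in R versions before 4.0.0, read.csv() turns strings into factors by default, so rawwip[1,] is a one-row data frame of factors, and tolower() ends up working on the factors' integer codes (here "4", since the header text sorts last among each column's levels) rather than on the labels. A minimal sketch of one way to get at the strings, converting each column to character first; stringsAsFactors = TRUE is set explicitly only to reproduce that behaviour on newer R versions, and hdr is an illustrative name:

```r
rawwip <- read.csv(text = 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"',
    header = FALSE, stringsAsFactors = TRUE)  # reproduce the factor columns

# rawwip[1, ] is a list of length-1 factors, so convert each element
# to character before treating the row as a vector of strings.
hdr <- unname(tolower(vapply(rawwip[1, ], as.character, character(1))))
hdr
# [1] "inc"     "a_one"   "a_two"   "a_three" "b_one"   "b_two"   "b_three"
```

Reading the file with stringsAsFactors = FALSE (or colClasses = "character") should avoid the problem at the source, at the cost of converting the data columns afterwards.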

I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.

Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.

The packages we need:

library(splitstackshape) # for merged.stack
library(SOfun)           # for read.mtable

First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a"; it's a much more convenient format for both merged.stack and reshape.

theNames <- gsub("(.*)_(.*)", "\\2_\\1", 
                 tolower(scan(what = "", sep = ",", 
                              text = readLines("somefile.txt", n = 1))))

Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.

This will create a list of data.frames, so we use do.call(rbind, ...) to put them together in a single data.frame:

theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")

theData <- setNames(do.call(rbind, theData), theNames)

This is what the data now look like:

theData
#                                               inc one_a two_a   three_a one_b two_b   three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1   1     1   1.5  5 Things     2   2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2   2     5   5.5 10 Things     6   6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three     3     9   9.5 15 Things    10  10.5 30 Things

From here, you can use merged.stack from "splitstackshape"...

merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
#    inc .time_1 one  two     three
# 1:   1       a   1  1.5  5 Things
# 2:   1       b   2  2.5 10 Things
# 3:   2       a   5  5.5 10 Things
# 4:   2       b   6  6.5 20 Things
# 5:   3       a   9  9.5 15 Things
# 6:   3       b  10 10.5 30 Things

... or reshape from base R:

reshape(theData, direction = "long", idvar = "inc", 
        varying = 2:ncol(theData), sep = "_")
#     inc time one  two     three
# 1.a   1    a   1  1.5  5 Things
# 2.a   2    a   5  5.5 10 Things
# 3.a   3    a   9  9.5 15 Things
# 1.b   1    b   2  2.5 10 Things
# 2.b   2    b   6  6.5 20 Things
# 3.b   3    b  10 10.5 30 Things
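
If installing the GitHub-only package is not an option, the read step can probably be approximated in base R. A rough sketch, assuming (as above) that header lines start with a letter, that every repeated header matches the first one up to case, and that each stackable name contains exactly one underscore; the lines vector stands in for readLines("somefile.txt"), and is_header, hdr and long are illustrative names:

```r
# Stand-in for readLines("somefile.txt").
lines <- c('Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
           '1,1,1.5,"5 Things",2,2.5,"10 Things"',
           '2,5,5.5,"10 Things",6,6.5,"20 Things"',
           'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
           '3,9,9.5,"15 Things",10,10.5,"30 Things"')

# Assumption: header lines start with a letter, data lines do not.
is_header <- grepl("^[A-Za-z]", lines)

# Build lowercase "stub_prefix" names (e.g. "one_a") from the first header line.
hdr <- tolower(scan(text = lines[1], what = "", sep = ",", quiet = TRUE))
theNames <- sub("^([^_]+)_(.+)$", "\\2_\\1", hdr)

# Parse only the data lines, so numeric columns keep their numeric type.
theData <- read.csv(text = lines[!is_header], header = FALSE,
                    col.names = theNames, stringsAsFactors = FALSE)

# Stack every column whose name contains the separator; the rest are ids.
long <- reshape(theData, direction = "long",
                idvar = theNames[!grepl("_", theNames)],
                varying = grep("_", theNames, value = TRUE),
                timevar = "label", sep = "_")
rownames(long) <- NULL
```

Because the header rows are dropped before parsing, no type conversion is needed afterwards. write.csv(long, "stacked.csv", row.names = FALSE) would then write the result, with "stacked.csv" as a placeholder file name.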
