Stacking columns with similar names in R

Question

I have a CSV file whose awful format I cannot change (simplified here):

Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"

My desired output is a new CSV containing:

inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"

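Basically:

Lowercase the headers
Strip off the header prefixes and preserve them by adding them to a new column
Remove the header repetitions in later rows
Stack every column that shares the latter part of its name (e.g. a_One and b_One values should be merged into the same column).
During this process, preserve the Inc value from the original row (there may be more than one row like this in various places).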

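Caveats:

I don't know the column names in advance (many files, many different columns). They need to be parsed if they are to be used as the logic for stripping the repeated header rows.
There may be one or more columns, like Inc, that need to be preserved while everything else is stacked. Generally, Inc stands for any column that does not have a prefix such as a_ or b_. I already have a regex that strips these prefixes.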

So far, I have done this:

> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
   V1    V2    V3        V4    V5    V6        V7
1 Inc a_One a_Two   a_Three b_One b_Two   b_Three
2   1     1   1.5  5 Things     2   2.5 10 Things
3   2     5   5.5 10 Things     6   6.5 20 Things
4 Inc a_One a_Two   a_Three b_One b_Two   b_Three
5   3     9   9.5 15 Things    10  10.5 30 Things

> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4

> filwip <- rawwip[-skips,]
> filwip
  V1 V2  V3        V4 V5   V6        V7
2  1  1 1.5  5 Things  2  2.5 10 Things
3  2  5 5.5 10 Things  6  6.5 20 Things
5  3  9 9.5 15 Things 10 10.5 30 Things

> rawwip[1,]
   V1    V2    V3      V4    V5    V6      V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three

But when I try to apply tolower() to these strings, I get:

> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"

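Which is very unexpected.

So my questions are:

1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulation functions?

2) Once I've done that, what is the most efficient way to stack the columns with shared names while preserving the inc value for each row?

Bear in mind that there will be well over a thousand repeated columns, which can be filtered down to about 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.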

Answer 1

You can use the base reshape() function. For example, with the input

dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')

you can do

dx <- reshape(subset(dd, Inc!="inc"), 
    varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
    v.names=c("One","Two","Three"),
    idvar="Inc",    
    timevar="label",
    times = c("a","b"),
    direction="long")
dx

to get

    Inc label One  Two     Three
1.a   1     a   1  1.5  5 Things
2.a   2     a   5  5.5 10 Things
3.a   3     a   9  9.5 15 Things
1.b   1     b   2  2.5 10 Things
2.b   2     b   6  6.5 20 Things
3.b   3     b  10 10.5 30 Things
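
Since the question says the column names are not known in advance, the `varying`, `v.names` and `times` arguments could be derived from the names themselves. The following is only a sketch under stated assumptions: every stackable column has the form `prefix_stub`, every prefix carries the same set of stubs, and columns without an underscore (like `Inc`) are id columns. The toy data frame and the helper names `stackable`, `prefixes` and `stubs` are hypothetical.

```r
# Toy data in the same shape as the question (hypothetical values)
dd <- data.frame(Inc = 1:2,
                 a_One = c(1, 5), a_Two = c(1.5, 5.5),
                 b_One = c(2, 6), b_Two = c(2.5, 6.5))

# Assumption: any column name containing "_" is "prefix_stub"
stackable <- grep("_", names(dd), value = TRUE)
prefixes  <- unique(sub("_.*$", "", stackable))     # part before the first "_"
stubs     <- unique(sub("^[^_]*_", "", stackable))  # part after the first "_"

# One vector of source columns per stub, ordered by prefix
varying <- lapply(stubs, function(s) paste(prefixes, s, sep = "_"))

dx <- reshape(dd,
              varying   = varying,
              v.names   = stubs,
              idvar     = "Inc",
              timevar   = "label",
              times     = prefixes,
              direction = "long")
dx
```

If some prefixes lack some stubs, the `paste()` step would name columns that do not exist, so the sketch would need a check before calling `reshape()`.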

Because your input data is messy (embedded headers), everything is created as factors. You could try to convert to the proper data types with

dx[]<-lapply(lapply(dx, as.character), type.convert)
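
The factor columns are also the likely reason `tolower(rawwip[1,])` returned "4" in the question: `tolower()` coerces the one-row data frame to character, and for factor columns that coercion appears to yield the integer level codes rather than the labels. As a minimal sketch (not part of the original answer), each one-element factor can be converted to its label first, and the same `as.character()` plus `type.convert()` idea then restores the column types. `stringsAsFactors = TRUE` is set explicitly here as an assumption, to mimic the behaviour shown in the question; `as.is = TRUE` keeps text columns as character.

```r
# Read a shortened copy of the sample data without a header row
rawwip <- read.csv(text = 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"',
                   header = FALSE, stringsAsFactors = TRUE)

# Convert each one-element factor in row 1 to its label, not its level code
header_raw <- vapply(rawwip[1, ], as.character, character(1))

# Flag every row that repeats the header, compared as plain strings
is_header <- as.character(rawwip$V1) == header_raw[[1]]

# Keep the data rows, then re-detect each column's type from its text values
body <- rawwip[!is_header, ]
body[] <- lapply(lapply(body, as.character), type.convert, as.is = TRUE)
names(body) <- tolower(unname(header_raw))
str(body)
```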

Answer 2

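I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.

Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.

The packages we need: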

library(splitstackshape) # for merged.stack
library(SOfun)           # for read.mtable

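First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a", which is a much more convenient format for both merged.stack and reshape.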

theNames <- gsub("(.*)_(.*)", "\\2_\\1", 
                 tolower(scan(what = "", sep = ",", 
                              text = readLines("somefile.txt", n = 1))))
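
To see what this step produces without needing the file, here is a small sketch that applies the same `scan()`, `tolower()` and `gsub()` calls to the header line as a literal string. The variable names `header_line`, `raw_names` and `swapped` are hypothetical.

```r
# The header line from the sample file, held as a plain string
header_line <- "Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three"

# Split on commas and lowercase; scan() returns plain character strings
raw_names <- tolower(scan(text = header_line, what = "", sep = ",", quiet = TRUE))

# Swap "prefix_stub" to "stub_prefix"; names without "_" are left unchanged
swapped <- gsub("(.*)_(.*)", "\\2_\\1", raw_names)
swapped
# [1] "inc"     "one_a"   "two_a"   "three_a" "one_b"   "two_b"   "three_b"
```

Because the header is read as text rather than as part of a data frame, the strings can be passed straight to tolower() and other string functions.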

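Second, use read.mtable to read the data in. We create the chunks of data by identifying all the lines that start with a letter. You can use a more specific regular expression if that doesn't match your actual data.

This will create a list of data.frames, so we use do.call(rbind, ...) to put it together into a single data.frame: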

theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")

theData <- setNames(do.call(rbind, theData), theNames)
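
If installing a GitHub-only package is not an option, roughly the same result might be obtained with base R alone. The sketch below is a hypothetical stand-in, not how read.mtable is implemented: it assumes every header line starts with a letter and every data line does not, and it holds the file contents in a character vector in place of `readLines("somefile.txt")`.

```r
# Stand-in for readLines("somefile.txt")
lines <- c('Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
           '1,1,1.5,"5 Things",2,2.5,"10 Things"',
           '2,5,5.5,"10 Things",6,6.5,"20 Things"',
           'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three',
           '3,9,9.5,"15 Things",10,10.5,"30 Things"')

# Assumption: a line starting with a letter is a (repeated) header line
is_header <- grepl("^[A-Za-z]", lines)

# Build the lowercase, reordered names from the first header line
first_header <- lines[which(is_header)[1]]
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
                 tolower(strsplit(first_header, ",")[[1]]))

# Parse only the data lines and attach the names
theData <- read.csv(text = lines[!is_header], header = FALSE,
                    col.names = theNames, stringsAsFactors = FALSE)
theData
```

Because the header lines never reach read.csv(), the numeric columns are parsed as numbers directly and no later type conversion is needed.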

Here's what the data now look like:

theData
#                                               inc one_a two_a   three_a one_b two_b   three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1   1     1   1.5  5 Things     2   2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2   2     5   5.5 10 Things     6   6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three     3     9   9.5 15 Things    10  10.5 30 Things

From here, you can use merged.stack from "splitstackshape".

merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
#    inc .time_1 one  two     three
# 1:   1       a   1  1.5  5 Things
# 2:   1       b   2  2.5 10 Things
# 3:   2       a   5  5.5 10 Things
# 4:   2       b   6  6.5 20 Things
# 5:   3       a   9  9.5 15 Things
# 6:   3       b  10 10.5 30 Things

...or reshape from base R:

reshape(theData, direction = "long", idvar = "inc", 
        varying = 2:ncol(theData), sep = "_")
#     inc time one  two     three
# 1.a   1    a   1  1.5  5 Things
# 2.a   2    a   5  5.5 10 Things
# 3.a   3    a   9  9.5 15 Things
# 1.b   1    b   2  2.5 10 Things
# 2.b   2    b   6  6.5 20 Things
# 3.b   3    b  10 10.5 30 Things
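
To get from the reshape() result to the CSV layout requested in the question, the time column can be named `label` via `timevar`, and the compound row names can be dropped. This is a sketch under assumptions: `theData` is rebuilt inline so the block is self-contained, and the output file name `stacked.csv` in a temporary directory is hypothetical.

```r
# Wide data in the "stub_prefix" naming used above
theData <- data.frame(inc = 1:3,
                      one_a = c(1, 5, 9), two_a = c(1.5, 5.5, 9.5),
                      three_a = c("5 Things", "10 Things", "15 Things"),
                      one_b = c(2, 6, 10), two_b = c(2.5, 6.5, 10.5),
                      three_b = c("10 Things", "20 Things", "30 Things"),
                      stringsAsFactors = FALSE)

# timevar = "label" names the prefix column as the question asks
long <- reshape(theData, direction = "long", idvar = "inc",
                varying = 2:ncol(theData), sep = "_", timevar = "label")

# Replace the "1.a"-style row names with plain sequence numbers
rownames(long) <- NULL
long

# Write the stacked table; the path is a placeholder
out_file <- file.path(tempdir(), "stacked.csv")
write.csv(long, out_file, row.names = FALSE)
```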
