
In R transpose and combine multiple dataframes with missing data and blank column names / rename melted columns prior to dcast

I have searched and found many solutions that came close, but none quite worked in the end. This is probably something very simple for those with experience.

Here is a snippet of my data. It was created automatically from a JSON import by the jsonlite package. The data is very nicely structured, but I am nevertheless helpless. Update 2: I have added the relevant data below.

    structure(list(rightsize = c(42L, 50L, 52L, 49L, 41L, 41L, 41L, 
41L, 41L, 45L, 47L, 42L, 45L, 46L, 42L, 44L, 44L, 37L, 44L, 41L
), hitlen = c("", "", "", "", "", "", "", "", "", "", "", "", 
"", "", "", "", "", "", "", ""), linegroup = c("_", "_", "_", 
"_", "_", "_", "_", "_", "_", "_", "_", "_", "_", "_", "_", "_", 
"_", "_", "_", "_"), leftsize = c(46L, 43L, 43L, 37L, 49L, 43L, 
43L, 45L, 45L, 43L, 44L, 46L, 45L, 46L, 44L, 43L, 54L, 45L, 51L, 
47L), leftspace = c("        ", "           ", "           ", 
"                 ", "     ", "           ", "           ", "         ", 
"         ", "           ", "          ", "        ", "         ", 
"        ", "          ", "           ", "", "         ", "   ", 
"       "), Left = list(structure(list(class = c("", "coll", 
""), str = c("patients with ", "chronic", " obstructive pulmonary"
)), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("respect to ", 
"chronic", " obstructive pulmonary")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = c("", "coll", ""), str = c("While there is no cure for this ", 
    "chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "strc", "", "coll", ""), str = c(".", 
"</p><p>", "When patients with ", "chronic", " liver")), .Names = c("class", 
"str"), class = "data.frame", row.names = c(NA, 5L)), structure(list(
    class = c("", "coll", ""), str = c("bronchitis , and ", "chronic", 
    " obstructive pulmonary")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("offers the possibility that ", 
"chronic", " lung")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c(" , such as ", 
"chronic", " obstructive pulmonary")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = c("", "coll", ""), str = c("always as clear in other ", 
    "chronic", " incurable")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("may have the potential to prevent ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c(" half the estimated cost of all ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("is consistent with the tact that ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("used to treat ", 
"chronic", " obstructive pulmonary")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = c("", "coll", ""), str = c("ingredient for dietary therapy of ", 
    "chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("patients with ", 
"chronic", " obstructive pulmonary")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = c("", "coll", ""), str = c("greater for ", "chronic", 
    " obstructive pulmonary")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c(" departments , with schemes for ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("postponement of death by means of managing ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("certainly be ", 
"chronic", " obstructive pulmonary")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = c("", "coll", ""), str = c("cardiovascular disease , cancer , other ", 
    "chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = c("", "coll", ""), str = c("terminal illnesses are converted to ", 
"chronic", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L))), Right = list(structure(list(class = "", str = " who may be at risk of developing steroid"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " - plausibly related to exposure to environmental"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " , it can be treated , Black says . Antidepressants"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " ask what they can do to improve their condition"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " [ COPD ] ) was 15 % ( estimated within "), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " is part of the continuum of development"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " ( 70 , 71 ) and sleep apnea . Elevation"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " . Patients with heart failure highlight"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " other than heart disease , and helps us"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " in this country . Furthermore , the portion"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " are multigenic and multifactorial . Therefore"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " . Nasal corticosteroids are increasingly"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " such as diabetes mellitus or hyperlipidemia"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " ( COPD ) concluded exercise relieves dyspnea"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " than for any other disease. 5 The number"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " management in patients with COPD receiving"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " and disability is costly , and it is bound"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = c("", "strc", ""), str = c(" .", "</p><p>", "Much rarer condition , but people"
    )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = "", str = " , and in fact those rates have been rising"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " . The panel 's report is negative about"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L)), Kwic = list(structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " disease"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = "diseases"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L)), toknum = c(580661252L, 
585871494L, 572902309L, 596182644L, 611091300L, 604962106L, 605346237L, 
585102838L, 575701411L, 616556239L, 548908661L, 604489309L, 548601059L, 
617460845L, 585870185L, 591049175L, 581965276L, 592616458L, 592591831L, 
599295354L), rightspace = c("          ", "  ", "", "   ", "           ", 
"           ", "           ", "           ", "           ", "       ", 
"     ", "          ", "       ", "      ", "          ", "        ", 
"        ", "               ", "        ", "           "), Tbl_refs = list(
    "11.99.0023.006", "11.99.0031.001", "11.99.0012.004", "11.99.0046.013", 
    "11.99.0069.003", "11.99.0059.007", "11.99.0060.003", "11.99.0030.001", 
    "11.99.0016.007", "11.99.0077.021", "11.01.0003.015", "11.99.0059.003", 
    "11.01.0003.006", "11.99.0078.034", "11.99.0031.001", "11.99.0038.005", 
    "11.99.0025.005", "11.99.0040.006", "11.99.0040.006", "11.99.0051.011"), 
    ref = c("11.99.0023.006", "11.99.0031.001", "11.99.0012.004", 
    "11.99.0046.013", "11.99.0069.003", "11.99.0059.007", "11.99.0060.003", 
    "11.99.0030.001", "11.99.0016.007", "11.99.0077.021", "11.01.0003.015", 
    "11.99.0059.003", "11.01.0003.006", "11.99.0078.034", "11.99.0031.001", 
    "11.99.0038.005", "11.99.0025.005", "11.99.0040.006", "11.99.0040.006", 
    "11.99.0051.011")), .Names = c("rightsize", "hitlen", "linegroup", 
"leftsize", "leftspace", "Left", "Right", "Kwic", "toknum", "rightspace", 
"Tbl_refs", "ref"), class = "data.frame", row.names = c(NA, 20L
))

What I need to do is 1) transpose these 4 dataframes and assign the values in "class" to be the column headers. Note #1: the number of columns may differ. Note #2: some of the column names will be "". As such, the wonderful solution here results in dataframes in which some column headings are filled with junk, making the next step (dataframe merging) impossible, e.g.:

  1. ""
  2. strc
  3. structure("When patients with ", class = "AsIs")
  4. coll
  5. structure(" liver", class = "AsIs")

(The junk-filled headers seem to be the ones that were "", beyond the first.)

Following that step, I would then need to merge these dataframes whilst accounting for missing values. rbind.fill does the trick, but only when the data is sufficiently uniform. I have searched high and low for a solution, and have yet to find one that sufficiently addresses this issue.
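For reference, the transpose-and-bind step described above can be sketched in base R roughly as follows. This is only an illustrative sketch on two toy pieces copied from the data above. The helpers to_row and bind_fill and the placeholder name "blank" are hypothetical, and bind_fill is a simplified stand-in for the fill behaviour of rbind.fill, not plyr's implementation.

```r
# Two toy pieces copied from the "Left" column shown above.
left1 <- data.frame(class = c("", "coll", ""),
                    str = c("patients with ", "chronic", " obstructive pulmonary"),
                    stringsAsFactors = FALSE)
left4 <- data.frame(class = c("", "strc", "", "coll", ""),
                    str = c(".", "</p><p>", "When patients with ", "chronic", " liver"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: turn one class/str piece into a one-row data frame.
# Blank classes get the placeholder "blank"; make.unique() numbers repeats.
to_row <- function(d) {
  nm <- make.unique(ifelse(d$class == "", "blank", d$class))
  as.data.frame(as.list(setNames(d$str, nm)), stringsAsFactors = FALSE)
}

# Hypothetical helper: bind rows, filling columns a piece lacks with NA.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA_character_
    d[all_cols]
  })
  do.call(rbind, filled)
}

combined <- bind_fill(lapply(list(left1, left4), to_row))
```

Because the blank headers are numbered by position, the text before "coll" lands in blank for the first row but in blank.1 for the second, so the columns do not line up around "coll" when the number of pieces differs.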

Update: I have continued to experiment with melt/cast. The following brings me very close to an acceptable final solution:

require(reshape2)
docx <- melt(documentdata$Left, id.vars = c("class"))
docx <- dcast(docx, L1 + variable ~ class, fun.aggregate=list)

The only problem is, as mentioned, that the blank "class" causes the structure to be lost upon dcast: all of the unnamed columns wind up merged and out of order, e.g.:

    L1  variable    Var.3   coll    strc
1    1  str patients with ,  obstructive pulmonary  chronic  
2    2  str respect to ,  obstructive pulmonary chronic  
3    3  str While there is no cure for this ,   chronic  
4    4  str ., When patients with ,  liver  chronic </p><p>
5    5  str bronchitis , and ,  obstructive pulmonary   chronic  

The key "class" in the original data is "coll", which always has at least one blank before and one blank after. One solution might be to create names "pre-coll" and "post-coll" prior to dcast?

Update #3: here's one possible, albeit ugly, solution. Any "cleaner" options?

require(reshape2)
docx <- melt(documentdata$Left, id.vars = c("class"))
pre <- which(docx$class %in% c("coll")) - 1
post <- which(docx$class %in% c("coll")) + 1
docx$class[pre] = "l.pre"
docx$class[post] = "l.post"
docx <- dcast(docx, L1 + variable ~ class, fun.aggregate=list)
docx.left <- docx[, c("l.pre", "coll", "l.post")]

Thanks in advance for the help.

Let's do it with dplyr:

library(dplyr)
documentdata$Left %>% do.call(rbind, .) %>%
                      do(data.frame(pre = .[["str"]][which(.[["class"]]=="coll")-1],
                                    coll = .[["str"]][which(.[["class"]]=="coll")], 
                                    post = .[["str"]][which(.[["class"]]=="coll")+1]))

                                           pre    coll                   post
1                               patients with  chronic  obstructive pulmonary
2                                  respect to  chronic  obstructive pulmonary
3             While there is no cure for this  chronic                       
4                          When patients with  chronic                  liver
5                            bronchitis , and  chronic  obstructive pulmonary
6                 offers the possibility that  chronic                   lung
....
18                               certainly be  chronic  obstructive pulmonary
19    cardiovascular disease , cancer , other  chronic                       
20        terminal illnesses are converted to  chronic  

EDIT: an explanation: dplyr has a weird syntax. See the dplyr vignette or the data wrangling cheat sheet. The `%>%` is the pipe from the magrittr package; it simply passes the output of everything on the left of the pipe as the first argument of the function on the right:

5 %>% c(1)
#same as
c(5, 1) 

You can use the `.` to represent the stuff on the left if you want to use it somewhere else instead. You can subset the `.` if you like (e.g. the `.[["str"]]`):

5 %>% c(1, .)
#same as
c(1, 5)
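Subsetting the placeholder can be sketched on a toy data frame like this. The object piece is a hypothetical stand-in for one element of documentdata$Left, and the example assumes the magrittr package is installed.

```r
library(magrittr)  # provides the %>% pipe

# Hypothetical stand-in for one element of documentdata$Left.
piece <- data.frame(class = c("", "coll", ""),
                    str = c("patients with ", "chronic", " obstructive pulmonary"),
                    stringsAsFactors = FALSE)

# Subset the placeholder directly: equivalent to piece[["str"]].
strs <- piece %>% .[["str"]]

# Inside braces the left-hand side is used only where `.` appears.
hit <- piece %>% { .[["str"]][.[["class"]] == "coll"] }
```

Here strs holds the three strings of the str column and hit holds the single string whose class is "coll".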

`do` allows us to do any computation we want without worrying about the standard dplyr verbs; it's a wrapper. See `?do`.

So the answer takes `documentdata$Left` and pipes it into `do.call(rbind, .)`, which collapses the list (so far this is the same as `do.call(rbind, documentdata$Left)`). Then we pipe that to `do`, which makes a new data frame with the relevant columns selected from the `.`.
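For comparison, roughly the same result can be sketched in base R without the pipe. This is a variant, not the code above: it picks the neighbours of "coll" inside each list element rather than after stacking them. The list left is a toy stand-in for documentdata$Left, context_row is a hypothetical helper, and the sketch assumes each element has exactly one "coll" row with a row before and after it, as the question states.

```r
# Toy stand-in for documentdata$Left: a list of class/str data frames.
left <- list(
  data.frame(class = c("", "coll", ""),
             str = c("patients with ", "chronic", " obstructive pulmonary"),
             stringsAsFactors = FALSE),
  data.frame(class = c("", "strc", "", "coll", ""),
             str = c(".", "</p><p>", "When patients with ", "chronic", " liver"),
             stringsAsFactors = FALSE)
)

# Hypothetical helper: one row holding the text before, at, and after "coll".
# Assumes "coll" is never the first or last row of a piece.
context_row <- function(d) {
  i <- which(d$class == "coll")[1]
  data.frame(pre = d$str[i - 1], coll = d$str[i], post = d$str[i + 1],
             stringsAsFactors = FALSE)
}

result <- do.call(rbind, lapply(left, context_row))
```

Working per element keeps the index arithmetic from reaching into a neighbouring piece, at the cost of one small data frame per element.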
