简体   繁体   中英

Combine JSON rows with different number of columns, some unlabled, using either melt or dplyr in R

This is related to a previous question. However, the problem has evolved. I have JSON data that is in three columns: "Left", "Kwic", and "Right". Columns "Left" and "Right" are sometimes further subdivided. This subdivision is denoted in the JSON file as "class". However, this "class" is often unlabeled. In the subdivided columns, there will always be a class named "coll".

The excellent solution previously presented was to grab the "pre" and "post" columns and rename them for inclusion into the dataframe. However, now we have a mixture of columns, some are subdivided and some are not.

What I would like to do is take the undivided data and add it to the center column "coll". This applies for both the Left and Right divisions. Right now, however, I can only capture these as a separate columns. I have tried various things with both melt and dplyr2, but to no avail.

Data:

structure(list(Left = list(structure(list(class = "", str = " children tend to view authority figures"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "children have a computer . Wireless resources"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "unclear if increases in physical activity are"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "filtration pressure . Where recurrent disease is"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = c("", "strc", ""), str = c("multiply .", "</p><p>", 
    "When nevirapine is no longer")), .Names = c("class", "str"
), class = "data.frame", row.names = c(NA, 3L)), structure(list(
    class = "", str = "white . We don't provide enough services ,"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = ", a sexually transmitted infection , are"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "continuous lowgrade itching and linear lesions"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = " radiation oncology community is largely"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "the variability in response time that was"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "incremental cost effectiveness ratio that is"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "Through the use of warming , acrid herbs"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "start using tobacco : psychosocial factors"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "determining the severity because the fetus was"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = c("", "coll", ""), str = c("This occurred despite the ", 
    "significantly", " ")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
3L)), structure(list(class = "", str = "mission to eliminate the suffering and death"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "are more likely to be present , or to be"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "demonstrated primarily pulmonary signs and symptoms"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "criminal involvement . These findings are"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "", str = "model . There is a danger in using herbs"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L)), Kwic = list(structure(list(
    class = "col0 coll", str = " such"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " such"), .Names = c("class", 
"str"), class = "data.frame", row.names = 1L), structure(list(
    class = "col0 coll", str = " due"), .Names = c("class", "str"
), class = "data.frame", row.names = 1L), structure(list(class = "col0 coll", 
    str = " responsible"), .Names = c("class", "str"), class = "data.frame", row.names = 1L), 
    structure(list(class = "col0 coll", str = " present"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " such"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " responsible"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " consistent"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " responsible"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " due"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " less"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " such"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " such"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " less"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = "higher"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " due"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " present"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " such"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " consistent"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = "col0 coll", str = " such"), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L)), Right = list(
    structure(list(class = c("", "coll", ""), str = c(" ", "as", 
    " physicians and parents as legitimate")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "as", " radio / CD headsets , handheld televisions"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "to", " the physical environment itself , or")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "for", " blockage of lymphatic collaterals ,"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "in", " the blood , the HIV strains that are")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", "", "strc", ""), str = c(" ", "as", 
        " Spanish services . \"", "</p><p>", "She admits")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 5L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "for", " the majority of cervical cancer cases"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", "", "strc", ""
    ), str = c(" ", "with", " vigorous scratching .", "</p><p>", 
    "Psoriasis")), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    5L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "for", " having treated hundreds of thousands")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "to", " the distractor-ratio manipulation and"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "than", " £ 30 000 per quality adjusted life")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "as", " aconitum carmichaeli praeparatum ( fu"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "as", " personality or parental role modeling")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "than", " 28 weeks old , and the bilirubin had"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = "", str = " level of psychiatric symptoms observed "), .Names = c("class", 
    "str"), class = "data.frame", row.names = 1L), structure(list(
        class = c("", "coll", "", "strc", ""), str = c(" ", "to", 
        " all cancers by 2015 .", "</p><p>", "The primary")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 5L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "in", " higher numbers , in sputum cultures "
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", "", "strc"), str = c(" ", 
    "as", " wheezing and shortness of breath .", "</p>")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 4L)), structure(list(
        class = c("", "coll", ""), str = c(" ", "with", " those from DeLeon and Jainchill 's"
        )), .Names = c("class", "str"), class = "data.frame", row.names = c(NA, 
    3L)), structure(list(class = c("", "coll", ""), str = c(" ", 
    "as", " mahuang in highly concentrated extracts")), .Names = c("class", 
    "str"), class = "data.frame", row.names = c(NA, 3L)))), .Names = c("Left", 
"Kwic", "Right"), class = "data.frame", row.names = c(NA, 20L
))

The key challenge is that some of columns are unlabelled, but can be identified based on the the structure of the data. The code below generates output that is almost there, but reconciling this small difference has driven me to near madness.

Code:

## generate raw output
documentdata <- document$Lines[, c("Left", "Kwic", "Right")]
documentdata = cbind(documentdata,SeekID=query)

## generate tidied output

## generate left columns
docx <- melt(documentdata$Left, id.vars = c("class"))
pre <- which(docx$class %in% c("coll")) - 1
post <- which(docx$class %in% c("coll")) + 1
docx$class[pre] = "l.pre"
docx$class[post] = "l.post"
docx <- dcast(docx, L1 + variable ~ class, fun.aggregate=list)
names(docx)[names(docx)=="Var.3"] <- "l.full"
names(docx)[names(docx)=="coll"] <- "l.coll"
docx.left <- docx[, c("l.full", "l.pre", "l.coll", "l.post")]

#docx.left <- documentdata$Left %>% do.call(rbind, .) %>%
#  do(data.frame(l.pre = .[["str"]][which(.[["class"]]=="coll")-1],
#                l.coll = .[["str"]][which(.[["class"]]=="coll")], 
#                l.post = .[["str"]][which(.[["class"]]=="coll")+1]))

## generate center columns
docx <- melt(documentdata$Kwic, id.vars = c("class"))
names(docx)[names(docx)=="value"] <- "k.coll"
docx.kwic = docx[, c("k.coll"), drop = FALSE]

## generate right columns
docx <- melt(documentdata$Right, id.vars = c("class"))
post <- which(docx$class %in% c("coll")) + 1
docx$class[post] = "r.post"
docx <- dcast(docx, L1 + variable ~ class, fun.aggregate=list)
names(docx)[names(docx)=="coll"] <- "r.coll"
names(docx)[names(docx)=="Var.3"] <- "r.pre"
docx.right <- docx[, c("r.pre", "r.coll", "r.post")]

## final output
docx.output = cbind(docx.left, docx.kwic, docx.right)
docx.output = cbind(docx.output,SeekID=query)
docx.output <- docx.output[, c("SeekID", "l.full", "l.pre", "l.coll", "l.post", "k.coll", "r.pre", "r.coll", "r.post")]

Corrected Code (supplied by @cgjeremy) [resolved]

## general parsing function that handles "r" and "l" differently
myparse <- function(x, side){
  if(any(x$class=="coll")){
    pre <- x$str[which(x$class=="coll")-1]
    coll <- x$str[which(x$class=="coll")]
    post <- x$str[which(x$class=="coll")+1]
  } else if(side=="l"){
    pre <- paste0(x$str, collapse="")
    coll <- ""
    post <- ""
  } else if(side=="r"){
    pre <- ""
    coll <- ""
    post <- paste0(x$str, collapse="")
  } else {
    pre <- ""
    coll <- ""
    post <- ""
  }
  z <- data.frame(pre, coll, post)
  names(z) <-c(paste0(side, ".pre"), paste0(side, ".coll"), paste0(side, ".post"))
  z
}

## calls parsing function to generate left, mid, and right column-sets
library(dplyr)
left <- documentdata$Left %>% lapply(myparse, side="l") %>%
  do.call(rbind, .)
mid <- do.call(rbind, documentdata$Kwic)$str
right <- documentdata$Right %>% lapply(myparse, side="r") %>%
  do.call(rbind, .)

## combines left, mid, and right columns-sets to generate final output
docx.output <- cbind(left, mid, right)

Thanks in advance for any help or advice,

I'm not exactly sure of your rules for the right hand side, but I think this is what you want.

First let's define a parsing function:

myparse <- function(x, side){
  if(any(x$class=="coll")){
    pre <- x$str[which(x$class=="coll")-1]
    coll <- x$str[which(x$class=="coll")]
    post <- x$str[which(x$class=="coll")+1]
    all <- ""
  } else {
    pre <- ""
    coll <- ""
    post <- ""
    all <- paste0(x$str, collapse="")
  }
  z <- data.frame(pre, coll, post, all)
  names(z) <-c(paste0(side, ".pre"), paste0(side, ".coll"), paste0(side, ".post"), paste0(side, ".all"))
  z
}

This function checks each member of the documentdata$Left or Right, and if any have class=="coll" we split, otherwise we paste everything in the all column.

To run it once, try myparse(documentdata$Left[[1]], side="whatever") you can change the 1 to other members of the list.

Then we can get our left and right using lapply (which takes myparse and applies it to each element of a list), then rbinding the list into a data.frame. Mid is easier:

library(dplyr)
left <- documentdata$Left %>% lapply(myparse, side="l") %>%
                              do.call(rbind, .)
mid <- do.call(rbind, documentdata$Kwic)$str
right <- documentdata$Right %>% lapply(myparse, side="r") %>%
                                do.call(rbind, .)

Then we cbind them together:

cbind(left, mid, right)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM