简体   繁体   中英

Dealing with multivalued attributes in R by creating dataframe columns for each attribute

I have a data set based on events and each event has attributes in JSON format, so for example, a simplified version of the data:

 id   event        attribute
 1    23       {'grades':43, 'school':'primary'}
 2    49       {}
 3    99       {'x':49, 'y':52, 'country':'Japan'}
 4    89       {'grades':56}

the attributes are multivalued, and each row has different numbers of attributes. I am guessing that R is probably not the best way to deal with this kind of data, usually I would have an 'attributes' table separately in SQL and join on the event ID to get the attributes and their values. I am wondering if there is an established way of dealing with this problem in R though. I want a way to represent this data so that I can summarise it and group events with the same kind of attributes to compare their values

update following the suggestion, I'd like to know if there is a straight forward way of getting the result

d = data.frame(id = 1:4, 
                event =c(23, 49, 99, 89), 
                grades = c(43, NA, NA, 56), 
                school=c("primary", NA, NA, NA))

without manually inputting it

second/third update

I've written this, which seems to work, so I thought i'd share, if there's an easier way to do it please let me know:

    library(jsonlite)

#data input
    id <- 1:4
    event <- c(23,49,99,89)
    attribute <- c("{'grades':43, 'school':'primary'}", "{}", "{'x':49, 'y':52, 'country':'Japan'}", "{'grades':56}")

#format for fromJSON
    attribute <- gsub("'", '"', attribute)
    att <- lapply(attribute, fromJSON)

#distinct attributes
    att_names <- unique(unlist(lapply(att, names)))

#store output in list list_atts
    list_atts <- list()

    for(i in 1:length(att_names)){
            j <- lapply(att, "[", paste(att_names[i]))
            j <- lapply(j, function(x) ifelse(is.null(unlist(x)) == TRUE, NA, unlist(x))) # convert NULL to NA
            list_atts[[i]] <- unlist(j)
            names(list_atts)[i] <- paste(att_names[i])
    }

The output here:

> data.frame(list_atts, stringsAsFactors = FALSE)
  grades  school  x  y country
1     43 primary NA NA    <NA>
2     NA    <NA> NA NA    <NA>
3     NA    <NA> 49 52   Japan
4     56    <NA> NA NA    <NA>

In an R data frame, each row should correspond to a person/thing each column should be a variable. So in your data set above, you want something like

dd = data.frame(id = 1:4, 
                event =c(23, 49, 99, 89), 
                grades = c(43, NA, NA, 56), 
                school=c("primary", NA, NA, NA))

where NA is a missing value.


Small update following comment:

  1. If each row is "similar" then this is the suggested approach. It means all the standard algorithms and plots will just work. If you have a large number of attributes, then it depends on what large is. Specifically, does it cause you memory/speed problems? If not, don't worry about. If so, do you really need all the attributes?

  2. For handling json data, see packages like jsonlite

You could try:

library(dplyr)
library(tidyr)
df %>% 
  mutate(to = strsplit(attribute, ",")) %>%
  unnest(to) %>%
  separate(to, into = c("l", "v"), sep = ":") %>%
  mutate_at(vars(l, v), funs(gsub("[^[:alnum:]]", "", .))) %>%
  spread(l, v, sep = "_") %>%
  select(-attribute, -l_)

Which gives:

#  id event l_country l_grades l_school  l_x  l_y
#1  1    23      <NA>       43  primary <NA> <NA>
#2  2    49      <NA>     <NA>     <NA> <NA> <NA>
#3  3    99     Japan     <NA>     <NA>   49   52
#4  4    89      <NA>       56     <NA> <NA> <NA>

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM