
Extract JSON data from the rows of an R data frame

I have a data frame where the values of the column Parameters are JSON data:

#  Parameters
#1 {"a":0,"b":[10.2,11.5,22.1]}
#2 {"a":3,"b":[4.0,6.2,-3.3]}
...

I want to extract the parameters of each row and append them to the data frame as columns A, B1, B2 and B3.

How can I do it?

I would rather use dplyr if that is possible and efficient.

In your example data, each row contains a JSON object. This format is called JSON Lines (also known as ndjson), and the jsonlite package has a dedicated function, stream_in, that parses such data into a data frame:

# Example data
mydata <- data.frame(parameters = c(
  '{"a":0,"b":[10.2,11.5,22.1]}',
  '{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)

# Parse json lines
res <- jsonlite::stream_in(textConnection(mydata$parameters))

# Extract columns
a <- res$a
b1 <- sapply(res$b, "[", 1)
b2 <- sapply(res$b, "[", 2)
b3 <- sapply(res$b, "[", 3)

In your example, the JSON structure is fairly simple, so the other suggestions work as well, but this solution generalizes to more complex JSON structures.
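To get the columns A, B1, B2 and B3 onto the original data frame, as the question asks, the extracted vectors can be attached with dplyr's mutate. This is a minimal sketch that reuses the example data and the sapply extraction from above; it assumes every "b" array has exactly three elements, and the name result is only illustrative.

```r
library(jsonlite)
library(dplyr)

# Example data (same as above)
mydata <- data.frame(parameters = c(
  '{"a":0,"b":[10.2,11.5,22.1]}',
  '{"a":3,"b":[4.0,6.2,-3.3]}'
), stringsAsFactors = FALSE)

# Parse the JSON lines; verbose = FALSE suppresses the progress messages
res <- stream_in(textConnection(mydata$parameters), verbose = FALSE)

# Append the parsed values as new columns.
# Assumption: each "b" array has three elements.
result <- mydata %>%
  mutate(
    A  = res$a,
    B1 = sapply(res$b, "[", 1),
    B2 = sapply(res$b, "[", 2),
    B3 = sapply(res$b, "[", 3)
  )
```

If the arrays can differ in length, indexing past the end of a shorter array yields NA for that row instead of an error.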

I had a similar problem: several variables in my data frame were JSON objects, and many of the values were NA, but I did not want to remove the rows containing NAs. I wrote a function that takes a data frame, the name of an ID column in that data frame (usually a record ID), and the name of the variable to parse, both as quoted strings. The function creates two subsets, one for the records that contain JSON objects and one that keeps track of the records where the same variable is NA. It then joins those two data frames and combines the result with the original data frame, replacing the former variable.

Perhaps it will help you or someone else, as it has worked for me in a few cases now. I have not cleaned it up much, so I apologize if the variable names are confusing; it was a very ad-hoc function I wrote for work. I should also say that I used another poster's idea for replacing the former variable with the new variables created from the JSON object. You can find it here: Add (insert) a column between two columns in a data.frame

One last note: there is a package called tidyjson that would have offered a simpler solution, but apparently it cannot work with list-type JSON objects. At least that is my interpretation.

library(jsonlite)
library(stringr)
library(dplyr)

parse_var <- function(df, id, var) {
  m <- df[[var]]
  n <- df[[id]]
  is_na <- is.na(m)

  # Values and record IDs for the rows that contain JSON.
  # Logical indexing is used because x[-which(cond)] drops every element
  # when no element matches cond (for example, when there are no NAs).
  p <- m[!is_na]
  key <- n[!is_na]

  # Create a data frame for the rows that are NA
  key_na <- n[is_na]
  q <- m[is_na]
  parse_df_na <- data.frame(key_na, q, stringsAsFactors = FALSE)

  # Parse the JSON values and bind them together into a data frame
  p <- lapply(p, function(x) {
    fromJSON(x) %>% data.frame(stringsAsFactors = FALSE)
  }) %>% bind_rows()

  # Bind the record IDs of the JSON values to the parsed data frame
  parse_df <- data.frame(key, p, stringsAsFactors = FALSE)

  # The new variables begin with a capital "X", so replace that prefix
  # with the former variable's name, then name the columns
  n <- names(parse_df) %>% str_replace("^X", paste0(var, "."))
  n <- n[2:length(n)]
  colnames(parse_df) <- c(id, n)

  # Join the data frame of NA rows with the data frame of parsed JSON values
  parse_df <- merge(parse_df, parse_df_na, by.x = id, by.y = "key_na", all = TRUE)

  # merge() sorts by the key, so restore the row order of df
  # (this assumes the id values are unique)
  parse_df <- parse_df[match(df[[id]], parse_df[[id]]), , drop = FALSE]

  # Remove the placeholder column q that came from the NA rows
  parse_df <- parse_df[, names(parse_df) != "q", drop = FALSE]

  # Insert the parsed columns right after the variable being parsed,
  # then drop the original variable
  new_cols <- parse_df[, names(parse_df) != id, drop = FALSE]
  new_df <- data.frame(append(df, new_cols, after = which(names(df) == var)), stringsAsFactors = FALSE)
  new_df <- new_df[, names(new_df) != var, drop = FALSE]
  return(new_df)
}
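For flat JSON objects, the same idea (parse the rows that hold JSON, keep the NA rows, and put everything back next to the record ID) can be sketched more compactly. This is an illustrative alternative, not the function above: the data frame df and its columns id and params are hypothetical, and it assumes each JSON object parses to a single row of scalar fields.

```r
library(jsonlite)
library(dplyr)

# Hypothetical example data: the second record has no JSON
df <- data.frame(
  id = 1:3,
  params = c('{"x":1,"y":"a"}', NA, '{"x":2,"y":"b"}'),
  stringsAsFactors = FALSE
)

# Build one single-row data frame per record.
# NA records contribute only their id, so bind_rows() fills
# the parsed columns with NA for them.
parsed <- lapply(seq_len(nrow(df)), function(i) {
  if (is.na(df$params[i])) {
    data.frame(id = df$id[i])
  } else {
    data.frame(id = df$id[i], fromJSON(df$params[i]), stringsAsFactors = FALSE)
  }
}) %>% bind_rows()

# Replace the JSON column with the parsed columns, matched by id
result <- df %>%
  select(-params) %>%
  left_join(parsed, by = "id")
```

Because bind_rows matches columns by name, records whose JSON objects have different keys still line up, with NA where a key is absent.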
