Nested hierarchical data frames in R

Question

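I am new to R, and I don't want to misunderstand the language and its data structures from the start. :)

My data.frame sample.data contains, besides "normal" attributes (e.g. author), another nested list of data.frames (files), which in turn has attributes such as extension.

How can I filter for the authors who have created files with a certain extension? Is there an R-ic way to do that? Maybe something in this direction: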

t <- subset(data, data$files[['extension']] > '.R')

I would actually like to avoid loops.

Here is some sample data:

d1 <- data.frame(extension=c('.py', '.py', '.c++')) # and some other attributes
d2 <- data.frame(extension=c('.R', '.py')) # and some other attributes

sample.data <- data.frame(author=c('author_1', 'author_2'), files=I(list(d1, d2)))

The JSON that sample.data comes from looks like this:

[
    {
        "author": "author_1",
        "files": [
            {
                "extension": ".py",
                "path": "/a/path/somewhere/"
            },
            {
                "extension": ".c++",
                "path": "/a/path/somewhere/else/"
            }, ...
        ]
    }, ...
]

Answer 1

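There are at least a dozen ways to do this, but if you want to learn R, you should learn the standard ways of subsetting data structures, especially atomic vectors, lists, and data frames. The second chapter of this book covers that: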

http://adv-r.had.co.nz/

There are other great books, but this is a good one, and it is free online.

Update: OK, this converts your JSON into a list of data frames.

library("rjson")
s <- paste(c(
'[{' ,
'  "author": "author_1",',
'  "files": [',
'    {',
'     "extension": ".py",',
'     "path": "/a/path/somewhere/"',
'   },',
'   {',
'     "extension": ".c++",',
'     "path": "/a/path/somewhere/else/"',
'    }]',
'},',
'{',
'"author": "author_2",',
'"files": [',
'  {',
'    "extension": ".py",',
'    "path": "/b/path/somewhere/"',
'  },',
'  {',
'    "extension": ".c++",',
'    "path": "/b/path/somewhere/else/"',
'  }]',
'}]'),collapse="")

j <- fromJSON(s)

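# Convert one author record (a nested list) into a flat data frame
todf <- function(x) {
    n <- length(x$files)
    # Pull the fields out by name rather than by position
    vext <- sapply(x$files, function(y) y$extension)
    vpath <- sapply(x$files, function(y) y$path)
    data.frame(author = rep(x$author, n), ext = vext, path = vpath)
}
listdf <- lapply(j, todf)
listdf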

This yields:

[[1]]
    author  ext                    path
1 author_1  .py      /a/path/somewhere/
2 author_1 .c++ /a/path/somewhere/else/

[[2]]
    author  ext                    path
1 author_2  .py      /b/path/somewhere/
2 author_2 .c++ /b/path/somewhere/else/

And to finish the job, merge and subset:

   mdf <- do.call("rbind", listdf)
   mdf[ mdf$ext==".py", ]

This yields:

    author ext               path
1 author_1 .py /a/path/somewhere/
3 author_2 .py /b/path/somewhere/
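
The same flatten-then-subset idea can also be applied directly to the sample.data from the question, without going back to the JSON. The following is a minimal sketch under that assumption; the names flatten_one, pieces, long, and r_authors are illustrative and not part of the original answer.

```r
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# Attach the author to every row of that author's nested data frame
flatten_one <- function(a, f) cbind(author = a, f)

# One flat data frame per author, then stack them into a single long table
pieces <- Map(flatten_one, as.character(sample.data$author), sample.data$files)
long <- do.call(rbind, unname(pieces))
rownames(long) <- NULL

# Authors with at least one ".R" file
r_authors <- unique(as.character(long$author[long$extension == ".R"]))
r_authors
```

Once the data is in this long form, ordinary row subsetting is enough and no explicit loop is needed.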

Answer 2

Interesting, not many people use R to model hierarchical databases!

subset(sample.data, sapply(files, function(df) any(df$extension == ".R")))
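
To make the one-liner easier to follow, here is a sketch that spells out the intermediate step on the sample data from the question. It swaps sapply for vapply (so the result is guaranteed to be one logical value per row) and uses %in% instead of any(... == ...); both are substitutions for illustration, and has_r and r_authors are hypothetical names.

```r
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# One TRUE/FALSE per author: does the nested data frame contain ".R"?
has_r <- vapply(sample.data$files,
                function(df) ".R" %in% df$extension,
                logical(1))

# Use the logical vector to pick the matching rows
r_authors <- as.character(sample.data$author[has_r])
r_authors
```

The logical vector has one entry per row of sample.data, which is exactly what subset() or `[` needs as a row filter.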

Answer 3

Suppose your data frame df (shown here as CSV) looks like this:

author,path,extension
john,/home/john,txt
mary,/home/mary,png

Then the simplest solution is to use the dplyr package:

library(dplyr)
filter(df, author=="john" & extension=="txt")
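
For a self-contained run of this approach, the sketch below builds df from the CSV text shown in this answer. Note that it assumes the data is already flat, with one row per file; the nested sample.data from the question would have to be flattened first. The name johns_txt is illustrative only.

```r
library(dplyr)

# Build the flat data frame from the CSV text in this answer
df <- read.csv(text = "author,path,extension
john,/home/john,txt
mary,/home/mary,png", stringsAsFactors = FALSE)

# Conditions separated by commas in filter() are combined with AND
johns_txt <- filter(df, author == "john", extension == "txt")
johns_txt
```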

Answer 4

I think the grep() function in the base package might be your solution:

files <- data.frame(path = paste0("path", 1:3), extension = c (".R", ".csv", ".R")
                    , creation.date = c(Sys.Date()+1:3))

> files
# path extension creation.date
# 1 path1        .R    2015-07-15
# 2 path2      .csv    2015-07-16
# 3 path3        .R    2015-07-17


> files[grep(".R", files$extension),]
# path extension creation.date
# 1 path1        .R    2015-07-15
# 3 path3        .R    2015-07-17
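
One caveat worth illustrating: grep() treats its pattern as a regular expression by default, so the dot in ".R" matches any character. The sketch below compares a few alternatives on a made-up vector; ext and its values are hypothetical and chosen only to show the difference.

```r
ext <- c(".R", ".csv", ".Rmd", "xR")  # hypothetical values for illustration

loose    <- grep(".R", ext)                # "." matches any character: 1 3 4
literal  <- grep(".R", ext, fixed = TRUE)  # literal substring ".R":    1 3
anchored <- grep("\\.R$", ext)             # literal dot, R at the end: 1
exact    <- which(ext == ".R")             # plain equality:            1
```

When the extension column holds exact values such as ".R", plain equality or an anchored pattern avoids accidental matches like ".Rmd".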

4 answers

Answer 1: 6, 2015-07-14 08:47:31
Answer 2: 3, accepted, 2015-07-14 09:47:58
Answer 3: 2, 2015-07-14 08:40:49
Answer 4: 1, 2015-07-14 08:37:46
