Nested, hierarchical data frames in R

I am new to R and I don't want to misunderstand the language and its data structures from the start. :)

My data.frame sample.data contains, besides 'normal' attributes (e.g. author), a nested list of data.frames (files), each of which has attributes such as extension.

How can I filter for authors who have created files with a certain extension? Is there an R-ic way of doing that? Maybe in this direction:

t <- subset(data, data$files[['extension']] > '.R')

Actually I want to avoid for loops.

Here you can find some sample data:

d1 <- data.frame(extension=c('.py', '.py', '.c++')) # and some other attributes
d2 <- data.frame(extension=c('.R', '.py')) # and some other attributes

sample.data <- data.frame(author=c('author_1', 'author_2'), files=I(list(d1, d2)))

The JSON the sample.data comes from looks like:

[
    {
        "author": "author_1",
        "files": [
            {
                "extension": ".py",
                "path": "/a/path/somewhere/"
            },
            {
                "extension": ".c++",
                "path": "/a/path/somewhere/else/"
            }, ...
        ]
    }, ...
]

There are at least a dozen ways of doing this, but if you want to learn R right, you should learn the standard ways of subsetting data structures, especially atomic vectors, lists and data frames. This is covered in chapter two of this book:


There are other great books, but this is a good one, and it is online and free.
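As a small illustration of those subsetting operators, here is a sketch applied to the sample data from the question. The names second and r.files are only for illustration.

```r
# Sample data from the question.
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# `$` extracts one column; here it is a list holding one data frame per author.
files <- sample.data$files

# `[[` extracts a single list element: the files data frame of the second author.
second <- files[[2]]

# `[` with a logical vector keeps only the matching rows of that data frame.
r.files <- second[second$extension == ".R", , drop = FALSE]
```

The same three operators (`$`, `[[` and `[`) are what the other approaches in this thread build on.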

UPDATE: Okay, this converts your JSON to a list of data frames.

library(rjson)  # assumption: fromJSON() here is rjson's, which returns nested lists

s <- paste(c(
'[{',
'  "author": "author_1",',
'  "files": [',
'    {',
'      "extension": ".py",',
'      "path": "/a/path/somewhere/"',
'    },',
'    {',
'      "extension": ".c++",',
'      "path": "/a/path/somewhere/else/"',
'    }]',
'},',
'{',
'  "author": "author_2",',
'  "files": [',
'    {',
'      "extension": ".py",',
'      "path": "/b/path/somewhere/"',
'    },',
'    {',
'      "extension": ".c++",',
'      "path": "/b/path/somewhere/else/"',
'    }]',
'}]'),
collapse = "")

j <- fromJSON(s)

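# Turn one author entry (a list with $author and $files) into a data frame
# with one row per file.
todf <- function (x) {
    nrow <- length(x$files)
    vext <- sapply(x$files, function (y) y[["extension"]])
    vpath <- sapply(x$files, function (y) y[["path"]])
    data.frame(author=rep(x$author,nrow), ext=vext, path=vpath)
}
listdf <- lapply(j,todf)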

Which yields:

    author  ext                    path
1 author_1  .py      /a/path/somewhere/
2 author_1 .c++ /a/path/somewhere/else/

    author  ext                    path
1 author_2  .py      /b/path/somewhere/
2 author_2 .c++ /b/path/somewhere/else/

And to finish the task, merge and subset:

   mdf <- do.call("rbind", listdf)
   mdf[ mdf$ext==".py", ]


    author ext               path
1 author_1 .py /a/path/somewhere/
3 author_2 .py /b/path/somewhere/


subset(sample.data, sapply(files, function(df) any(df$extension == ".R")))
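To see why that one-liner works, it may help to split it into steps. This is only a sketch on the question's sample data; has.R, r.authors and has.any are illustrative names.

```r
# Sample data from the question.
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# Step 1: one TRUE/FALSE per author, TRUE if any of their files is a .R file.
has.R <- sapply(sample.data$files, function(df) any(df$extension == ".R"))

# Step 2: use that logical vector to keep the matching rows (authors).
r.authors <- sample.data[has.R, ]

# Variation: %in% tests against several extensions at once.
has.any <- sapply(sample.data$files,
                  function(df) any(df$extension %in% c(".R", ".c++")))
```

No explicit for loop is needed: sapply() applies the test to each nested data frame and collects the results into a vector.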

Assuming your data frame df, as a CSV, looks like:


then the easiest solution is to use the dplyr package:

library(dplyr)
filter(df, author == "john" & extension == "txt")
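That approach needs a flat table with one row per file. As a rough sketch, assuming the nested sample.data from the question, such a table could be built in base R like this (flat, n.files and r.authors are illustrative names):

```r
# Sample data from the question.
d1 <- data.frame(extension = c(".py", ".py", ".c++"))
d2 <- data.frame(extension = c(".R", ".py"))
sample.data <- data.frame(author = c("author_1", "author_2"),
                          files = I(list(d1, d2)))

# Number of files per author, used to repeat each author name.
n.files <- sapply(sample.data$files, nrow)

# One row per file: stack the nested data frames and add the author column.
# unclass() drops the AsIs wrapper so a plain list is passed to rbind().
flat <- data.frame(
  author = rep(sample.data$author, times = n.files),
  do.call(rbind, unclass(sample.data$files))
)

# Filter the flat table, then keep each matching author once.
r.authors <- unique(flat$author[flat$extension == ".R"])
```

Once the data is flat, either base subsetting as above or a dplyr filter() on the author and extension columns can be used.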

I guess the grep() function in the base package could be your solution:

files <- data.frame(path = paste0("path", 1:3),
                    extension = c(".R", ".csv", ".R"),
                    creation.date = Sys.Date() + 1:3)

> files
# path extension creation.date
# 1 path1        .R    2015-07-15
# 2 path2      .csv    2015-07-16
# 3 path3        .R    2015-07-17

> files[grep("^\\.R$", files$extension),]
# path extension creation.date
# 1 path1        .R    2015-07-15
# 3 path3        .R    2015-07-17
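One thing to keep in mind with grep(): by default the pattern is a regular expression, in which an unescaped dot matches any character. The sketch below compares a few ways of matching on a made-up vector of extensions (the vector and the variable names are only for illustration):

```r
# Hypothetical extensions, chosen to show the differences between the patterns.
extension <- c(".R", ".csv", ".Rmd", "aR")

# Regular expression: "." matches any character, so "aR" and ".Rmd" match too.
loose <- grepl(".R", extension)                  # TRUE FALSE TRUE TRUE

# fixed = TRUE matches the literal text ".R", but still as a substring.
literal <- grepl(".R", extension, fixed = TRUE)  # TRUE FALSE TRUE FALSE

# Escaped dot plus anchors: matches only the exact extension ".R".
anchored <- grepl("^\\.R$", extension)           # TRUE FALSE FALSE FALSE

# For an exact match, plain equality does the same without any pattern.
exact <- extension == ".R"                       # TRUE FALSE FALSE FALSE
```

For a plain "is exactly this extension" test, == or %in% is usually the simplest choice; grep() and grepl() are more useful when a real pattern is needed.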

