I am trying to reshape a data frame for more efficient storage and retrieval. Each row contains a "parent" (key) value, which is not unique between rows, and a child value (actually, a set of 3 attributes -- 1 character and 2 numeric). I want to transform this data frame into a list that has just one top-level entry for each unique parent key, and a number of sub-lists as determined by the number of children associated with the parent. Here are some sample data:
pcm <- data.frame(parent = c("middle", "middle", "might", "might",
                             "might", "million", "million", "millions"),
                  child = c("of", "school", "be", "have", "not", "in",
                            "to", "of"),
                  count = c(476, 165, 1183, 619, 321, 490, 190, 269))
The output for this should be a list with 4 top-level elements (named "middle", "might", "million", "millions"), and varying numbers of sub-lists with named members $child and $count (e.g. lookup4[["middle"]] contains sub-lists $children[[1]]$child = "of", $count = 476 and $children[[2]]$child = "school", $count = 165).
The code below works, but is extremely slow (several hours on a 300,000-row data frame using 8 GB RAM). I have imposed a limit of 6 on the number of children in the output data, but it doesn't seem to have made a big difference.
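lookup4 <- list()
# as.character() keeps the keys usable as list names even if parent is a factor
parents <- as.character(unique(pcm$parent))
n.parents <- length(parents)
for (i in seq_len(n.parents)) {
  # Each of these lines rescans the whole parent column
  words <- pcm$child[pcm$parent == parents[i]]
  counts <- pcm$count[pcm$parent == parents[i]]
  # pcm$prob is not in the sample data above (it yields NULL there)
  probs <- pcm$prob[pcm$parent == parents[i]]
  # Keep at most 6 children per parent
  n.children <- min(length(words), 6)
  ngram.tail <- list()
  for (k in seq_len(n.children)) {
    ngram.tail[[k]] <- list(word = words[k],
                            count = counts[k],
                            prob = probs[k])
  }
  lookup4[[parents[i]]] <- list(children = ngram.tail)
}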
Could I speed it up by eliminating the 'for' loop? If so, how would I code the transformation?
Try this:
Assuming the data frame is called parents, this turns each row into one list element:
parents.list <- as.list(as.data.frame(t(parents)))
If you want the row names of parents to be the names of the list:
parents.list <- setNames(split(parents, seq(nrow(parents))), rownames(parents))
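The nested structure described in the question (one entry per unique parent, each holding a children list) can also be built by grouping rows with split() and lapply() instead of an explicit for loop. The following is only a minimal sketch on the question's sample data, not a benchmarked solution: the names max.children and rows.by.parent are introduced here for illustration, the prob column is left out because the sample data does not contain it, and the member names child and count follow the question's description of the desired output.

```r
# Sample data from the question (no prob column in the sample).
pcm <- data.frame(
  parent = c("middle", "middle", "might", "might",
             "might", "million", "million", "millions"),
  child  = c("of", "school", "be", "have", "not", "in", "to", "of"),
  count  = c(476, 165, 1183, 619, 321, 490, 190, 269),
  stringsAsFactors = FALSE
)

max.children <- 6  # hypothetical name; the cap of 6 comes from the question

# Group the row indices by parent in a single pass, instead of
# comparing the whole parent column against every unique key.
rows.by.parent <- split(seq_len(nrow(pcm)), pcm$parent)

lookup4 <- lapply(rows.by.parent, function(idx) {
  idx <- head(idx, max.children)  # keep at most max.children rows
  children <- lapply(idx, function(j) {
    list(child = pcm$child[j], count = pcm$count[j])
  })
  list(children = children)
})
```

With the sample data, lookup4[["middle"]]$children[[1]]$child is "of" and lookup4[["might"]] has three children. Note that split() orders the top-level names by factor level (alphabetically here) rather than by first appearance, which does not matter for lookup by name.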