Fast way to separate a list of lists into two lists
I have a good deal of experience with C programming and am used to thinking in terms of pointers, so I can get good performance when dealing with huge amounts of data. It is not the same with R, which I am still learning.
I have a file with approximately 1 million lines, separated by '\n', and each line contains one, two or more integers separated by a single space. I have been able to put together code which reads the file and puts everything into a list of lists. Some lines can be empty. I would then like to put the first number of each line, if it exists, into a separate list, simply skipping empty lines, and the remaining numbers into a second list.
The code I post here is terribly slow (it was still running when I finished writing this question, so I killed R). How can I get a decent speed? In C this would be done instantly.
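graph <- function() {
  x <- scan("result", what = "", sep = "\n")
  y <- strsplit(x, "[[:space:]]+")  # split each line on whitespace
  y <- lapply(y, FUN = as.integer)  # list of character vectors -> list of integer vectors
  print("here we go")
  first <- c()
  others <- c()
  k <- 1  # next free position in `others`
  for (i in seq_along(y)) {
    n <- length(y[[i]])  # y[[i]] is the vector itself; y[i] is a one-element list
    if (n >= 1) {
      first[i] <- y[[i]][1]
    }
    # seq_len() yields an empty sequence when the line has fewer than two numbers
    for (j in seq_len(max(n - 1, 0)) + 1) {
      others[k] <- y[[i]][j]
      k <- k + 1
    }
  }
  list(first = first, others = others)  # return both results
}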
In a previous version of the code, in which each line had at least one number and I was interested only in the first number of each line, I used this code (I read everywhere that I should avoid using for loops in languages like R):
yy <- rapply(y, function(x) head(x,1))
which takes about 5 seconds: far better than the above, but still annoying compared to C.
EDIT: this is an example of the first 10 lines of my file:
42 7 31 3
23 1 34 5
1
-23 -34 2 2
42 7 31 3 31 4
1
Base R versus purrr
library(purrr)  # provides map() and re-exports %>%
your_list <- rep(list(list(1,2,3,4), list(5,6,7), list(8,9)), 100)
microbenchmark::microbenchmark(
your_list %>% map(1),
lapply(your_list, function(x) x[[1]])
)
Unit: microseconds
                                  expr       min        lq       mean    median         uq       max neval
                  your_list %>% map(1) 22671.198 23971.213 24801.5961 24775.258 25460.4430 28622.492   100
 lapply(your_list, function(x) x[[1]])   143.692   156.273   178.4826   162.233   172.1655  1089.939   100
microbenchmark::microbenchmark(
your_list %>% map(. %>% .[-1]),
lapply(your_list, function(x) x[-1])
)
Unit: microseconds
                                 expr     min       lq      mean   median       uq      max neval
       your_list %>% map(. %>% .[-1]) 916.118 942.4405 1019.0138 967.4370 997.2350 2840.066   100
 lapply(your_list, function(x) x[-1]) 202.956 219.3455  264.3368 227.9535 243.8455 1831.244   100
purrr isn't a package built for performance, just for convenience, which is great, but not when you care a lot about performance. This has been discussed elsewhere.
By the way, if you are good at C, you should look at the Rcpp package.
Try this:
your_list <- list(list(1,2,3,4),
list(5,6,7),
list(8,9))
library(purrr)
first <- your_list %>% map(1)
# [[1]]
# [1] 1
#
# [[2]]
# [1] 5
#
# [[3]]
# [1] 8
other <- your_list %>% map(. %>% .[-1])
# [[1]]
# [[1]][[1]]
# [1] 2
#
# [[1]][[2]]
# [1] 3
#
# [[1]][[3]]
# [1] 4
#
#
# [[2]]
# [[2]][[1]]
# [1] 6
#
# [[2]][[2]]
# [1] 7
#
#
# [[3]]
# [[3]][[1]]
# [1] 9
Though you might want the following instead, as it seems to me those numbers would be better stored in vectors than in lists:
your_list %>% map(1) %>% unlist # as it seems map_dbl was slow
# [1] 1 5 8
your_list %>% map(~unlist(.x[-1]))
# [[1]]
# [1] 2 3 4
#
# [[2]]
# [1] 6 7
#
# [[3]]
# [1] 9
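For comparison, the same two results can be obtained in base R without purrr. This is only a sketch on the toy `your_list` from above; `vapply()` is used here as an assumption about what you want, since it returns a plain numeric vector and fails loudly if any first element is not a single number.

```r
# Toy data, same shape as in the example above.
your_list <- list(list(1, 2, 3, 4),
                  list(5, 6, 7),
                  list(8, 9))

# First element of each sublist, returned as a numeric vector.
first <- vapply(your_list, function(x) x[[1]], numeric(1))

# Remaining elements of each sublist, each flattened to a numeric vector.
others <- lapply(your_list, function(x) unlist(x[-1]))
```

Here `first` is a numeric vector with one entry per sublist, and `others` is a list of numeric vectors, matching the shape of the purrr results.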
Indeed, coming from C to R will be confusing (it was for me). What helps for performance is understanding that the primitive types in R are all vectors, implemented in highly optimized, natively compiled C and Fortran, and that you should aim to avoid loops when a vectorized solution is available.
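As an illustration of what a vectorized solution could look like for the question's task, here is a minimal sketch. It assumes the lines have already been parsed into a list of integer vectors (called `y`, as in the question), with `integer(0)` standing in for an empty line; the sample values are taken from the question's example.

```r
# Example data modelled on the sample lines in the question
# (integer(0) stands for an empty line).
y <- list(c(42L, 7L, 31L, 3L), c(23L, 1L, 34L, 5L), 1L,
          integer(0), c(-23L, -34L, 2L, 2L))

n    <- lengths(y)          # number of integers on each line
flat <- unlist(y)           # all integers in one vector, in file order

# Position in `flat` of the first integer of every non-empty line.
first_idx <- (cumsum(n) - n + 1)[n > 0]

first  <- flat[first_idx]   # first number of each non-empty line
others <- flat[-first_idx]  # everything else, in file order
```

The idea is to flatten once and then index, so that no R-level loop runs over the million lines and no vector is grown element by element.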
That said, I think you should load this as a CSV via read.csv(). This will provide you with a data frame on which you can perform vector-based operations.
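Since the lines in the question are separated by spaces and have different lengths, a plain read.csv() call may not work as is. One possible variant, sketched here under the assumption that padding short lines with NA is acceptable, uses read.table() with fill = TRUE. The vector `txt` is a hypothetical stand-in for the file contents; with a real file you would pass its path instead.

```r
# Hypothetical stand-in for the file contents (one string per line).
txt <- c("42 7 31 3", "23 1 34 5", "1", "",
         "-23 -34 2 2", "42 7 31 3 31 4", "1")

# Find the widest line so that read.table() allocates enough columns.
con <- textConnection(txt)
n_cols <- max(count.fields(con))
close(con)

# Blank lines are skipped by default; short lines are padded with NA.
df <- read.table(text = txt, header = FALSE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_cols)))

first  <- df$V1     # first number of each non-empty line
others <- df[, -1]  # remaining numbers, NA where a line was shorter
```

Whether this is faster than the list-based approach on a million lines would need to be measured; the point is only that the result is a data frame of integer columns that supports vectorized operations.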
For a better understanding, a concise (and humorous) read is http://www.burns-stat.com/pages/Tutor/R_inferno.pdf .
I would try to use the stringr package. Something like this:
set.seed(3)
d <- replicate(3, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
d
# [1] "169 807 385 \n" "328 602 604 \n" "125 295 577 \n"
library(stringr)
str_split(d, " ", simplify = TRUE)
#      [,1]  [,2]  [,3]  [,4]
# [1,] "169" "807" "385" "\n"
# [2,] "328" "602" "604" "\n"
# [3,] "125" "295" "577" "\n"
Even for large data it is fast:
d <- replicate(1e6, sample(1:1000, 3))
d <- apply(d, 2, function(x) paste(c(x, "\n"), collapse = " "))
head(d)  # inspect the first few strings instead of printing all 1e6
system.time(s <- str_split(d, " ", simplify = TRUE))  # 0.77 s
Assuming the file is a CSV, and that all of the 'numbers' are strictly of the form 1 2 or -1 2 (i.e., 1 2 3 or 1 23 are not allowed in the file), then one could start by coding:
# Install package `data.table` if needed
# install.packages('data.table')
# Load `data.table` package
library(data.table)
# Load the CSV, which has just one column named `my_number`.
# Then, coerce `my_number` into character format and remove negative signs.
DT <- fread('file.csv')[, my_number := as.character(abs(my_number))]
# Extract first character, which would be the first desired digit
# if my assumption about number formats is correct.
DT[, first_column := substr(my_number, 1, 1)]
# The rest of the substring can go into another column.
DT[, second_column := substr(my_number, 2, nchar(my_number))]
Then, if you still really need to create two lists, you could do the following.
# Create the first list.
first_list <- DT[, as.list(first_column)]
# Create the second list.
second_list <- DT[, as.list(second_column)]