How to save an Excel/CSV file with more than 1,048,576 records in R?
I have data with more than 1,048,576 records and want to save it in Excel or CSV format using R. I know that an Excel sheet is limited to 1,048,576 rows, but I am fine with the remaining records being written to additional sheets. Is there any way to achieve this?
Thanks
Both scripts below, one writing CSV and one writing xlsx, start by setting the digits option to a larger value (see this SO question) and by setting a temporary directory in which to save and retrieve the files.
The base function write.csv has no 2^20 = 1,048,576 row limit.
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#> size isdir mode mtime
#> so_q71553974_test.csv 97139168 FALSE 666 2022-03-21 08:13:42
#> ctime atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42 no
# read the data from file and check if
# the two data sets are equal
df2 <- read.csv(csv_test_file)
dim(df1)
#> [1] 4194294 2
dim(df2)
#> [1] 4194294 2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE
Created on 2022-03-21 by the reprex package (v2.0.1)
Final clean-up
unlink(csv_test_file)
options(old_opts)
setwd(old_dir)
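The same write-and-read-back check can be tried on a much smaller data set before committing to the multi-million-row run. The following is only a minimal sketch under assumptions that are not part of the answer above: the names small, tmp_csv, back, exact_match and close_match are made up for illustration, and tempfile() is used in place of setwd() so the working directory is never changed.

```r
# Small stand-in for the large data set (same column layout as df1);
# all object names here are illustrative, not from the original answer
set.seed(2022)
small <- data.frame(x = rep(letters, 2), y = rnorm(52))

# tempfile() returns a throwaway path, so no setwd() is needed
tmp_csv <- tempfile(fileext = ".csv")
write.csv(small, tmp_csv, row.names = FALSE)
back <- read.csv(tmp_csv)

# exact comparison; can be FALSE if digits are lost in the text round trip
exact_match <- identical(small, back)
# comparison within all.equal()'s default numeric tolerance
close_match <- isTRUE(all.equal(small, back))

# remove the temporary file
unlink(tmp_csv)
```

Comparing exact_match with close_match shows whether any difference after the round trip is only a matter of floating-point digits, which is the situation the identical() and all.equal() results above suggest.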
Excel has a limit of 2^20 = 1,048,576 rows per sheet, so the code below splits the data into sub-data frames of at most 2^20 - 2 rows each. Subtracting 2 accounts for the column header row plus one extra row, just to stay clear of the limit.
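The splitting step described above can be sketched on a toy data frame. This is an illustrative variant, not the code of the answer: split_rows is a hypothetical helper, and the "sheet_" names and the toy data are assumptions made for the example.

```r
# Hypothetical helper: split a data frame into consecutive chunks of
# at most max_rows rows each, returned as a named list
split_rows <- function(df, max_rows) {
  stopifnot(max_rows >= 1)
  # number of chunks needed; no extra chunk when nrow is an exact multiple
  n_chunks <- ceiling(nrow(df) / max_rows)
  # integer chunk index per row: 1,1,...,2,2,... truncated to nrow(df)
  chunk_id <- rep(seq_len(n_chunks), each = max_rows, length.out = nrow(df))
  chunks <- split(df, chunk_id)
  # name the chunks after splitting, so they stay in numeric order
  names(chunks) <- paste0("sheet_", seq_len(n_chunks))
  chunks
}

toy <- data.frame(id = 1:10, value = (1:10)^2)
parts <- split_rows(toy, max_rows = 4)
sapply(parts, nrow)
```

With 10 rows and max_rows = 4 this gives three chunks of 4, 4 and 2 rows. Splitting on an integer index and naming the chunks afterwards is a deliberate choice in this sketch: split() orders character groups alphabetically, so labels such as "sheet_10" would sort before "sheet_2" once there are ten or more chunks. A named list like parts is the shape of input that write_xlsx is given in the script below, one sheet per element.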
When tested for equality, the two data frames have different classes: read_excel reads the file and returns a tibble, which sub-classes "data.frame". Converting the result with as.data.frame() before comparing removes that class difference.
old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")
# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294
library(readxl)
library(writexl)
xl_test_file <- "so_q71553974_test.xlsx"
max_sheet_size <- 2^20 - 2L # leave room for the header row plus one spare row
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))
sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4
#> 1048574 1048574 1048574 1048572
write_xlsx(sp, path = xl_test_file)
file.info(xl_test_file)
#> size isdir mode mtime
#> so_q71553974_test.xlsx 89724869 FALSE 666 2022-03-21 08:28:54
#> ctime atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54 no
# read the excel file
# since it has more than one sheet, loop through
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))
# bind all rows
df2 <- do.call(rbind, df2)
dim(df1)
#> [1] 4194294 2
dim(df2)
#> [1] 4194294 2
identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"
class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df" "tbl" "data.frame"
# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)
Created on 2022-03-21 by the reprex package (v2.0.1)