How to save Excel/CSV file with more than 1048576 records in R?

I have data with more than 1,048,576 records and want to save it in Excel or CSV format from R. I know that an Excel sheet is limited to 1,048,576 rows, but I am fine with the remaining records being written to additional sheets. Is there any way to achieve this? Thanks.

Both scripts below, one writing CSV and one writing xlsx, start by setting the digits option to a larger value (see this SO question) and by setting a temporary directory in which to save and retrieve the files.
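The setup and restore pattern described here can be sketched on its own. This is a minimal sketch, not the answer's code: it assumes that writing into the session's temporary directory via tempfile() is acceptable in place of setwd("~/Temp"), and the file name prefix "demo_" is hypothetical.

```r
# options() returns the previous values, so they can be restored later
old_opts <- options(digits = 20)
digits_during <- getOption("digits")

# tempfile() builds a path inside the session's temporary directory,
# so the working directory does not need to change
# ("demo_" is a hypothetical prefix used only for this sketch)
tmp_csv <- tempfile(pattern = "demo_", fileext = ".csv")
write.csv(data.frame(y = pi), tmp_csv, row.names = FALSE)
file_was_written <- file.exists(tmp_csv)

# clean up the file and restore the previous options
unlink(tmp_csv)
options(old_opts)
```

Keeping the value returned by options() is what makes the final restore possible, whichever directory the files are written to.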

Write as CSV

The base function write.csv has no 2^20 (1,048,576) row limit.

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294

csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#>                           size isdir mode               mtime
#> so_q71553974_test.csv 97139168 FALSE  666 2022-03-21 08:13:42
#>                                     ctime               atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42  no

# read the data from file and check if 
# the two data sets are equal
df2 <- read.csv(csv_test_file)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE

Created on 2022-03-21 by the reprex package (v2.0.1)
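In the CSV round trip, identical() returns FALSE while all.equal() returns TRUE. A likely explanation, which the answer itself does not state, is that write.csv writes numbers with at most 15 significant digits, so the values read back can differ from the originals by a tiny rounding error that all.equal() tolerates. A small sketch of that check, using a temporary file and a much smaller data.frame than the one in the answer:

```r
set.seed(1)
orig <- data.frame(y = rnorm(1000))

# round trip through a CSV file in the session's temporary directory
tmp <- tempfile(fileext = ".csv")
write.csv(orig, tmp, row.names = FALSE)
back <- read.csv(tmp)
unlink(tmp)

# exact comparison versus comparison with a numeric tolerance
exactly_same <- identical(orig, back)
nearly_same  <- isTRUE(all.equal(orig, back))

# size of the largest discrepancy introduced by the text representation
max_abs_diff <- max(abs(orig$y - back$y))
```

If exactly_same is FALSE while max_abs_diff is very small, the difference comes from the decimal text representation rather than from lost or reordered rows.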

Final clean-up

unlink(csv_test_file)
options(old_opts)
setwd(old_dir)

Write as Excel file

Excel has a limit of 2^20 = 1,048,576 rows per sheet, so the code below splits the data into sub-data.frames of at most 2^20 - 2 rows each. Subtracting 2 accounts for the column header row plus one extra row, just to stay below the limit.
When tested for equality, the two data.frames have different classes: read_excel reads the file and returns a tibble, which is a subclass of "data.frame".
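The class difference described here can be reproduced and removed with base R alone. In this sketch the tibble is only imitated by assigning the class vector c("tbl_df", "tbl", "data.frame") by hand, which is an assumption made so that the example runs without readxl or tibble installed; as.data.frame() then drops the extra classes.

```r
df1 <- data.frame(x = letters[1:3], y = c(1.5, 2.5, 3.5))

# stand-in for the result of read_excel(): same data, tibble class vector
# (assigned by hand for this sketch only)
df2 <- df1
class(df2) <- c("tbl_df", "tbl", "data.frame")

# the comparison reports a class mismatch at this point
same_as_is <- isTRUE(all.equal(df1, df2))

# as.data.frame() strips the classes that precede "data.frame"
df2_plain <- as.data.frame(df2)
same_after <- isTRUE(all.equal(df1, df2_plain))
```

Applied to the data read back from the xlsx file, the same conversion should leave only genuine differences in the data, if any, for all.equal() to report.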

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294


library(readxl)
library(writexl)

xl_test_file <- "so_q71553974_test.xlsx"

max_sheet_size <- 2^20 - 2L  # one row for the header, one more to stay under the limit
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))

sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4 
#>      1048574      1048574      1048574      1048572
write_xlsx(sp, path = xl_test_file)

file.info(xl_test_file)
#>                            size isdir mode               mtime
#> so_q71553974_test.xlsx 89724869 FALSE  666 2022-03-21 08:28:54
#>                                      ctime               atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54  no

# read the excel file
# since it has more than one sheet, loop through 
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))

# bind all rows 
df2 <- do.call(rbind, df2)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df"     "tbl"        "data.frame"

# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)

Created on 2022-03-21 by the reprex package (v2.0.1)
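The splitting step could also be wrapped in a small reusable helper. The function below is a sketch, not part of the original answer: the name split_for_sheets and its arguments max_rows and prefix are hypothetical. It uses only base R and numbers the chunks with an integer index, so the pieces stay in row order even when there are ten or more of them.

```r
# Split a data.frame into consecutive chunks of at most max_rows rows.
# Returns a named list with one element per future Excel sheet.
# (split_for_sheets, max_rows and prefix are hypothetical names)
split_for_sheets <- function(df, max_rows = 2^20 - 2L, prefix = "sheet_") {
  stopifnot(is.data.frame(df), max_rows >= 1)
  n <- nrow(df)
  if (n == 0L) return(setNames(list(df), paste0(prefix, 1L)))
  # chunk number of every row: 1, 1, ..., 2, 2, ...
  idx <- (seq_len(n) - 1L) %/% max_rows + 1L
  chunks <- split(df, idx)
  names(chunks) <- paste0(prefix, names(chunks))
  chunks
}

# small demonstration with an artificially low limit
demo <- data.frame(id = 1:10, value = (1:10) / 2)
parts <- split_for_sheets(demo, max_rows = 4)
part_sizes <- vapply(parts, nrow, integer(1))
```

The resulting named list has the same shape as the sp object in the answer, so it could be passed to write_xlsx in the same way, with the list names becoming the sheet names.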
