How to save an Excel/CSV file with more than 1048576 records in R?

Question

I have data with more than 1048576 records and want to save it in Excel or CSV format using the R programming language. I know that an Excel sheet is limited to 1048576 records, but I am okay with the remaining records being appended to another sheet. Is there any way to achieve this? Thanks

Answer 1

Both scripts, the one that writes a csv file and the one that writes an xlsx file, start by setting the digits option to a larger value (see this SO question) and by setting a temporary directory in which to save and retrieve the files.

Write as CSV

The base function write.csv has no 2^20 (1,048,576) row limit.

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294

csv_test_file <- "so_q71553974_test.csv"
# write to disk and check its size and other info
write.csv(df1, csv_test_file, row.names = FALSE)
file.info(csv_test_file)
#>                           size isdir mode               mtime
#> so_q71553974_test.csv 97139168 FALSE  666 2022-03-21 08:13:42
#>                                     ctime               atime exe
#> so_q71553974_test.csv 2022-03-21 08:13:29 2022-03-21 08:13:42  no

# read the data from file and check if 
# the two data sets are equal
df2 <- read.csv(csv_test_file)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] TRUE

Created on 2022-03-21 by the reprex package (v2.0.1)

Final clean-up

unlink(csv_test_file)
options(old_opts)
setwd(old_dir)
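
The script above assumes that a directory "~/Temp" exists. As a minimal sketch of an alternative that leaves the working directory untouched, the same round trip can be run against a path returned by tempfile(). The small row count and the variable names csv_file and round_trip_ok are assumptions made for illustration, not part of the original answer.

```r
# Small stand-in data set; the original answer uses about 4.2 million rows.
set.seed(2022)
n <- 1000L
df1 <- data.frame(x = rep(letters, length.out = n), y = rnorm(n))

# tempfile() returns a path inside the session's temporary directory,
# so neither setwd() nor a pre-existing "~/Temp" directory is needed.
csv_file <- tempfile(fileext = ".csv")
write.csv(df1, csv_file, row.names = FALSE)

# Read the file back and compare up to numeric tolerance.
df2 <- read.csv(csv_file)
round_trip_ok <- isTRUE(all.equal(df1, df2))
round_trip_ok

# Remove the temporary file.
unlink(csv_file)
```

all.equal() is used for the comparison because it tolerates tiny numeric differences, which is consistent with the answer's own result of identical() returning FALSE and all.equal() returning TRUE.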

Write as Excel file

Excel has a limit of 2^20 = 1048576 rows per sheet, so the code below splits the data into sub-data.frames of 2^20 - 2 rows at most. The 2 is subtracted to account for the column header row plus one extra row, just to stay below the limit.
When tested for equality, the two data.frames have different classes: read_excel reads the file and returns a tibble, which is a subclass of "data.frame".
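
As a minimal sketch, the splitting rule described above can be wrapped in a small base R helper. The function name split_for_sheets and its arguments are hypothetical; they are not part of the original answer or of the writexl package. The helper numbers the chunks with an integer index, because split() returns groups in sorted factor-level order and character labels such as "test_write_10" would sort before "test_write_2" once ten or more sheets are needed.

```r
# Hypothetical helper (illustration only): split a data.frame into a named
# list of chunks with at most `max_rows` rows each, in original row order.
split_for_sheets <- function(df, max_rows = 2^20 - 2L, prefix = "sheet_") {
  stopifnot(is.data.frame(df), max_rows >= 1)
  # An empty data.frame still yields one (empty) sheet.
  if (nrow(df) == 0L) return(setNames(list(df), paste0(prefix, 1L)))
  # Chunk index for every row: 1, 1, ..., 2, 2, ...
  idx <- (seq_len(nrow(df)) - 1L) %/% max_rows + 1L
  chunks <- split(df, idx)
  # Numeric indices keep their numeric order; add the prefix afterwards.
  names(chunks) <- paste0(prefix, names(chunks))
  chunks
}

# Usage with a tiny data.frame and a tiny limit, so the result is easy to inspect.
d <- data.frame(a = 1:7, b = letters[1:7])
ch <- split_for_sheets(d, max_rows = 3)
names(ch)
sapply(ch, nrow)
```

A named list like this could then be passed to write_xlsx(), which writes one sheet per list element, as the code below does with its own list.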

old_opts <- options(digits = 20)
old_dir <- getwd()
setwd("~/Temp")

# create a test data.frame
set.seed(2022)
# more than 1048576 rows
n <- 2^22
# two columns, one char, the other numeric
df1 <- data.frame(x = rep(letters, n%/%26), y = rnorm(n - 10L))
nrow(df1)
#> [1] 4194294


library(readxl)
library(writexl)

xl_test_file <- "so_q71553974_test.xlsx"

max_sheet_size <- 2^20 - 2L  # 2^20 rows, minus 1 for the header row, minus 1 more to be safe
nsheets <- nrow(df1) %/% max_sheet_size + 1L
f <- rep(paste0("test_write_", seq.int(nsheets)), each = max_sheet_size, length.out = nrow(df1))

sp <- split(df1, f)
names(sp)
#> [1] "test_write_1" "test_write_2" "test_write_3" "test_write_4"
sapply(sp, nrow)
#> test_write_1 test_write_2 test_write_3 test_write_4 
#>      1048574      1048574      1048574      1048572
write_xlsx(sp, path = xl_test_file)

file.info(xl_test_file)
#>                            size isdir mode               mtime
#> so_q71553974_test.xlsx 89724869 FALSE  666 2022-03-21 08:28:54
#>                                      ctime               atime exe
#> so_q71553974_test.xlsx 2022-03-21 08:28:44 2022-03-21 08:28:54  no

# read the excel file
# since it has more than one sheet, loop through 
# the sheets and read them one by one
sheets <- excel_sheets(xl_test_file)
df2 <- lapply(sheets, \(s) read_excel(xl_test_file, sheet = s))

# bind all rows 
df2 <- do.call(rbind, df2)

dim(df1)
#> [1] 4194294       2
dim(df2)
#> [1] 4194294       2

identical(df1, df2)
#> [1] FALSE
all.equal(df1, df2)
#> [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
#> [2] "Attributes: < Component \"class\": 1 string mismatch >"

class(df1)
#> [1] "data.frame"
class(df2)
#> [1] "tbl_df"     "tbl"        "data.frame"

# final clean up
unlink(xl_test_file)
options(old_opts)
setwd(old_dir)

Created on 2022-03-21 by the reprex package (v2.0.1)
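
The class mismatch reported by all.equal in the Excel example comes only from the extra tibble classes. As a hedged sketch, converting the result with as.data.frame() drops those classes, so the contents can be compared directly. To keep the example runnable without readxl or tibble installed, the tibble classes are attached by hand here; that stand-in is an assumption for illustration, not what read_excel does internally.

```r
# A plain data.frame and a copy that carries the tibble class vector.
# Setting the class by hand is a stand-in for the output of read_excel().
df1 <- data.frame(x = letters[1:3], y = c(0.5, 1.5, 2.5))
df2 <- df1
class(df2) <- c("tbl_df", "tbl", "data.frame")

# The class attribute differs, so all.equal() does not return TRUE.
same_before <- isTRUE(all.equal(df1, df2))

# as.data.frame() returns an object of class "data.frame" only.
df3 <- as.data.frame(df2)
class(df3)
same_after <- isTRUE(all.equal(df1, df3))

same_before
same_after
```

In the answer's code the equivalent step would be to call as.data.frame() on the row-bound result before comparing it with the original data.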

Answer 1, accepted 2022-03-21 08:55:12