Loading more than one Spark parquet file into a Spark table using R sparklyr?
I am trying to load multiple parquet files into a single Spark table using R sparklyr. The code below shows how I am doing it:
spark_load_data <- function(db_conn, test_period)
{
  library(DBI)

  overwrite <- TRUE

  for (ts in seq(as.Date(test_period["START_DATE", "VALUE"]),
                 as.Date(test_period["END_DATE", "VALUE"]),
                 by = "day")) {
    # date to load
    td <- format(as.Date(ts, origin = "1970-01-01"), "%Y-%m-%d")

    # load that day's parquet files into the same table
    tbl <- "pcidata"
    pq_path <- paste0("s3://<path>/PciData/transaction_date=", td)
    read_in <- spark_read_parquet(db_conn,
                                  name = tbl,
                                  path = pq_path,
                                  overwrite = overwrite)

    # only overwrite on the first iteration
    overwrite <- FALSE
  }
}
I want the Spark table to contain all of the parquet files, rather than overwriting or skipping data. Can this be done?
The underlying read.parquet method actually supports supplying more than one file path, which means we can write a simple wrapper:
read_parquet_multiple <- function(sc, paths) {
  spark_session(sc) %>%
    invoke("read") %>%                   # get a DataFrameReader from the session
    invoke("parquet", as.list(paths))    # parquet() accepts a variable number of paths
}
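Here invoke() drops down to the JVM API: the chain retrieves the session's DataFrameReader and calls its parquet() method, which takes a variable number of paths. Note that the result is a reference to a Spark DataFrame (a spark_jobj) rather than an ordinary sparklyr table.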
Then use it to read multiple files. Below is a complete example, including connecting to a local Spark instance and writing two parquet files to load:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")

# Write the numbers 1:6 into 2 separate parquet files
sdf_seq(sc, 1, 3, repartition = NULL) %>% spark_write_parquet("batch_1")
sdf_seq(sc, 4, 6, repartition = NULL) %>% spark_write_parquet("batch_2")

# Read multiple files with the wrapper defined above
dataset <- sc %>% read_parquet_multiple(paths = c("batch_1", "batch_2"))

# Collect to show the results
dataset %>% collect()
# # A tibble: 6 x 1
# id
# <int>
# 1 2
# 2 3
# 3 5
# 4 6
# 5 1
# 6 4
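To tie this back to the original question's setup, the same wrapper can be pointed at the date-partitioned S3 paths and the result registered as a named Spark table with sdf_register(). A minimal sketch, assuming sc is the open connection from above, the s3://<path>/PciData prefix from the question, and an illustrative date range:

# Sketch: build one path per day and load them all into a single table.
# The date range is a placeholder; substitute your own START/END dates,
# and replace <path> with the real bucket prefix.
days  <- seq(as.Date("2021-01-01"), as.Date("2021-01-31"), by = "day")
paths <- paste0("s3://<path>/PciData/transaction_date=", format(days, "%Y-%m-%d"))

pcidata <- sc %>%
  read_parquet_multiple(paths = paths) %>%
  sdf_register("pcidata")   # register the Spark DataFrame as the table "pcidata"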