
Read Parquet file in Databricks using sparklyr

I am trying to read a Parquet file from R into Apache Spark 2.4.3 using the code below. It works on my local Windows 10 machine, but not on Databricks Runtime 5.5 LTS.

library(sparklyr)
library(arrow)

# Set up Spark connection
sc <- sparklyr::spark_connect(method = "databricks")

# Convert iris R data frame to Parquet and save to disk
arrow::write_parquet(iris, "/dbfs/user/iris.parquet")

# Read Parquet file into a Spark DataFrame: throws the error below
iris_sdf <- sparklyr::spark_read_parquet(sc, "iris_sdf", "user/iris.parquet")

Error in record_batch_stream_reader(stream) : could not find function "record_batch_stream_reader"

What could possibly be wrong here?

sessionInfo() on my local machine:

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_0.16.0.2 sparklyr_1.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3        rstudioapi_0.11   magrittr_1.5      bit_1.1-15.2      tidyselect_1.0.0  R6_2.4.1          rlang_0.4.5       httr_1.4.1        dplyr_0.8.5       tools_3.6.3       DBI_1.1.0         dbplyr_1.4.2      ellipsis_0.3.0    htmltools_0.4.0  
[15] bit64_0.9-7       assertthat_0.2.1  rprojroot_1.3-2   digest_0.6.25     tibble_2.1.3      forge_0.2.0       crayon_1.3.4      purrr_0.3.3       vctrs_0.2.4       base64enc_0.1-3   htmlwidgets_1.5.1 glue_1.3.1        compiler_3.6.3    pillar_1.4.3     
[29] generics_0.0.2    r2d3_0.2.3        backports_1.1.5   jsonlite_1.6.1    pkgconfig_2.0.3  

The problem is that Databricks Runtime 5.5 LTS ships with sparklyr 1.0.0 (released 2019-02-25), but version 1.1.0 or later is required. Install a newer version from either CRAN or GitHub, and spark_read_parquet() should work.
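A quick way to confirm whether the cluster is affected is to compare the installed package version against the required one (a minimal sketch, not part of the original answer; run it in the Databricks R notebook before and after upgrading):

```r
# Check which sparklyr version is installed on the cluster.
# Databricks Runtime 5.5 LTS preinstalls 1.0.0, which lacks the
# Arrow record-batch functions that spark_read_parquet() needs here.
installed <- packageVersion("sparklyr")
required  <- package_version("1.1.0")

if (installed < required) {
  message("sparklyr ", installed, " is too old; need >= ", required,
          " - upgrade via install.packages(\"sparklyr\")")
} else {
  message("sparklyr ", installed, " is new enough")
}
```

Note that on Databricks a package installed this way lives only on the driver for the current cluster session; reattach or reinstall after a cluster restart.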

# CRAN
install.packages("sparklyr")

# GitHub
devtools::install_github("rstudio/sparklyr")

# You also need the Apache Arrow R package
install.packages("arrow")
arrow::install_arrow()
