
How to get dplyr::summarize_all to work on a SparkDataFrame on Databricks?

Question

I have a massive Spark DataFrame called x on Databricks. It has billions of records, too many to collect onto a single machine. What do I have to do to get the following call to work?

dplyr::summarize_all(x,mean)

More Info

This is the error message I currently get:

Error in UseMethod("tbl_vars") : 
  no applicable method for 'tbl_vars' applied to an object of class "SparkDataFrame"

and

class(x)

returns:

[1] "SparkDataFrame"
attr(,"package")
[1] "SparkR"

The book Mastering Spark with R has an example that copies a tiny R data frame into Spark and runs summarize_all on it:

cars <- copy_to(sc, mtcars)
summarize_all(cars, mean)

Note that the above code works on my Databricks cluster and returns:

# Source: spark<?> [?? x 11]
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  20.1  6.19  231.  147.  3.60  3.22  17.8 0.438 0.406  3.69  2.81

The same book leads me to believe I can use this and similar functions on huge Spark data frames.

In addition,

class(cars)

returns:

[1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"   

It seems obvious that I need to convert my SparkDataFrame to a tbl_spark (or a tbl_sql, tbl_lazy, or tbl) so that I can pass it to dplyr::summarize_all, but I have searched extensively, asked experts, and still cannot figure out how to do this.

Answer

You're right that SparkR and sparklyr are different APIs that don't play well together. You can make a SparkR data frame usable from sparklyr by registering it as a temp table and then referencing that table from sparklyr.
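The error in the question comes from S3 method dispatch: the dplyr verbs call generics such as tbl_vars, and no method is registered for the SparkDataFrame class. The base-R sketch below imitates that situation. The generic show_vars and the mock objects are hypothetical stand-ins (real SparkR objects are S4 instances, and the choice of tbl_lazy as the class carrying the method is an assumption made for illustration), so treat it as a picture of the mechanism rather than of dplyr's internals.

```r
# Hypothetical generic standing in for dplyr's tbl_vars().
show_vars <- function(x) UseMethod("show_vars")

# Assumed for illustration: a method exists for one of the classes that a
# sparklyr table carries, but none exists for "SparkDataFrame".
show_vars.tbl_lazy <- function(x) "dispatched to the tbl_lazy method"

# Mock objects with the same class vectors as reported in the question.
mock_sparklyr <- structure(list(), class = c("tbl_spark", "tbl_sql", "tbl_lazy", "tbl"))
mock_sparkr   <- structure(list(), class = "SparkDataFrame")

# Dispatch walks the class vector and finds the tbl_lazy method.
ok <- show_vars(mock_sparklyr)

# No method matches "SparkDataFrame", so UseMethod() raises an error.
err <- tryCatch(show_vars(mock_sparkr), error = function(e) conditionMessage(e))

print(ok)
print(err)
```

The second call fails with the same kind of "no applicable method" message as in the question, which is why the object has to become a tbl_spark before dplyr can work with it.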

Here's a sparklyr connection and an example SparkR data frame.

sc <- sparklyr::spark_connect(method = "databricks")

x_sparkr <- SparkR::sql("SELECT 1 AS a UNION SELECT 2")

Create the temp table.

# registerTempTable() is deprecated in SparkR; createOrReplaceTempView() replaces it
SparkR::createOrReplaceTempView(x_sparkr, "temp_x")

Load it into sparklyr.

x_sparklyr <- dplyr::tbl(sc, "temp_x")

dplyr::summarize_all(x_sparklyr, mean)
#> # Source: spark<?> [?? x 1]
#>       a
#>   <dbl>
#> 1   1.5
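If this conversion is needed in several places, the steps above could be wrapped in a small helper. The sketch below is only one possible shape: sparkr_to_tbl and its view_name argument are hypothetical names, not part of SparkR or sparklyr, and it assumes (as on Databricks in the example above) that SparkR and sparklyr share the same Spark session. It uses SparkR::createOrReplaceTempView, the non-deprecated counterpart of registerTempTable.

```r
# Hypothetical helper (not part of SparkR or sparklyr): expose a SparkR
# SparkDataFrame to sparklyr through a temporary view.
# Assumes both packages talk to the same Spark session.
sparkr_to_tbl <- function(sdf, sc, view_name = "temp_x") {
  # Fail early if the input is not a SparkR data frame.
  stopifnot(inherits(sdf, "SparkDataFrame"))

  # Register the SparkR data frame under a name in the Spark session.
  SparkR::createOrReplaceTempView(sdf, view_name)

  # Return a lazy sparklyr reference to that view; no rows are
  # collected onto the driver at this point.
  dplyr::tbl(sc, view_name)
}
```

With the objects from the example, dplyr::summarize_all(sparkr_to_tbl(x_sparkr, sc), mean) would then be the whole pipeline. Since summarize_all is superseded in current dplyr, dplyr::summarise(tbl, dplyr::across(dplyr::everything(), mean)) should be the equivalent on recent dplyr and dbplyr versions, though that has not been checked against this cluster.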
