
Using SparkR JVM to call methods from a Scala jar file

I wanted to be able to package DataFrames in a Scala jar file and access them in R. The end goal is to create a way to access specific and often-used database tables in Python, R, and Scala without writing a different library for each.

To do this, I made a jar file in Scala with functions that use the SparkSQL library to query the database and get the DataFrames I want. I wanted to be able to call these functions in R without creating another JVM, since Spark already runs on a JVM in R. However, the JVM Spark uses is not exposed in the SparkR API. To make it accessible and make Java methods callable, I modified "backend.R", "generics.R", "DataFrame.R", and "NAMESPACE" in the SparkR package and rebuilt the package:

In "backend.R" I made "callJMethod" and "createJObject" formal methods: 在“backend.R”中,我制作了“callJMethod”和“createJObject”正式方法:

  setMethod("callJMethod", signature(objId="jobj", methodName="character"), function(objId, methodName, ...) {
  stopifnot(class(objId) == "jobj")
  if (!isValidJobj(objId)) {
    stop("Invalid jobj ", objId$id,
         ". If SparkR was restarted, Spark operations need to be re-executed.")
  }
  invokeJava(isStatic = FALSE, objId$id, methodName, ...)
})


  setMethod("newJObject", signature(className="character"), function(className, ...) {
  invokeJava(isStatic = TRUE, className, methodName = "<init>", ...)
})

I modified "generics.R" to also contain these functions: 我修改了“generics.R”也包含这些功能:

#' @rdname callJMethod
#' @export
setGeneric("callJMethod", function(objId, methodName, ...) { standardGeneric("callJMethod")})

#' @rdname newJObject
#' @export
setGeneric("newJObject", function(className, ...) {standardGeneric("newJObject")})

Then I added exports for these functions to the NAMESPACE file:

export("cacheTable",
   "clearCache",
   "createDataFrame",
   "createExternalTable",
   "dropTempTable",
   "jsonFile",
   "loadDF",
   "parquetFile",
   "read.df",
   "sql",
   "table",
   "tableNames",
   "tables",
   "uncacheTable",
   "callJMethod",
   "newJObject")

This allowed me to call the Scala functions I wrote without starting a new JVM.
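For example, a minimal sketch of such a call (com.example.TableLoader and fooTable are placeholders for whatever class and method the jar actually provides, and sqlContext is assumed to be the jobj returned by sparkRSQL.init()):

  # instantiate the Scala class packaged in the jar (passed to Spark via --jars)
  loader  <- newJObject("com.example.TableLoader")
  # call its method; the Scala DataFrame comes back as a plain "jobj" reference
  fooJobj <- callJMethod(loader, "fooTable", sqlContext)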

The Scala methods I wrote return DataFrames, which come back to R as "jobj"s, but a SparkR DataFrame is an environment plus a jobj. To turn these jobj DataFrames into SparkR DataFrames, I used the dataFrame() function in "DataFrame.R", which I also made accessible following the steps above.
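Continuing the sketch above, the jobj can be wrapped and then used like any other SparkR DataFrame:

  # wrap the raw jobj in a SparkR DataFrame (an environment plus the jobj)
  fooDF <- dataFrame(fooJobj)
  head(fooDF)
  count(fooDF)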

I was then able to access the DataFrame that I "built" in Scala from R and use all of SparkR's functions on that DataFrame. I was wondering if there is a better way to make such a cross-language library, or if there is any reason the Spark JVM should not be public?

any reason the Spark JVM should not be public?

Probably more than one. Spark developers make serious efforts to provide a stable public API. Low-level details of the implementation, including the way guest languages communicate with the JVM, are simply not part of the contract. They could be completely rewritten at any point without any negative impact on users. If you decide to use them and there are backwards-incompatible changes, you're on your own.

Keeping internals private reduces the effort needed to maintain and support software. You simply don't have to bother yourself with all the possible ways in which users can abuse them.

a better way to make such a cross-language library

It is hard to say without knowing more about your use case. I see at least three options:

  • For starters, R provides only weak access control mechanisms. If any part of the API is internal, you can always use the ::: operator to access it. As smart people say:

    It is typically a design mistake to use ::: in your code since the corresponding object has probably been kept internal for a good reason.

    but one thing is for sure: it is much better than modifying the Spark source. As a bonus, it clearly marks the parts of your code that are particularly fragile and potentially unstable.
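    For illustration, a minimal sketch of this route (com.example.TableLoader and fooTable are placeholders again; the ::: calls reach into SparkR internals, so they may break between releases):

     loader  <- SparkR:::newJObject("com.example.TableLoader")
     fooJobj <- SparkR:::callJMethod(loader, "fooTable", sqlContext)
     fooDF   <- SparkR:::dataFrame(fooJobj)   # same wrapping as before, no rebuilt package needed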

  • If all you want is to create DataFrames, the simplest thing is to use raw SQL. It is clean, portable, requires no compilation or packaging, and simply works. Assuming you have a query string like the one below stored in a variable named q

     CREATE TEMPORARY TABLE foo
     USING org.apache.spark.sql.jdbc
     OPTIONS (
       url "jdbc:postgresql://localhost/test",
       dbtable "public.foo",
       driver "org.postgresql.Driver"
     )

    it can be used in R:

     sql(sqlContext, q)
     fooDF <- sql(sqlContext, "SELECT * FROM foo")

    Python:

     sqlContext.sql(q)
     fooDF = sqlContext.sql("SELECT * FROM foo")

    Scala:

     sqlContext.sql(q)
     val fooDF = sqlContext.sql("SELECT * FROM foo")

    or directly in Spark SQL.

  • Finally, you can use the Spark Data Sources API for consistent and supported cross-platform access.
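    A hedged sketch of the same JDBC table read through read.df (the options mirror the SQL above; the exact source name and option spelling depend on your Spark version and JDBC driver):

     fooDF <- read.df(sqlContext,
                      source = "org.apache.spark.sql.jdbc",
                      url = "jdbc:postgresql://localhost/test",
                      dbtable = "public.foo",
                      driver = "org.postgresql.Driver")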

Out of these three I would prefer raw SQL, followed by the Data Sources API for complex cases, and leave the internals as a last resort.

Edit (2016-08-04):

If you're interested in low-level access to the JVM, there is a relatively new package, rstudio/sparkapi, which exposes the internal SparkR RPC protocol. It is hard to predict how it will evolve, so use it at your own risk.
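For illustration only, a rough sketch of what calling into a jar might look like through the invoke-style helpers that package exposes (invoke_new / invoke, as later shipped in sparklyr); sc is assumed to be a connection obtained as described in the package's documentation, and com.example.TableLoader / fooTable remain placeholders:

  # hedged sketch: helper names follow the sparkapi/sparklyr invoke API;
  # consult the package docs for how to obtain the connection `sc`
  loader  <- invoke_new(sc, "com.example.TableLoader")
  fooJobj <- invoke(loader, "fooTable")   # further arguments, if any, follow the method name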
