
Spark REST API: Failed to find data source: com.databricks.spark.csv

I have a pyspark file stored on S3. I am trying to run it using the Spark REST API.

I am running the following command:

curl -X POST http://<ip-address>:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "testing.py" ],
  "appResource" : "s3n://accessKey:secretKey/<bucket-name>/testing.py",
  "clientSparkVersion" : "1.6.1",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "org.apache.spark.deploy.SparkSubmit",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://<ip-address>:6066",
    "spark.jars" : "spark-csv_2.10-1.4.0.jar",
    "spark.jars.packages" : "com.databricks:spark-csv_2.10:1.4.0"
  }
}'

and testing.py contains this code snippet:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# The script runs standalone under spark-submit, so it creates its own context
sc = SparkContext(appName="Simple App")
myContext = SQLContext(sc)
format = "com.databricks.spark.csv"
# location1, location2 and outLocation are S3 paths defined earlier in the script
dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)
dataFrame2 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location2).repartition(1)
outDataFrame = dataFrame1.join(dataFrame2, dataFrame1.values == dataFrame2.valuesId)
outDataFrame.write.format(format).option("header", "true").option("nullValue","").save(outLocation)

But on this line:

dataFrame1 = myContext.read.format(format).option("header", "true").option("inferSchema", "true").option("delimiter",",").load(location1).repartition(1)

I get this exception:

java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.csv.DefaultSource
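The Caused by line shows what Spark is actually looking for: when the format string is not itself a loadable class, Spark appends .DefaultSource to it and tries again, so the spark-csv jar has to be visible on the driver and executor classpath. A toy sketch of that naming convention (not Spark's real lookup code):

provider = "com.databricks.spark.csv"
# Spark's fallback candidate when the provider string is not itself a class:
candidate = provider + ".DefaultSource"  # -> com.databricks.spark.csv.DefaultSource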

I tried different things, and one of them was logging into the <ip-address> machine and running this command:

./bin/spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

so that it would download spark-csv into the .ivy2/cache folder. But that didn't solve the problem. What am I doing wrong?
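For comparison, the same coordinates resolve fine when passed to spark-submit itself, because the spark-submit client performs the Ivy resolution before launching the application. A minimal sketch, assuming a local copy of testing.py and the standalone master's default legacy port 7077:

./bin/spark-submit --master spark://<ip-address>:7077 --packages com.databricks:spark-csv_2.10:1.4.0 testing.py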

(Posted on behalf of the OP)

I first added spark-csv_2.10-1.4.0.jar on the driver and worker machines, and added:

"spark.driver.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",
"spark.executor.extraClassPath" : "absolute/path/to/spark-csv_2.10-1.4.0.jar",

Then I got the following error:

java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
Caused by: java.lang.ClassNotFoundException: org.apache.commons.csv.CSVFormat

Then I added commons-csv-1.4.jar on both machines and changed the entries to:

"spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
"spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",

And that solved my problem.
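For reference, a sketch of how the sparkProperties block from the original request ends up, assuming both jars sit at the same absolute paths on every machine (whether the spark.jars / spark.jars.packages lines are still needed isn't clear from the above; the extraClassPath entries are what fixed it):

"sparkProperties" : {
  "spark.driver.supervise" : "false",
  "spark.app.name" : "Simple App",
  "spark.eventLog.enabled" : "true",
  "spark.submit.deployMode" : "cluster",
  "spark.master" : "spark://<ip-address>:6066",
  "spark.driver.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar",
  "spark.executor.extraClassPath" : "/absolute/path/to/spark-csv_2.10-1.4.0.jar:/absolute/path/to/commons-csv-1.4.jar"
}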
