ClassNotFoundException: Failed to find data source: bigquery
I'm trying to load data from Google BigQuery into Spark running on Google Dataproc (I'm using Java). I tried to follow the instructions here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
I get the error: "ClassNotFoundException: Failed to find data source: bigquery".
My pom.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.virtualpairprogrammers</groupId>
<artifactId>learningSpark</artifactId>
<version>0.0.3-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.2.0</version>
</dependency>
<dependency>
<groupId>com.google.cloud.spark</groupId>
<artifactId>spark-bigquery_2.11</artifactId>
<version>0.9.1-beta</version>
<classifier>shaded</classifier>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.5.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<archive>
<manifest>
<mainClass>com.virtualpairprogrammers.Main</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
</plugins>
</build>
</project>
After adding the dependency to my pom.xml, Maven downloaded a lot of artifacts to build the .jar, so I think I have the correct dependency. However, Eclipse also warns me that "The import com.google.cloud.spark.bigquery is never used".
This is the part of my code where I get the error:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import com.google.cloud.spark.bigquery.*;

public class Main {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("testingSql")
                .getOrCreate();

        Dataset<Row> data = spark.read().format("bigquery")
                .option("table", "project.dataset.tablename")
                .load()
                .cache();
    }
}
I think you only added the BQ connector as a compile-time dependency, but it is missing at runtime. You need to either build an uber jar which includes the connector in your job jar (the doc needs to be updated), or include it when you submit the job:

gcloud dataproc jobs submit spark --properties spark.jars.packages=com.google.cloud.spark:spark-bigquery_2.11:0.9.1-beta
I faced the same issue and updated the format from "bigquery" to "com.google.cloud.spark.bigquery", and that worked for me.
Specifying the dependency in the build.sbt and using "com.google.cloud.spark.bigquery" as the format, as suggested by Peter, resolved the issue for me.
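For reference, a sketch of what that build.sbt dependency might look like. The coordinates are assumed from the gcloud command in the earlier answer (a Scala 2.11 build of the connector); adjust the artifact suffix to match your Scala version:

```scala
// Assumed coordinates for the Spark BigQuery connector (Scala 2.11 build)
libraryDependencies += "com.google.cloud.spark" % "spark-bigquery_2.11" % "0.9.1-beta"
```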