
How to use Spark BigQuery Connector locally?

For testing purposes, I would like to use the BigQuery Connector to write Parquet Avro logs to BigQuery. At the time of writing there is no way to ingest Parquet directly from the web UI, so I am writing a Spark job to do it.

In Scala, for the time being, the job body looks like this:

import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, BigQueryOutputFormat}
import com.google.gson.{Gson, JsonObject}
import org.apache.spark.rdd.RDD

val events: RDD[RichTrackEvent] =
readParquetRDD[RichTrackEvent, RichTrackEvent](sc, googleCloudStorageUrl)

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.project.id", "myproject")

// Output parameters
val projectId = conf.get("fs.gs.project.id")
val outputDatasetId = "logs"
val outputTableId = "test"
val outputTableSchema = LogSchema.schema

// Output configuration
BigQueryConfiguration.configureBigQueryOutput(
  conf, projectId, outputDatasetId, outputTableId, outputTableSchema
)
conf.set(
  "mapreduce.job.outputformat.class",
  classOf[BigQueryOutputFormat[_, _]].getName
)

events
  .mapPartitions {
    items =>
      val gson = new Gson()
      items.map(e => gson.fromJson(e.toString, classOf[JsonObject]))
  }
  .map(x => (null, x))
  .saveAsNewAPIHadoopDataset(conf)

Since BigQueryOutputFormat does not find Google credentials, it falls back on the metadata host and tries to discover them there, failing with the following stack trace:

2016-06-13 11:40:53 WARN  HttpTransport:993 - exception thrown while executing request
java.net.UnknownHostException: metadata
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:93)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.cloud.hadoop.util.CredentialFactory$ComputeCredentialWithRetry.executeRefreshToken(CredentialFactory.java:160)
at com.google.api.client.auth.oauth2.Credential.refreshToken(Credential.java:489)
at com.google.cloud.hadoop.util.CredentialFactory.getCredentialFromMetadataServiceAccount(CredentialFactory.java:207)
at com.google.cloud.hadoop.util.CredentialConfiguration.getCredential(CredentialConfiguration.java:72)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.createBigQueryCredential(BigQueryFactory.java:81)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQuery(BigQueryFactory.java:101)
at com.google.cloud.hadoop.io.bigquery.BigQueryFactory.getBigQueryHelper(BigQueryFactory.java:89)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.<init>(BigQueryOutputCommitter.java:70)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:102)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:84)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputFormat.getOutputCommitter(BigQueryOutputFormat.java:30)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1135)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)

This is expected, of course, but the job should be able to use my service account and its key, since GoogleCredential.getApplicationDefault() returns the appropriate credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
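
A minimal check, assuming GOOGLE_APPLICATION_CREDENTIALS points at the service account's JSON key, confirms that the default credentials resolve locally (it only verifies credential resolution, nothing about the connector):

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential

// Resolves the JSON key referenced by GOOGLE_APPLICATION_CREDENTIALS
val credential = GoogleCredential.getApplicationDefault()
println(credential.getServiceAccountId) // the service account's email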

Since the connector seems to read credentials from the Hadoop configuration, which keys should be set so that it picks up GOOGLE_APPLICATION_CREDENTIALS? Is there a way to configure the output format to use a provided GoogleCredential object?

If I understand your question correctly, you probably want to set:

<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.email</name>
<name>mapred.bq.auth.service.account.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>

Here, mapred.bq.auth.service.account.keyfile should point to the full file path of an old-style "P12" key file; alternatively, if you are using a newer "JSON" key file, replace the "email" and "keyfile" entries with a single mapred.bq.auth.service.account.json.keyfile key:

<name>mapred.bq.auth.service.account.enable</name>
<name>mapred.bq.auth.service.account.json.keyfile</name>
<name>mapred.bq.project.id</name>
<name>mapred.bq.gcs.bucket</name>
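
For instance, continuing with the conf from the question above, these keys can be set directly on the Hadoop configuration (a sketch; the key file path and bucket name are placeholders):

val conf = sc.hadoopConfiguration
conf.set("mapred.bq.auth.service.account.enable", "true")
// Full path to the service account's JSON key file (placeholder)
conf.set("mapred.bq.auth.service.account.json.keyfile", "/path/to/key.json")
conf.set("mapred.bq.project.id", "myproject")
// GCS bucket the connector uses for staging (placeholder)
conf.set("mapred.bq.gcs.bucket", "my-staging-bucket")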

Also, you may want to take a look at https://github.com/spotify/spark-bigquery - a much more civilised way of working with BigQuery from Spark. The setGcpJsonKeyFile method used there takes the same JSON key file you would set as mapred.bq.auth.service.account.json.keyfile when using the BQ connector for Hadoop.
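
For reference, usage with that library looks roughly like this (a sketch based on its README; the key file path, project id, bucket, and table names are placeholders):

import com.spotify.spark.bigquery._

// Same JSON key file you would set as mapred.bq.auth.service.account.json.keyfile
sqlContext.setGcpJsonKeyFile("/path/to/key.json")
sqlContext.setBigQueryProjectId("myproject")
sqlContext.setBigQueryGcsBucket("my-staging-bucket")

// Read a table into a DataFrame and write one back
val df = sqlContext.bigQueryTable("myproject:logs.test")
df.saveAsBigQueryTable("myproject:logs.test_copy")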
