Pyspark write dataframe to bigquery [error gs]
I'm trying to write a dataframe to a BigQuery table. I have set up the SparkSession with the required parameters. However, at the moment of doing the write I get an error:
Py4JJavaError: An error occurred while calling o114.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
The code is the following:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark2 = SparkSession.builder \
    .config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
    .config("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json") \
    .getOrCreate()

spark2.conf.set("parentProject", "xyz")

data = spark2.createDataFrame(
    [
        ("AAA", 51),
        ("BBB", 23),
    ],
    ["codiPuntSuministre", "valor"],
)

spark2.conf.set("temporaryGcsBucket", "bqconsumptions")

data.write.format("bigquery") \
    .option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json") \
    .option("table", "consumptions.c1") \
    .mode("append") \
    .save()

df = spark2.read.format("bigquery") \
    .option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json") \
    .load("consumptions.c1")
I don't get any error if I take the write out of the code, so the error occurs when trying to write and may be related to the auxiliary GCS bucket used to operate with BigQuery.
The error here suggests that Hadoop is not able to recognize the `gs` filesystem. You can add support for it with the configuration below. It happens because when you write to BigQuery the files are first staged in a Google Cloud Storage bucket and only then loaded into the BigQuery table.
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

Note that `fs.gs.impl` must point to `GoogleHadoopFileSystem` (the `org.apache.hadoop.fs.FileSystem` implementation); `GoogleHadoopFS` is the `AbstractFileSystem` implementation and is registered under `fs.AbstractFileSystem.gs.impl`.
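Equivalently, the same settings can be applied once at session-creation time via Spark's `spark.hadoop.*` prefix (Spark forwards these keys into the Hadoop configuration). A minimal sketch, reusing the placeholder jar, keyfile paths, and package version from the question:

```python
from pyspark.sql import SparkSession

# Sketch: configure the GCS connector when building the session instead of
# mutating the Hadoop configuration afterwards. The jar and keyfile paths
# below are the placeholders from the question, not real values.
spark = (
    SparkSession.builder
    # BigQuery connector pulled from Maven
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")
    # GCS connector jar downloaded locally
    .config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar")
    # Tell Hadoop how to resolve the gs:// scheme
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    # Service-account credentials for the GCS connector
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json")
    .getOrCreate()
)
```

This avoids reaching into the private `_jsc` attribute and keeps all connector configuration in one place.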