Problem in reading string NULL values from BigQuery

Currently I am using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. One issue that I am facing is that null string values are not being handled properly when Spark reads them from BigQuery. It reads the null string values, but in the CSV it writes them as empty strings wrapped in double quotes (i.e., like this: "").

# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', <bq_dataset> + <bq_table>) \
    .load()
bqdf.createOrReplaceTempView('bqdf')

# Select required data into another df
bqdf2 = spark.sql(
    'SELECT * FROM bqdf')

# Write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite', sep='|')

I have tried the emptyValue='' and nullValue options with df.write.csv() while writing the CSV, but it doesn't work.
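For reference, the write call with those options looked roughly like this (same placeholders as in the code above):

bqdf2.write.csv(
    <gcs_data_path> + <bq_table> + '/',
    mode='overwrite',
    sep='|',
    emptyValue='',  # tried this option
    nullValue='')   # and this one, separately and together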

I need a solution for this problem; if anyone else has faced this issue, any help would be appreciated. Thanks!

I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery. The data is as follows:

[image: sample table with name and age columns, including a row where name is null]

According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:

nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.

This is what you are looking for. Then, I just implemented the nullValue option as shown below.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)

# Read the data from BigQuery as a Spark Dataframe.
data = spark.read.format("bigquery").option(
    "table", "dataset.table").load()

# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")

# Select required data into another df
data_view2 = spark.sql(
    'SELECT * FROM data_view')

# Write the result to GCS as CSV, representing null values as empty strings.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')

data_view2.show()

Notice that I have used data_view2.show() to print out the view in order to check that it was read correctly. The output was:

+------+---+
|  name|age|
+------+---+
|Robert| 25|
|  null| 23|
+------+---+

Therefore, the null value was interpreted correctly. In addition, I also checked the .csv file:

name,age
Robert,25
,23

As you can see, the null value is correct and is not represented as an empty string between double quotes. Finally, as a last check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted accurately.
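For completeness, that last load-job check can also be scripted; the following is a minimal sketch using the google-cloud-bigquery Python client, where the bucket path and destination table ID are placeholders rather than the ones used above:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder source CSV path and destination table ID.
uri = "gs://bucket/folder/part-*.csv"
table_id = "your-project.dataset.table_from_csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # the CSV was written with header=True
    autodetect=True,      # let BigQuery infer the schema
)

# Start the load job and wait for it to finish.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()

print(client.get_table(table_id).num_rows, "rows loaded")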

Note: I ran the PySpark job from the Dataproc Jobs console on a previously created Dataproc cluster. Also, the cluster was in the same location as the dataset in BigQuery.
