CSV data is not loading properly as Parquet using Spark

Question

I have a table in Hive:

CREATE TABLE tab_data (
  rec_id INT,
  rec_name STRING,
  rec_value DECIMAL(3,1),
  rec_created TIMESTAMP
) STORED AS PARQUET;

and I want to populate this table with data from .csv files like this:

10|customer1|10.0|2016-09-07  08:38:00.0
20|customer2|24.0|2016-09-08  10:45:00.0
30|customer3|35.0|2016-09-10  03:26:00.0
40|customer1|46.0|2016-09-11  08:38:00.0
50|customer2|55.0|2016-09-12  10:45:00.0
60|customer3|62.0|2016-09-13  03:26:00.0
70|customer1|72.0|2016-09-14  08:38:00.0
80|customer2|23.0|2016-09-15  10:45:00.0
90|customer3|30.0|2016-09-16  03:26:00.0

using Spark and Scala, with the code below:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType, StructField, StructType, TimestampType}

object MainApp {

  val spark = SparkSession
    .builder()
    .appName("MainApp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions","200") 
    .getOrCreate()

  val sc = spark.sparkContext

  val inputPath = "hdfs://host.hdfs:8020/..../tab_data.csv"
  val outputPath = "hdfs://host.hdfs:8020/...../warehouse/test.db/tab_data"

  def main(args: Array[String]): Unit = {

    try {

      val DecimalType = DataTypes.createDecimalType(3, 1)

      /**
        * schema
        */
      val schema = StructType(List(StructField("rec_id", IntegerType, true), StructField("rec_name",StringType, true),
        StructField("rec_value",DecimalType),StructField("rec_created",TimestampType, true)))

      /**
        * Reading the data from HDFS 
        */
      val data = spark
        .read
        .option("sep","|")
        .schema(schema)
        .csv(inputPath)

      data.show(truncate = false)
      data.schema.printTreeString()

      /**
        * Writing the data as Parquet
        */
      data
        .write
        .mode(SaveMode.Append)
        .parquet(outputPath)

    } finally {
      sc.stop()    
      spark.stop()
    }
  }
}

The problem is that I am getting this output:

+------+--------+---------+-----------+
|rec_id|rec_name|rec_value|rec_created|
+------+--------+---------+-----------+
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |


root
 |-- rec_id: integer (nullable = true)
 |-- rec_name: string (nullable = true)
 |-- rec_value: decimal(3,1) (nullable = true)
 |-- rec_created: timestamp (nullable = true)

The schema is fine, but the data is not loading properly into the table:

SELECT * FROM tab_data;

+------------------+--------------------+---------------------+-----------------------+--+
| tab_data.rec_id  | tab_data.rec_name  | tab_data.rec_value  | tab_data.rec_created  |
+------------------+--------------------+---------------------+-----------------------+--+
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |

What am I doing wrong?

I'm new to Spark, and any help would be appreciated.

Answer 1

You are getting null values in all columns because one of the String columns cannot be converted to the Timestamp type.

To convert the string to a timestamp, specify the timestamp format with option("timestampFormat","yyyy-MM-dd HH:mm:ss.S") while loading the CSV data.
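Before changing the Spark job, it can help to check the pattern against one value from the file. The sketch below uses java.time directly, not Spark's reader, so it is only an approximation: Spark's own timestamp parser differs between versions and may be more or less lenient. The object and method names are made up for illustration. Note that the sample rows, as rendered in the question, show two spaces between date and time, while the pattern has one; under java.time that difference is enough to make parsing fail.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampPatternCheck {
  // Pattern taken from the timestampFormat option in the answer.
  private val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.S")

  // Returns the parsed value, or None when the text does not match the pattern.
  def parse(text: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(text, formatter)).toOption

  def main(args: Array[String]): Unit = {
    // One space between date and time: matches the pattern.
    println(parse("2016-09-07 08:38:00.0"))
    // Two spaces between date and time: does not match the pattern.
    println(parse("2016-09-07  08:38:00.0"))
    // No fractional second: does not match the trailing ".S".
    println(parse("2016-09-07 08:38:00"))
  }
}
```

If the second call is the one that reflects your file, adjust the pattern (or clean the input) so the whitespace matches exactly.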

Check the code below.

Schema

scala> import org.apache.spark.sql.types._

scala> val schema = StructType(List(
   StructField("rec_id", IntegerType, true),
   StructField("rec_name", StringType, true),
   StructField("rec_value", DecimalType(3, 1), true),
   StructField("rec_created", TimestampType, true)
))

Loading CSV Data

scala> val df = spark
  .read
  .option("sep", "|")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S")
  // An explicit schema is supplied, so inferSchema is not needed.
  .schema(schema)
  .csv("/tmp/sample")

scala> df.show(false)
+------+---------+---------+-------------------+
|rec_id|rec_name |rec_value|rec_created        |
+------+---------+---------+-------------------+
|10    |customer1|10.0     |2016-09-07 08:38:00|
|20    |customer2|24.0     |2016-09-08 10:45:00|
|30    |customer3|35.0     |2016-09-10 03:26:00|
|40    |customer1|46.0     |2016-09-11 08:38:00|
|50    |customer2|55.0     |2016-09-12 10:45:00|
|60    |customer3|62.0     |2016-09-13 03:26:00|
|70    |customer1|72.0     |2016-09-14 08:38:00|
|80    |customer2|23.0     |2016-09-15 10:45:00|
|90    |customer3|30.0     |2016-09-16 03:26:00|
+------+---------+---------+-------------------+
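If rows still come back as null, one way to find the offending field is to make the reader fail instead of silently producing nulls. The following is a minimal sketch, assuming Spark SQL is on the classpath; the object name, the method name, and the path argument are placeholders. The "mode" option with the value FAILFAST is a standard CSV reader option that throws an exception when a malformed record is met.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object StrictCsvRead {
  // Same columns as the Hive table in the question.
  val schema: StructType = StructType(List(
    StructField("rec_id", IntegerType, nullable = true),
    StructField("rec_name", StringType, nullable = true),
    StructField("rec_value", DecimalType(3, 1), nullable = true),
    StructField("rec_created", TimestampType, nullable = true)
  ))

  // Builds a reader that throws on malformed records instead of returning nulls.
  def readStrict(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("sep", "|")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S")
      .option("mode", "FAILFAST")
      .schema(schema)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StrictCsvRead")
      .master("local[*]")
      .getOrCreate()
    try {
      // args(0) is a placeholder for the CSV path.
      // show() selects every column, so each field of the displayed rows is parsed.
      readStrict(spark, args(0)).show(truncate = false)
    } finally {
      spark.stop()
    }
  }
}
```

The exception message normally points at the record that could not be parsed, which narrows the problem down to a delimiter, number, or timestamp mismatch.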

Update

Since the table is a managed table, you don't need to set all those parameters. You can use the insertInto function to insert the data into the table.

df.write.mode("append").insertInto("tab_data")
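Two details are worth keeping in mind with insertInto: the session has to be able to see the Hive table, and insertInto matches columns by position rather than by name. The sketch below is one way to wire this up, under several assumptions: the job is launched with spark-submit (so no master is set), the spark-hive module and the Hive metastore configuration are available, and the two program arguments (a CSV path and a table name such as test.tab_data, guessed from the warehouse path in the question) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.types._

object InsertIntoSketch {
  // Same columns, in the same order, as the CREATE TABLE statement in the question.
  val schema: StructType = StructType(List(
    StructField("rec_id", IntegerType, nullable = true),
    StructField("rec_name", StringType, nullable = true),
    StructField("rec_value", DecimalType(3, 1), nullable = true),
    StructField("rec_created", TimestampType, nullable = true)
  ))

  // insertInto resolves columns by position, so put them in table order first.
  def appendTo(df: DataFrame, table: String): Unit =
    df.select(schema.fieldNames.head, schema.fieldNames.tail: _*)
      .write
      .mode(SaveMode.Append)
      .insertInto(table)

  def main(args: Array[String]): Unit = {
    // Hypothetical arguments: CSV path and target table name.
    val Array(csvPath, table) = args

    val spark = SparkSession.builder()
      .appName("InsertIntoSketch")
      .enableHiveSupport() // lets the session resolve tables in the Hive metastore
      .getOrCreate()
    try {
      val df = spark.read
        .option("sep", "|")
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S")
        .schema(schema)
        .csv(csvPath)
      appendTo(df, table)
    } finally {
      spark.stop()
    }
  }
}
```

Selecting the columns explicitly is a defensive choice: if the DataFrame's column order ever drifts from the table's, the select keeps values from landing in the wrong column.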

Answer 2

To deal with issues between Spark, Hive and Parquet, set up your SparkSession as follows:

  val spark = SparkSession
    .builder()
    .appName("CsvToParquet")
    .master("local[*]")
    // Change to a more reasonable default number of partitions for the data.
    .config("spark.sql.shuffle.partitions", "200")
    // Avoids data type issues between Spark and Hive. The convention Spark uses
    // to write Parquet data is set by spark.sql.parquet.writeLegacyFormat.
    // The default is false; if set to true, Spark writes Parquet data using
    // the same convention as Hive.
    .config("spark.sql.parquet.writeLegacyFormat", true)
    .getOrCreate()

Afterwards, read the .csv data as follows:

      val data = spark
        .read
        .option("sep","|")
        .option("timestampFormat","yyyy-MM-dd HH:mm:ss.S") // to read timestamp fields
        .option("inferSchema",false) // by default is false
        .schema(schema)
        .csv(inputPath)

Then write the data as Parquet with no compression (by default the data is compressed), as follows:

      data
        .write
        .mode(SaveMode.Append)
        .option("compression", "none") // Assuming no data compression
        .parquet(outputPath)

Note: Hive probably cannot query the data because Spark compresses Parquet output with snappy by default, while your CREATE TABLE statement stores the data as Parquet without compression.
