Csv Data is not loading properly as Parquet using Spark
I have a table in Hive:
CREATE TABLE tab_data (
rec_id INT,
rec_name STRING,
rec_value DECIMAL(3,1),
rec_created TIMESTAMP
) STORED AS PARQUET;
and I want to populate this table with data from .csv files like these:
10|customer1|10.0|2016-09-07 08:38:00.0
20|customer2|24.0|2016-09-08 10:45:00.0
30|customer3|35.0|2016-09-10 03:26:00.0
40|customer1|46.0|2016-09-11 08:38:00.0
50|customer2|55.0|2016-09-12 10:45:00.0
60|customer3|62.0|2016-09-13 03:26:00.0
70|customer1|72.0|2016-09-14 08:38:00.0
80|customer2|23.0|2016-09-15 10:45:00.0
90|customer3|30.0|2016-09-16 03:26:00.0
using Spark and Scala, with the code below:
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType, StructField, StructType, TimestampType}

object MainApp {
  val spark = SparkSession
    .builder()
    .appName("MainApp")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions","200")
    .getOrCreate()
  val sc = spark.sparkContext
  val inputPath = "hdfs://host.hdfs:8020/..../tab_data.csv"
  val outputPath = "hdfs://host.hdfs:8020/...../warehouse/test.db/tab_data"

  def main(args: Array[String]): Unit = {
    try {
      val DecimalType = DataTypes.createDecimalType(3, 1)
      /**
       * schema
       */
      val schema = StructType(List(StructField("rec_id", IntegerType, true), StructField("rec_name",StringType, true),
        StructField("rec_value",DecimalType),StructField("rec_created",TimestampType, true)))
      /**
       * Reading the data from HDFS
       */
      val data = spark
        .read
        .option("sep","|")
        .schema(schema)
        .csv(inputPath)
      data.show(truncate = false)
      data.schema.printTreeString()
      /**
       * Writing the data as Parquet
       */
      data
        .write
        .mode(SaveMode.Append)
        .parquet(outputPath)
    } finally {
      sc.stop()
      spark.stop()
    }
  }
}
The problem is that I am getting this output:
+------+--------+---------+-----------+
|rec_id|rec_name|rec_value|rec_created|
+------+--------+---------+-----------+
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
|null |null |null |null |
root
|-- rec_id: integer (nullable = true)
|-- rec_name: string (nullable = true)
|-- rec_value: decimal(3,1) (nullable = true)
|-- rec_created: timestamp (nullable = true)
The schema is fine, but the data is not loaded properly into the table:
SELECT * FROM tab_data;
+------------------+--------------------+---------------------+-----------------------+--+
| tab_data.rec_id | tab_data.rec_name | tab_data.rec_value | tab_data.rec_created |
+------------------+--------------------+---------------------+-----------------------+--+
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
| NULL | NULL | NULL | NULL |
What am I doing wrong?
I'm new to Spark and would appreciate some help.
You are getting null values in all columns because one of the String columns (rec_created) cannot be converted to the Timestamp type.
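By default the CSV reader runs in PERMISSIVE mode, which tolerates records it cannot parse instead of reporting them. One way to surface such failures is the reader's `mode` option. The following is a minimal sketch, not the asker's code: `StrictCsvRead` and `readStrict` are hypothetical names, and the schema is copied from the table definition in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

object StrictCsvRead {
  // Same columns and types as the Hive table in the question.
  val schema: StructType = StructType(List(
    StructField("rec_id", IntegerType, nullable = true),
    StructField("rec_name", StringType, nullable = true),
    StructField("rec_value", DecimalType(3, 1), nullable = true),
    StructField("rec_created", TimestampType, nullable = true)))

  // Hypothetical helper: with mode=FAILFAST the read throws when a record
  // cannot be converted to the schema, rather than tolerating it.
  def readStrict(spark: SparkSession, path: String, timestampFormat: String): DataFrame =
    spark.read
      .option("sep", "|")
      .option("mode", "FAILFAST")
      .option("timestampFormat", timestampFormat)
      .schema(schema)
      .csv(path)
}
```

Because Spark evaluates lazily, any exception appears when an action such as `show()` or `count()` runs, not when `readStrict` returns.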
To convert the string to the timestamp type, specify the timestamp format with option("timestampFormat","yyyy-MM-dd HH:mm:ss.S") while loading the csv data.
See the code below.
Schema
scala> import org.apache.spark.sql.types._
scala> val schema = StructType(List(
         StructField("rec_id", IntegerType, true),
         StructField("rec_name", StringType, true),
         StructField("rec_value", DecimalType(3,1)),
         StructField("rec_created", TimestampType, true))
       )
Loading CSV Data
scala> val df = spark
         .read
         .option("sep","|")
         .option("inferSchema","true") // has no effect here, because an explicit schema is supplied
         .option("timestampFormat","yyyy-MM-dd HH:mm:ss.S")
         .schema(schema)
         .csv("/tmp/sample")
scala> df.show(false)
+------+---------+---------+-------------------+
|rec_id|rec_name |rec_value|rec_created |
+------+---------+---------+-------------------+
|10 |customer1|10.0 |2016-09-07 08:38:00|
|20 |customer2|24.0 |2016-09-08 10:45:00|
|30 |customer3|35.0 |2016-09-10 03:26:00|
|40 |customer1|46.0 |2016-09-11 08:38:00|
|50 |customer2|55.0 |2016-09-12 10:45:00|
|60 |customer3|62.0 |2016-09-13 03:26:00|
|70 |customer1|72.0 |2016-09-14 08:38:00|
|80 |customer2|23.0 |2016-09-15 10:45:00|
|90 |customer3|30.0 |2016-09-16 03:26:00|
+------+---------+---------+-------------------+
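To see why the pattern matters, the sample value can be checked against two patterns in plain Scala with `java.time`. This is only an illustration: Spark's own CSV parser is not involved, `TimestampPatternCheck` and `matches` are hypothetical names, and the ISO-style pattern is assumed to resemble the default of older Spark releases (check the documentation of your version).

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object TimestampPatternCheck {
  // Returns true when `value` can be parsed with `pattern`.
  def matches(value: String, pattern: String): Boolean =
    Try(LocalDateTime.parse(value, DateTimeFormatter.ofPattern(pattern))).isSuccess

  def main(args: Array[String]): Unit = {
    val sample = "2016-09-07 08:38:00.0" // value taken from the first CSV row

    // The pattern from the answer describes the value exactly.
    println(matches(sample, "yyyy-MM-dd HH:mm:ss.S")) // true

    // An ISO-style pattern expects a 'T' separator and a zone offset.
    println(matches(sample, "yyyy-MM-dd'T'HH:mm:ss.SSSXXX")) // false
  }
}
```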
Update
Since the table is a managed table, you don't need to set all those parameters; you can use the insertInto function to insert the data into the table:
df.write.mode("append").insertInto("tab_data")
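Keep in mind that `insertInto` matches columns by position rather than by name, so the DataFrame's column order has to agree with the table's. A minimal sketch, assuming the target table already exists in the session catalog and that its column names contain no dots; `TableAppend` and `appendInTableOrder` are hypothetical names:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TableAppend {
  // Reorders the DataFrame's columns to the table's column order before
  // appending, because insertInto resolves columns by position, not by name.
  def appendInTableOrder(spark: SparkSession, df: DataFrame, table: String): Unit = {
    val tableColumns = spark.table(table).columns
    df.select(tableColumns.map(col): _*)
      .write
      .mode("append")
      .insertInto(table)
  }
}
```

For a DataFrame read with the same schema as the table, as in this answer, the reordering changes nothing; it only guards against a schema whose fields were declared in a different order.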
To deal with issues between Spark, Hive and Parquet, set up your SparkSession as follows:
val spark = SparkSession
  .builder()
  .appName("CsvToParquet")
  .master("local[*]")
  // Number of shuffle partitions; adjust to a value that suits your data
  .config("spark.sql.shuffle.partitions", "200")
  // Avoids data type issues between Spark and Hive.
  // The convention Spark uses to write Parquet data is configurable through
  // spark.sql.parquet.writeLegacyFormat. The default value is false; if set to true,
  // Spark uses the same convention as Hive for writing the Parquet data.
  .config("spark.sql.parquet.writeLegacyFormat", true)
  .getOrCreate()
Afterwards, read the .csv data as follows:
val data = spark
  .read
  .option("sep", "|")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S") // to read timestamp fields
  .option("inferSchema", false) // false is the default
  .schema(schema)
  .csv(inputPath)
Then write the data as parquet with no compression (by default the data is compressed) as follows:
data
  .write
  .mode(SaveMode.Append)
  .option("compression", "none") // Assuming no data compression
  .parquet(outputPath)
Note: The reason Hive cannot query the data is probably that the data is compressed in snappy format by default, whereas your CREATE TABLE statement stores the data as parquet without compression.
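After writing, a quick sanity check is to read the Parquet files back with Spark itself and count the rows whose key column is null. This is a sketch under the assumption that rec_id should never be null in valid data; `ParquetCheck` and `countNullIds` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParquetCheck {
  // Reads the Parquet directory back and returns
  // (total number of rows, number of rows with a null rec_id).
  def countNullIds(spark: SparkSession, path: String): (Long, Long) = {
    val written = spark.read.parquet(path)
    (written.count(), written.filter(col("rec_id").isNull).count())
  }
}
```

If Spark reads the values back correctly while Hive still shows NULL, the problem more likely lies in how Hive reads the files than in the CSV parsing step.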