
How to read from a textfile (String type data), map and load the data into Parquet format (multiple columns with different datatypes) in Spark Scala dynamically

We are importing data from a source RDBMS system into the Hadoop environment using Sqoop, in textfile format. This text file then needs to be loaded into a Hive table stored as Parquet. How can we approach this scenario without using Hive support (earlier we used beeline inserts, and we are designing the pipeline not to use Hive anymore) and write directly to HDFS in Parquet format?

Example: after the Sqoop import, let's say we have a file under the HDFS target directory /data/loc/mydb/Mytable

Data in Mytable, where all fields are of type String:

-----------------------------------------
10|customer1|10.0|2016-09-07 08:38:00.0
20|customer2|20.0|2016-09-08 10:45:00.0
30|customer3|30.0|2016-09-10 03:26:00.0
------------------------------------------

Target Hive table schema:

rec_id: int
rec_name: String
rec_value: Decimal(2,1)
rec_created: Timestamp

How can we load the data from Mytable into the target Hive table's underlying location (in Parquet format) using Spark, handling the typecasting of all columns dynamically?

Please note: we cannot use HiveContext here. Any help with the approach is much appreciated. Thanks in advance.

The example below reads a .csv file in the same format as the one presented in the question.

There are some details I would like to explain first.

In the table schema, the field rec_value: Decimal(2,1) would have to be rec_value: Decimal(3,1), for the following reason:

The DECIMAL type represents numbers with fixed precision and scale. When you create a DECIMAL column, you specify the precision, p, and the scale, s. Precision is the total number of digits, regardless of the location of the decimal point. Scale is the number of digits after the decimal point. To represent the number 10.0 without a loss of precision, you need a DECIMAL type with a precision of at least 3 and a scale of at least 1.
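As a quick illustration of that rule (a minimal sketch, not part of the original answer, assuming a running SparkSession val named spark such as the one in the full code below): casting the string "10.0" to DECIMAL(2,1) overflows and yields null, while DECIMAL(3,1) keeps the value.

import org.apache.spark.sql.functions.col
import spark.implicits._   // assumes a SparkSession val named spark

Seq("10.0").toDF("v")
  .select(
    col("v").cast("decimal(2,1)").as("dec_2_1"),   // null: 10.0 needs precision 3
    col("v").cast("decimal(3,1)").as("dec_3_1"))   // 10.0
  .show()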

So the Hive table would be:

CREATE TABLE tab_data (
  rec_id INT,
  rec_name STRING,
  rec_value DECIMAL(3,1),
  rec_created TIMESTAMP
) STORED AS PARQUET;

The full Scala code:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType, StructField, StructType, TimestampType}

object CsvToParquet {

  val spark = SparkSession
    .builder()
    .appName("CsvToParquet")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions","200") //Change to a more reasonable default number of partitions for our data
    .config("spark.sql.parquet.writeLegacyFormat", true) // To avoid issues with data type between Spark and Hive
                                                         // The convention used by Spark to write Parquet data is configurable.
                                                         // This is determined by the property spark.sql.parquet.writeLegacyFormat
                                                         // The default value is false. If set to "true",
                                                         // Spark will use the same convention as Hive for writing the Parquet data.
    .getOrCreate()

  val sc = spark.sparkContext

  val inputPath = "hdfs://host:port/user/...../..../tab_data.csv"
  val outputPath = "hdfs://host:port/user/hive/warehouse/test.db/tab_data"

  def main(args: Array[String]): Unit = {

    Logger.getRootLogger.setLevel(Level.ERROR)

    try {

      val DecimalType = DataTypes.createDecimalType(3, 1)

      /**
        * the data schema
        */
      val schema = StructType(List(
        StructField("rec_id", IntegerType, true),
        StructField("rec_name", StringType, true),
        StructField("rec_value", DecimalType, true),
        StructField("rec_created", TimestampType, true)))

      /**
        * Reading the data from HDFS as .csv text file
        */
      val data = spark
        .read
        .option("sep","|")
        .option("timestampFormat","yyyy-MM-dd HH:mm:ss.S")
        .option("inferSchema",false)
        .schema(schema)
        .csv(inputPath)

       data.show(truncate = false)
       data.schema.printTreeString()

      /**
        * Writing the data as Parquet file
        */
      data
        .write
        .mode(SaveMode.Append)
        .option("compression", "none") // Assuming no data compression
        .parquet(outputPath)

    } finally {
      sc.stop()
      println("SparkContext stopped")
      spark.stop()
      println("SparkSession stopped")
    }
  }
}
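The question also asks about handling the typecasting dynamically. As a side note (not part of the original answer), one way to do that is to read every field as a string and drive the casts from a list of column-name/type pairs, so the target layout can come from configuration instead of a hard-coded StructType. The snippet below is a minimal sketch under that assumption: it is meant to be dropped into the try block of main above in place of the fixed-schema read, reusing spark, inputPath, outputPath and SaveMode from the object, and the column names are just the ones from this example.

      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types.{DataType, DataTypes, IntegerType, StringType, TimestampType}

      // Target layout as (column name, target type) pairs; in a real job this
      // could be loaded from a config file or a metadata table.
      val targetSchema: Seq[(String, DataType)] = Seq(
        "rec_id"      -> IntegerType,
        "rec_name"    -> StringType,
        "rec_value"   -> DataTypes.createDecimalType(3, 1),
        "rec_created" -> TimestampType
      )

      // Read every field as a string first (no schema, no inference) ...
      val raw = spark.read
        .option("sep", "|")
        .csv(inputPath)
        .toDF(targetSchema.map(_._1): _*)   // assumes the file columns are in the same order

      // ... then cast each column to its target type by name.
      val typed = raw.select(targetSchema.map { case (name, dt) => col(name).cast(dt).as(name) }: _*)

      typed.write.mode(SaveMode.Append).parquet(outputPath)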

Input file as .csv with pipe-delimited fields:

10|customer1|10.0|2016-09-07 08:38:00.0
20|customer2|24.0|2016-09-08 10:45:00.0
30|customer3|35.0|2016-09-10 03:26:00.0
40|customer1|46.0|2016-09-11 08:38:00.0
........

Reading with Spark:

+------+---------+---------+-------------------+
|rec_id|rec_name |rec_value|rec_created        |
+------+---------+---------+-------------------+
|10    |customer1|10.0     |2016-09-07 08:38:00|
|20    |customer2|24.0     |2016-09-08 10:45:00|
|30    |customer3|35.0     |2016-09-10 03:26:00|
|40    |customer1|46.0     |2016-09-11 08:38:00|
......

The schema:

root
 |-- rec_id: integer (nullable = true)
 |-- rec_name: string (nullable = true)
 |-- rec_value: decimal(3,1) (nullable = true)
 |-- rec_created: timestamp (nullable = true)
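
Optionally (not part of the original answer), the written files can be checked directly from Spark before querying Hive, assuming the same outputPath:

spark.read.parquet(outputPath).show(truncate = false)   // data read back from the Parquet output
spark.read.parquet(outputPath).printSchema()            // schema of the written files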

Reading from Hive:

SELECT *
FROM tab_data;

+------------------+--------------------+---------------------+------------------------+--+
| tab_data.rec_id  | tab_data.rec_name  | tab_data.rec_value  |  tab_data.rec_created  |
+------------------+--------------------+---------------------+------------------------+--+
| 10               | customer1          | 10                  | 2016-09-07 08:38:00.0  |
| 20               | customer2          | 24                  | 2016-09-08 10:45:00.0  |
| 30               | customer3          | 35                  | 2016-09-10 03:26:00.0  |
| 40               | customer1          | 46                  | 2016-09-11 08:38:00.0  |
.....
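
A related detail (not covered in the original answer): the outputPath in the Spark code has to point at the Hive table's storage location for the query above to see the new files. Assuming the table was created as shown earlier, the location can be checked from beeline with:

-- the "Location" row of the output should match outputPath
DESCRIBE FORMATTED tab_data;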

Hope this helps.
