How to load a parquet file into a Hive Table using Spark?
So, I am trying to load a CSV file, save it as a Parquet file, and then load it into a Hive table. However, whenever I load it into the table, the values are out of place and scrambled. I am using PySpark/Hive.
Here is the content of my csv file:
Here is my code to convert the CSV to Parquet and write it to my HDFS location:
# This creates the SparkSession
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("S_POCC")
         .enableHiveSupport()
         .getOrCreate())
df = spark.read.load('/user/new_file.csv', format="csv", sep=",", inferSchema="true", header="false")
df.write.save('hdfs://my_path/table/test1.parquet')
This successfully converts it to Parquet and writes it to the path; however, when I load it using the following statements in Hive, it gives weird output.
Hive statements:
drop table sndbx_test.test99 purge;
create external table if not exists test99 (c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string);
load data inpath 'hdfs://my_path/table/test1.parquet' into table test99;
Any ideas/suggestions?
Instead of saving as Parquet with df.write.save('hdfs://my_path/table/test1.parquet') and then trying to load it into Hive, you can write directly into a Hive table like below:
(df.write
   .format("parquet")
   .partitionBy('yourpartitioncolumns')
   .saveAsTable('yourtable'))
OR
(df.write
   .format("parquet")
   .partitionBy('yourpartitioncolumns')
   .insertInto('yourtable'))
Note: if you don't have partition columns and the table is non-partitioned, then partitionBy is not needed.
Instead of creating a table and then loading the data into it, you can do both in one statement:
CREATE EXTERNAL TABLE IF NOT EXISTS test99 ( c0 string, c1 string, c2 string, c3 string, c4 string, c5 string, c6 string)
STORED AS PARQUET
LOCATION 'hdfs://my_path/table/' ;
If you describe your table, it would most probably show that your table stores data in a different format (ORC is the default in many Hive setups). Hence, while creating your table, make sure you mention the format in which the underlying data is stored, in this case Parquet.