How do I load data correctly in Hive using Spark?
I want to load data that looks like this:
"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""
My CREATE TABLE statement is:
sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")
When I run the statement
sqlContext.sql("from dummy11 select age").show()
or
sqlContext.sql("from dummy11 select y").show()
it returns NULL values instead of the correct values, though other columns are visible.
So how do I correct this?
As you are using Hive QL syntax, you need to validate the input data before processing.
In your data, some records have fewer columns than the number of columns defined in the DDL. For those records, the remaining (trailing) columns are set to NULL, because the row does not have enough values. That is why the last column y has NULL values.
Also, in the DDL the first field's data type is INT, but in the records the first field values are:
"58
"44
"33
Because of the leading " character, the values cannot be type-cast to INT, so the field value is set to NULL.
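To illustrate this in plain Scala (a sketch; Hive performs this cast internally): a clean numeric string parses fine, while the quoted one fails, and a failed cast is exactly what Hive maps to NULL.

```scala
import scala.util.Try

// Hive silently turns a failed INT cast into NULL rather than failing the query.
// Modelled here with Try: the clean value parses, the quoted value does not.
val clean  = Try("58".toInt).toOption    // Some(58)
val quoted = Try("\"58".toInt).toOption  // None -- Hive would store NULL here
```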
As per the DDL and the data you provided, the values are getting set as:
age            "58
job            ""management""
marital        ""married""
education      ""tertiary""
default        ""no""
housing        2143
loan           ""yes""
contact        ""no""
month          ""unknown""
day_of_week    5
duration       ""may""
campaign       261
pday           1
previous       -1
poutcome       0
emp_var_rate   ""unknown""
cons_price_idx ""no"""
cons_conf_idx  NULL
euribor3m      NULL
nr_employed    NULL
y              NULL
Check the NULL values for the last 5 columns (cons_price_idx receives the string ""no""", which also becomes NULL in an INT column).
So, if that is not expected, you need to validate the data before proceeding. And for the column age, if you need it as INT, cleanse the data to remove the unwanted " character.
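A minimal cleansing sketch in plain Scala (the helper cleanAge is hypothetical, not part of the original answer): strip the stray quote before parsing, which is what a Spark row-level transformation would do per value.

```scala
import scala.util.Try

// Remove any double quotes, then attempt the Int parse; None stands in for
// the NULL that Hive would produce for an unparseable value.
def cleanAge(raw: String): Option[Int] =
  Try(raw.replace("\"", "").trim.toInt).toOption

cleanAge("\"58") // Some(58)
cleanAge("58")   // Some(58) -- safe even when no quote is present
```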
WORKAROUND
As a workaround, you can define age as STRING to begin with, and use Spark transformations to parse the first field and convert it to INT:
import org.apache.spark.sql.functions._

// UDF that drops the leading " and parses the remainder as an Int
val ageInINT = udf { (make: String) =>
  Integer.parseInt(make.substring(1))
}
df.withColumn("ageInINT", ageInINT(df("age"))).show()
Here df is your DataFrame created while executing the Hive DDL with column age as STRING.
Now, you can perform operations on the new column ageInINT, which holds INTEGER values, rather than on the column age.
Since your data contains " just before the age, it is treated as a string. In the DDL you have defined it as int, so the SQL parser tries to read an integer value and you get NULL records. Change age int to age string and you will be able to see the result.
Please see the working example below, using Spark HiveContext.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// conf was undefined in the original snippet; a minimal SparkConf is added here
val conf = new SparkConf().setAppName("HiveLoadExample")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()
Its output:
+---+----------------+
|age| job|
+---+----------------+
|"58| ""management""|
|"44| ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+