How to load data into Hive correctly using Spark?

Question

I want to load data that looks like this -

"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""

My create table statement is -

sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")

When I run the statement -

sqlContext.sql("from dummy11 select age").show()

or

sqlContext.sql("from dummy11 select y").show()

it returns NULL values instead of the correct values, although the other columns are displayed.

So how do I correct this?

Answer 1

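When using Hive QL syntax, you need to validate the input data before processing it.

In your data, some records have fewer fields than the number of columns defined in the DDL.

For those records, the remaining columns (counting from the end) are set to NULL, because the row does not have enough values.

That is why the last column, y, is NULL.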
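The column-count mismatch can be checked without Spark. The following is a minimal sketch in plain Scala that splits the first sample record on `;` and compares the field count with the column names from the DDL. The object name `FieldCount` and the single-quote substitution trick are illustrative choices, not part of the original post.

```scala
object FieldCount {
  // Column names copied from the create table statement in the question.
  val ddlColumns: Seq[String] = Seq(
    "age", "job", "marital", "education", "default", "housing", "loan",
    "contact", "month", "day_of_week", "duration", "campaign", "pday",
    "previous", "poutcome", "emp_var_rate", "cons_price_idx",
    "cons_conf_idx", "euribor3m", "nr_employed", "y")

  // First sample record. It is written with ' and converted to " only to
  // avoid heavy escaping in the string literal.
  val record: String =
    ("'58;''management'';''married'';''tertiary'';''no'';2143;''yes'';''no'';" +
      "''unknown'';5;''may'';261;1;-1;0;''unknown'';''no'''").replace('\'', '"')

  // A limit of -1 keeps trailing empty fields, if there are any.
  val fields: Array[String] = record.split(";", -1)

  def main(args: Array[String]): Unit = {
    println(s"columns in DDL:   ${ddlColumns.size}")
    println(s"fields in record: ${fields.length}")
    // Pair each column with its raw field; columns without a field show NULL.
    ddlColumns.zipAll(fields.toSeq, "<no column>", "NULL").foreach {
      case (column, value) => println(f"$column%-16s$value")
    }
  }
}
```

For this record the sketch reports 21 columns and 17 fields, so the last four columns receive no value.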

Also, in the DDL the data type of the first field is INT, but in the records the first field value is:

"58
"44
"33

Because of the leading ", the value cannot be cast to INT, so the field is set to NULL.

With the DDL and data you provided, the fields are assigned as follows:

age             "58
job             ""management""
marital         ""married""
education       ""tertiary""
default         ""no""
housing         2143
loan            ""yes""
contact         ""no""
month           ""unknown""
day_of_week     5
duration        ""may""
campaign        261
pday            1
previous        -1
poutcome        0
emp_var_rate    ""unknown""
cons_price_idx  ""no"""
cons_conf_idx   NULL
euribor3m       NULL
nr_employed     NULL
y               NULL

Note the NULL values in the last 4 columns: each record has 17 fields, while the DDL defines 21 columns.

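So, if this is not what you intended, you need to validate the data before proceeding.

For column age, if you need the INT type, clean the data to remove the unwanted " characters.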
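One way to do that cleanup is sketched below in plain Scala: split each line on `;` and strip every `"` from each field. It assumes that no field legitimately contains a `"` or a `;`, which holds for the sample rows but should be checked against the full file. The names `CleanRecord` and `clean` are hypothetical.

```scala
object CleanRecord {
  // Split on ';' (keeping trailing empty fields) and drop all double quotes.
  // Assumption: quotes never carry meaning and ';' never appears inside a field.
  def clean(line: String): Array[String] =
    line.split(";", -1).map(_.replace("\"", "").trim)

  def main(args: Array[String]): Unit = {
    // Second sample record, written with ' and converted to " for readability.
    val raw =
      ("'44;''technician'';''single'';''secondary'';''no'';29;''yes'';''no'';" +
        "''unknown'';5;''may'';151;1;-1;0;''unknown'';''no'''").replace('\'', '"')
    // Re-join with ';' so the cleaned line matches the table's field delimiter.
    println(clean(raw).mkString(";"))
  }
}
```

Writing the cleaned lines back out with the same `;` delimiter lets the first field be read as INT; the number of fields still has to match the columns in the DDL.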

Alternative approach

As a workaround, you can define age as STRING at first, then use a Spark transformation to parse the first field and convert it to INT.

import org.apache.spark.sql.functions._
// Drop the leading " from the raw age string, then parse the rest as an Int
val ageInINT = udf { (rawAge: String) =>
  Integer.parseInt(rawAge.substring(1))
}
df.withColumn("ageInINT", ageInINT(df("age"))).show

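Here df is the DataFrame created from the table defined by the Hive DDL, with column age typed as STRING.

Now you can perform operations on the new column ageInINT, which holds INTEGER values, rather than on column age.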
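The UDF body above throws if age is null, empty, or not numeric after the first character. A more defensive parsing function might look like the following sketch; `AgeParser` and `parseAge` are hypothetical names, and the function is plain Scala so it can be tried without a Spark session.

```scala
import scala.util.Try

object AgeParser {
  // Returns None for null, empty, or non-numeric input instead of throwing.
  def parseAge(raw: String): Option[Int] =
    Option(raw)                              // null becomes None
      .map(_.replace("\"", "").trim)         // strip quotes wherever they appear
      .flatMap(s => Try(s.toInt).toOption)   // failed parse becomes None

  def main(args: Array[String]): Unit = {
    println(parseAge("\"58"))       // Some(58)
    println(parseAge("58"))         // Some(58)
    println(parseAge(null))         // None
    println(parseAge("\"unknown"))  // None
  }
}
```

A function like this could be passed to `udf` in place of the lambda above. As far as I know Spark turns an `Option[Int]` result into a nullable integer column, but that is worth confirming on the Spark version in use.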

Answer 2

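Because your data contains a " just before the age, the value is treated as a string. In your code you defined the column as int, so the SQL parser looks for an integer value and you get null records. Change age int to age string and you will be able to see the result.

See the working example below, which uses Spark HiveContext.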

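import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// conf was not defined in the original snippet; the app name is an assumed placeholder
val conf = new SparkConf().setAppName("HiveLoadExample")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)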

sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()

Its output:

+---+----------------+
|age|             job|
+---+----------------+
|"58|  ""management""|
|"44|  ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+

Answer 1
0 accepted 2017-06-28 06:20:14

Answer 2
0 2017-06-28 06:58:30