
How do I load data correctly in Hive using spark?

I want to load data that looks like this:

"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""

My create table statement is:

sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")

When I run the statement:

sqlContext.sql("from dummy11 select age").show()

OR

sqlContext.sql("from dummy11 select y").show()

it returns NULL values instead of the correct values, though the other columns' values are visible.

So how do I correct this?

As you are using HiveQL syntax, you need to validate the input data before processing it.

In your data, some records have fewer columns than the number of columns defined in the DDL.

So, for those records, the trailing columns are set to NULL, as the row does not have enough values.

That's why the last column y has NULL values.

Also, in the DDL the first field's data type is INT, but in the records the first field's values are:

"58
"44
"33

Because of the leading ", the values cannot be type-cast to INT, so the field value is set to NULL.
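
You can reproduce this behavior directly (a minimal sketch, assuming the same sqlContext as in the examples below); Spark follows Hive cast semantics and returns NULL for an invalid cast instead of failing:

// A string starting with a quote character is not numeric, so the cast yields NULL.
sqlContext.sql("""select cast('"58' as int) as age""").show()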

As per the DDL and the data you provided, the values get set as follows:

age             "58
job             ""management""
marital         ""married""
education       ""tertiary""
default         ""no""
housing         2143
loan            ""yes""
contact         ""no""
month           ""unknown""
day_of_week     5
duration        ""may""
campaign        261
pday            1
previous        -1
poutcome        0
emp_var_rate    ""unknown""
cons_price_idx  ""no""
cons_conf_idx   NULL
euribor3m       NULL
nr_employed     NULL
y               NULL

Check the NULL values for the last four columns.

So, if that is not what you expect, you need to validate the data before proceeding; a quick check is sketched below.
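
For example, this sketch (assuming a SparkContext sc and that the raw file is readable as text; the path is the one used in the HiveContext example further down) counts the rows that do not have all 21 fields:

// Count rows whose field count differs from the 21 columns in the DDL.
// split with limit -1 keeps trailing empty fields.
val raw = sc.textFile("/user/skumar143/stack/")
raw.map(_.split(";", -1).length).filter(_ != 21).count()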

And for the column age, if you need it as INT, cleanse the data to remove the unwanted " character.
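
One way to cleanse it without editing the raw files (a sketch, assuming age is declared as STRING as in the workaround below, and df = sqlContext.table("dummy11")) is regexp_replace followed by a cast:

import org.apache.spark.sql.functions.regexp_replace

// Strip all double quotes from age, then cast the cleaned value to INT.
val cleaned = df.withColumn("age", regexp_replace(df("age"), "\"", "").cast("int"))
cleaned.show()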


WORKAROUND

As a workaround, you can define age as STRING to begin with, then use Spark transformations to parse the first field and convert it to INT:

import org.apache.spark.sql.functions._

// UDF that drops the leading " character and parses the rest as an integer.
val ageInINT = udf { (make: String) =>
  Integer.parseInt(make.substring(1))
}

df.withColumn("ageInINT", ageInINT(df("age"))).show
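
Note that Integer.parseInt throws on any malformed value and would fail the job; a more defensive variant (my addition, not part of the original answer) returns None, which Spark stores as NULL:

import scala.util.Try

// Returns None (stored as NULL) instead of throwing when parsing fails.
val ageInINTSafe = udf { (raw: String) =>
  Try(raw.replace("\"", "").trim.toInt).toOption
}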

Here, df is your DataFrame created by executing the Hive DDL with the column age declared as STRING.

Now you can operate on the new column ageInINT, which holds INTEGER values, rather than on the column age.

Since your data contains " just before the age, it is treated as a string. In your code you have defined it as int, so the SQL parser tries to read an integer value and you get a NULL record. Change age int to age string and you will be able to see the result.

Please see the working example below, which uses Spark's HiveContext.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

// A SparkConf is needed before creating the SparkContext (the app name is arbitrary).
val conf = new SparkConf().setAppName("HiveLoadExample")
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

// age is declared as STRING so the leading " does not break loading.
sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string) row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()

Its output:

+---+----------------+
|age|             job|
+---+----------------+
|"58|  ""management""|
|"44|  ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+
