How do I load data correctly in Hive using spark?

Question

I want to input data which looks as-

"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""

My create table statement is as-

sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")

When I run the statement-

sqlContext.sql("from dummy11 select age").show()

OR

sqlContext.sql("from dummy11 select y").show()

It returns NULL value instead of correct values, though other values are visible

So how do I correct this??

Answer 1

As you are using Hive QL syntax, you need to validate the input data before processing.

In your data, few records have lesser columns - than the actual columns defined in DDL.

So, for those records, the rest columns (from last) are set as NULL; as that row does not have enough values.

That's why, the last column y has values NULL .

Also, in DDL, first field's data type is INT ; but in record, first field values are:

"58
"44
"33

Because of " , the values are not type-casted to INT ; setting field value as NULL .

As per the DDL and data - you provided, values are getting set as:

age             "58
job             ""management""
marital         ""married""
education       ""tertiary""
default         ""no""
housing         2143
loan            ""yes""
contact         ""no""
month           ""unknown""
day_of_week     5
duration        ""may""
campaign        261
pday            1
previous        -1
poutcome        0
emp_var_rate    ""unknown""
cons_price_idx  ""no""
cons_price_idx  NULL
cons_conf_idx   NULL
euribor3m int   NULL
nr_employed     NULL
y               NULL

Check the NULL values for last 5 columns.

So, if that is not expected, you need to validate the data first before proceeding.

And for the column age , if you need it in INT type, cleanse the data to remove unwanted " character.

WORKAROUND

As workaround, you can define age as STRING at beginning, as use spark transformations to parse the first field and convert it to INT

import org.apache.spark.sql.functions._
val ageInINT = udf { (make: String) =>
  Integer.parseInt(make.substring(1))
}
df.withColumn("ageInINT", ageInINT(df("age"))).show

Here df is your dataframe created while executing the hive DDL with column age as sTRING .

Nnow, you can perform operation on new column ageInINT rather than column age with INTEGER values.

Answer 2

Since your data contains " just before the age, it is considered as string. In the code you have defined it as int therefore sql parser is trying to find the integer value and therefore you are getting the null record. Change the age int with age string and you will be able to see the result.

Please see below working example Using Spark HiveContext .

import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)

sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()

Its output:

+---+----------------+
|age|             job|
+---+----------------+
|"58|  ""management""|
|"44|  ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+

How do I load data correctly in Hive using spark?

Question

2 answers

solution1
0 ACCPTED 2017-06-28 06:20:14

solution2
0 2017-06-28 06:58:30

How do I load data correctly in Hive using spark?

Question

2 answers

solution1 0 ACCPTED 2017-06-28 06:20:14

solution2 0 2017-06-28 06:58:30

solution1
0 ACCPTED 2017-06-28 06:20:14

solution2
0 2017-06-28 06:58:30