I want to input data which looks as-
"58;""management"";""married"";""tertiary"";""no"";2143;""yes"";""no"";""unknown"";5;""may"";261;1;-1;0;""unknown"";""no"""
"44;""technician"";""single"";""secondary"";""no"";29;""yes"";""no"";""unknown"";5;""may"";151;1;-1;0;""unknown"";""no"""
"33;""entrepreneur"";""married"";""secondary"";""no"";2;""yes"";""yes"";""unknown"";5;""may"";76;1;-1;0;""unknown"";""no"""
"47;""blue-collar"";""married"";""unknown"";""no"";1506;""yes"";""no"";""unknown"";5;""may"";92;1;-1;0;""unknown"";""no"""
My create table statement is as-
sqlContext.sql("create table dummy11(age int, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';'")
When I run the statement-
sqlContext.sql("from dummy11 select age").show()
OR
sqlContext.sql("from dummy11 select y").show()
It returns NULL
value instead of correct values, though other values are visible
So how do I correct this??
As you are using Hive QL syntax, you need to validate the input data before processing.
In your data, few records have lesser columns - than the actual columns defined in DDL.
So, for those records, the rest columns (from last) are set as NULL; as that row does not have enough values.
That's why, the last column y
has values NULL
.
Also, in DDL, first field's data type is INT
; but in record, first field values are:
"58
"44
"33
Because of "
, the values are not type-casted to INT
; setting field value as NULL
.
As per the DDL and data - you provided, values are getting set as:
age "58
job ""management""
marital ""married""
education ""tertiary""
default ""no""
housing 2143
loan ""yes""
contact ""no""
month ""unknown""
day_of_week 5
duration ""may""
campaign 261
pday 1
previous -1
poutcome 0
emp_var_rate ""unknown""
cons_price_idx ""no""
cons_price_idx NULL
cons_conf_idx NULL
euribor3m int NULL
nr_employed NULL
y NULL
Check the NULL
values for last 5 columns.
So, if that is not expected, you need to validate the data first before proceeding.
And for the column age
, if you need it in INT
type, cleanse the data to remove unwanted "
character.
WORKAROUND
As workaround, you can define age
as STRING
at beginning, as use spark transformations to parse the first field and convert it to INT
import org.apache.spark.sql.functions._
val ageInINT = udf { (make: String) =>
Integer.parseInt(make.substring(1))
}
df.withColumn("ageInINT", ageInINT(df("age"))).show
Here df
is your dataframe created while executing the hive DDL with column age
as sTRING
.
Nnow, you can perform operation on new column ageInINT
rather than column age
with INTEGER
values.
Since your data contains "
just before the age, it is considered as string. In the code you have defined it as int
therefore sql parser is trying to find the integer value and therefore you are getting the null
record. Change the age int
with age string
and you will be able to see the result.
Please see below working example Using Spark HiveContext
.
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.types._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
sqlContext.sql("create external table dummy11(age string, job string, marital string, education string, default string, housing string, loan string, contact string, month string, day_of_week string, duration int, campaign int, pday int, previous int, poutcome string, emp_var_rate int, cons_price_idx int, cons_conf_idx int, euribor3m int, nr_employed int, y string)row format delimited fields terminated by ';' location '/user/skumar143/stack/'")
sqlContext.sql("select age, job from dummy11").show()
Its output:
+---+----------------+
|age| job|
+---+----------------+
|"58| ""management""|
|"44| ""technician""|
|"33|""entrepreneur""|
|"47| ""blue-collar""|
+---+----------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.