
Select data into Hadoop with Hive

I've created a table in Hive with the following command:

CREATE TABLE tweet_table(
    tweet STRING
)
ROW FORMAT
    DELIMITED
        FIELDS TERMINATED BY '\n'
        LINES TERMINATED BY '\n'

I insert some data with:

LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE tweet_table

data.txt:

data1
data2
data3data4
data5

The command select * from tweet_table returns:

data1
data2
data3data4
data5

But select tweet from tweet_table gives me:

java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:230)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:255)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:381)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:374)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:540)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializeMapRedWork(Utilities.java:542)
    at org.apache.hadoop.hive.ql.exec.Utilities.getMapRedWork(Utilities.java:222)
    ... 7 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec

It's as if the data were stored in the table correctly, but not in the tweet field. Why?

Testing against Apache Hive 1.2.1, it appears that this behavior no longer reproduces in exactly the same way. However, it's highly likely that the original problem was caused by using the same character ('\n') as both the field terminator and the line terminator in the CREATE TABLE statement:

CREATE TABLE tweet_table(
    tweet STRING
)
ROW FORMAT
    DELIMITED
        FIELDS TERMINATED BY '\n'
        LINES TERMINATED BY '\n'

This cannot yield predictable results, because you've said that '\n' can indicate either the end of a field or the end of a whole line.
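To see the ambiguity concretely, here is a minimal Python sketch of how a delimited-text reader behaves when the record terminator and the field terminator are the same character. This is a hypothetical model for illustration, not Hive's actual SerDe code:

```python
# Hypothetical model of delimited-text parsing (not Hive's real SerDe).
raw = "data1\ndata2\ndata3data4\ndata5\n"

# Step 1: split the file into records on the line terminator '\n'.
records = raw.rstrip("\n").split("\n")

# Step 2: split each record into fields on the field terminator,
# which here is ALSO '\n' -- so each record can only ever contain
# a single field; the terminator was consumed in step 1.
rows = [record.split("\n") for record in records]

print(rows)  # [['data1'], ['data2'], ['data3data4'], ['data5']]
```

Because the line split consumes every '\n' first, the field split never has anything to do: the two terminators compete for the same character, and the result depends entirely on which split runs first.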

This is what happens when I test against Apache Hive 1.2.1. The contents of data.txt are 3 rows of data, each row containing 2 columns, with the fields separated by a tab ('\t') and the lines separated by '\n'.

key1    value1
key2    value2
key3    value3

Let's test with the field terminator and the line terminator both set to '\n'.

hive> CREATE TABLE data_table(
    >     key STRING,
    >     value STRING
    > )
    > ROW FORMAT
    >     DELIMITED
    >         FIELDS TERMINATED BY '\n'
    >         LINES TERMINATED BY '\n';
OK
Time taken: 2.322 seconds
hive> LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE data_table;
Loading data to table default.data_table
Table default.data_table stats: [numFiles=1, totalSize=36]
OK
Time taken: 2.273 seconds
hive> SELECT * FROM data_table;
OK
key1    value1  NULL
key2    value2  NULL
key3    value3  NULL
Time taken: 1.387 seconds, Fetched: 3 row(s)
hive> SELECT key FROM data_table;
OK
key1    value1
key2    value2
key3    value3
Time taken: 1.254 seconds, Fetched: 3 row(s)
hive> SELECT value FROM data_table;
OK
NULL
NULL
NULL
Time taken: 1.384 seconds, Fetched: 3 row(s)

We can see that Hive interpreted each "key\tvalue" string as the key in the table definition, and assumed there was nothing specified for value. This is a valid interpretation, because the table definition stated that fields would be delimited by '\n', and there is no '\n' in the input until after both the key and the value.
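The same hypothetical parsing model reproduces the NULL column: with '\n' as the field terminator, the whole line "key1\tvalue1" becomes the first field, and the missing second field is padded with NULL, much as Hive pads short rows.

```python
# Hypothetical model (not Hive's code): fields are split on '\n',
# but the real separator in the data is '\t'.
line = "key1\tvalue1"
fields = line.split("\n")   # -> ['key1\tvalue1']: one field only

columns = ["key", "value"]
# Pad missing trailing fields with None, mirroring Hive's NULLs.
row = {col: (fields[i] if i < len(fields) else None)
       for i, col in enumerate(columns)}

print(row)  # {'key': 'key1\tvalue1', 'value': None}
```

This matches the output above: SELECT key returns the whole "key\tvalue" string, and SELECT value returns NULL for every row.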

Now let's repeat the same test with the field terminator set to '\t' and the line terminator set to '\n'.

hive> CREATE TABLE data_table(
    >     key STRING,
    >     value STRING
    > )
    > ROW FORMAT
    >     DELIMITED
    >         FIELDS TERMINATED BY '\t'
    >         LINES TERMINATED BY '\n';
OK
Time taken: 2.247 seconds
hive> LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE data_table;
Loading data to table default.data_table
Table default.data_table stats: [numFiles=1, totalSize=36]
OK
Time taken: 2.244 seconds
hive> SELECT * FROM data_table;
OK
key1    value1
key2    value2
key3    value3
Time taken: 1.308 seconds, Fetched: 3 row(s)
hive> SELECT key FROM data_table;
OK
key1
key2
key3
Time taken: 1.376 seconds, Fetched: 3 row(s)
hive> SELECT value FROM data_table;
OK
value1
value2
value3
Time taken: 1.281 seconds, Fetched: 3 row(s)

This time we see the expected results.
