This continues from the question "How can I make a Hive table from a .csv file which has one column with fields delimited by semicolon ;".
Some of the titles/publishers in my csv file contain "&amp;", and the rows that contain them are misread because they are split prematurely on the semicolon in the ampersand code as well as at the end of each field.
How can I modify this code:
CREATE TABLE books (ISBN STRING, Title STRING, Author STRING, Year STRING, Publisher STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\;";
LOAD DATA INPATH '/path/to/my/datafile' INTO TABLE books;
so it does not do this?
An example problematic row in my csv file would be:
0743403843;"Decipher";"Stel Pavlou";"2002";"Simon &amp; Schuster (Trade Division)"
The publisher column is not read correctly.
I understand that I could sanitize the csv beforehand by removing the (&amp;), but could you tell me how I could do it in Hive or another Hadoop tool?
This posting discusses a similar problem and solution when using CSV and quoted strings contain commas: http://dev.bizo.com/2010/11/csv-and-hive.html
It looks like the CSV-Serde they link to can be configured for an alternate separator, so it should work for your format as well.
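A minimal sketch of that idea, assuming a Hive version that bundles org.apache.hadoop.hive.serde2.OpenCSVSerde (Hive 0.14 and later; on older versions the CSV-Serde jar from the linked post would have to be added first and referenced by its own class name). The table name books_csv is made up for illustration, and the "\;" escaping simply mirrors what the question already uses in the Hive CLI:

```sql
-- Sketch only: books_csv is a hypothetical table name.
-- OpenCSVSerde reads every column as STRING, which matches the original schema.
CREATE TABLE books_csv (
  isbn STRING,
  title STRING,
  author STRING,
  year STRING,
  publisher STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\;",  -- fields are separated by semicolons
  "quoteChar"     = "\"",  -- fields are wrapped in double quotes
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

-- Same load statement as in the question, pointed at the new table.
LOAD DATA INPATH '/path/to/my/datafile' INTO TABLE books_csv;
```

Because the semicolon of the ampersand code sits inside a quoted field, a quote-aware SerDe should keep it in the publisher column rather than treating it as a separator. The stored value would still contain the literal entity text, so it would need to be replaced separately if a plain ampersand is wanted.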
Can you try this?
hive> CREATE TABLE test_regex(
> isbn STRING,
> title STRING,
> author STRING,
> year STRING,
> publisher STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" =
> "(.*)\;\"(.*)\"\;\"(.*)\"\;\"(.*)\"\;\"(.*)\"",
> "output.format.string" = "%1$s %2$s %3$s %4$s %5$s")
> STORED AS TEXTFILE;
OK
Time taken: 4.139 seconds
hive> load data local inpath 'input.csv' overwrite into table test_regex;
OK
Time taken: 0.393 seconds
hive> select isbn,publisher from test_regex;
ISBN Publisher
0002005018 HarperFlamingo Canada
0399135782 Putnam Pub Group
0743403843 Simon & Schuster (Trade Division)
Time taken: 4.522 seconds
hive> select * from test_regex;
OK
ISBN Title Author Year Publisher
0002005018 Clara Callan Richard Bruce Wright 2001 HarperFlamingo Canada
0399135782 The Kitchen God's Wife Amy Tan 1991 Putnam Pub Group
0743403843 Decipher Stel Pavlou 2002 Simon & Schuster (Trade Division)
Time taken: 0.253 seconds