简体   繁体   中英

How can I ignore the semicolon ; in the & when I am creating a Hive table from a .csv file

In continuation from this question How can I make a Hive table from a .csv file which has one column with fields delimiited by semicolon ;

Some of the titles/publishers in my csv file have "&amp"; in them and the rows which contain them are being misread because they are getting prematurely split on the semicolon in the ampersand code and at the end of each field.

How can I modify this code:

CREATE TABLE books (ISBN STRING, Title STRING, Author STRING, Year STRING, Publisher STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY "\;";
LOAD DATA INPATH '/path/to/my/datafile' INTO TABLE books;

so it does not do this?

An example problematic row in my csv file would be:

 0743403843;"Decipher";"Stel Pavlou";"2002";"Simon & Schuster (Trade Division)"

With the publisher column not being read right.

I understand that I could sanatize the csv before hand removing the (&amp); but could tell me how I could do it in Hive or another tool of Hadoop?

This posting discusses a similar problem and solution when using CSV and quoted strings contain commas: http://dev.bizo.com/2010/11/csv-and-hive.html

It looks like the CSV-Serde they link to can be configured for an alternate separator, so it should work for your format as well.

Can you try this?

hive> CREATE TABLE test_regex(
    >     isbn STRING,
    >     title STRING,
    >     author STRING,
    >     year STRING,
    >     publisher STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
    >     WITH SERDEPROPERTIES ("input.regex" = 
    >     "(.*)\;\"(.*)\"\;\"(.*)\"\;\"(.*)\"\;\"(.*)\"",
    >     "output.format.string" = "%1$s %2$s %3$s %4$s %5s")
    >     STORED AS TEXTFILE;
OK
Time taken: 4.139 seconds

hive> load data local inpath 'input.csv' overwrite into table test_regex;
OK
Time taken: 0.393 seconds


hive> select isbn,publisher from test_regex;
ISBN    Publisher
0002005018  HarperFlamingo Canada
0399135782  Putnam Pub Group
0743403843  Simon & Schuster (Trade Division)
Time taken: 4.522 seconds

hive> select *from test_regex;
OK
ISBN    Title   Author  Year    Publisher
0002005018  Clara Callan    Richard Bruce Wright    2001    HarperFlamingo Canada
0399135782  The Kitchen God's Wife  Amy Tan 1991    Putnam Pub Group
0743403843  Decipher    Stel Pavlou 2002    Simon & Schuster (Trade Division)
Time taken: 0.253 seconds

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM