
How to read a CSV file separated by semicolons in Pig

How can I read a CSV file whose fields are separated by semicolons in Pig? The data itself can also contain semicolons.

Example input line: "Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"

Expected output: each of these fields should be loaded into its own column. In particular, the "Resume" column should contain "Resume contains special char like ;,$#$@^".


Note: I have tried PigStorage and CSVLoader, but I still can't make it work because the delimiter can also appear in the data.

You can use piggybank.jar to read such files.

First register piggybank.jar in your Pig script; you can then use its functions within the script. The following snippet is untested, but it should do the trick:

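-- Make the piggybank loaders available to this script.
REGISTER 'piggybank-0.12.0.jar';

-- Bind the delimiter and the multiline option in the DEFINE so the alias carries them.
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage(';', 'YES_MULTILINE');

-- The sample line has five fields, so the schema lists five columns.
input_lines = LOAD 'PATH/TO/FILES' USING CSVExcelStorage() AS (name:chararray, age:int, address:chararray, details:chararray, rating:chararray);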

For more details, refer to this and this.
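To see why a quote-aware loader helps here, the following is a minimal sketch in Python rather than Pig. It uses Python's standard `csv` module as a stand-in for a quote-aware loader; it does not show how CSVExcelStorage is implemented, and the helper name `parse_semicolon_csv` is made up for illustration.

```python
import csv
import io

# The sample line from the question, followed by a newline.
SAMPLE = '"Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"\n'

def parse_semicolon_csv(text):
    """Parse semicolon-delimited text, keeping each double-quoted field as one value."""
    reader = csv.reader(io.StringIO(text), delimiter=';', quotechar='"')
    return [row for row in reader]

rows = parse_semicolon_csv(SAMPLE)
print(len(rows[0]))  # number of fields found on the line
print(rows[0][3])    # the resume field, with its embedded semicolon intact
```

Because the parser tracks the quotes, the semicolon inside the resume text is treated as data and not as a field separator.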

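Try this solution. It splits each line on every semicolon with PigStorage and then rejoins the two pieces of the resume field with CONCAT, so it only works when the resume contains exactly one semicolon.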

A = load 'pigconcat' using PigStorage(';') as (a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray);

B = foreach A GENERATE a,b,c,CONCAT(CONCAT(d,';'),e) as (resume:chararray),f;

C = foreach B GENERATE resume;

dump C;
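The split-and-rejoin idea depends on how many semicolons the resume holds. The sketch below, in Python rather than Pig, splits the two sample lines used on this page on every semicolon and rejoins the fourth and fifth pieces, mirroring the CONCAT step. The function names are hypothetical.

```python
# The two sample lines that appear on this page.
LINES = [
    '"Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"',
    '"Name1";"Age1";"Address1";"Resume;$# contains ;@^ special char like ;,$#$@^";"Rating"',
]

def naive_split(line):
    """Split on every semicolon, ignoring the quotes entirely."""
    return line.split(';')

def rejoin_resume(fields):
    """Rejoin the fourth and fifth pieces, as the CONCAT(CONCAT(d,';'),e) step does."""
    return fields[3] + ';' + fields[4]

for line in LINES:
    fields = naive_split(line)
    print(len(fields), rejoin_resume(fields))
```

The first line splits into six pieces and the rejoined resume is complete. The second line splits into eight pieces, so rejoining only two of them returns a truncated resume.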

If the delimiter can also appear in the input data, my suggestion is to parse each line with a regular expression instead of relying on a loader (PigStorage, CSVStorage). This gives you more flexibility and control over the input. Many people avoid regular expressions because they look complex, but this kind of problem is easy to solve with one.

Example

Input:

"Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"
"Name1";"Age1";"Address1";"Resume;$# contains ;@^ special char like ;,$#$@^";"Rating"

PigScript:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'("\\w+");("\\w+");("\\w+");("[\\w+\\s;$,#@^]+");("\\w+")')) AS(name,age,address,resume,rating);
C = FOREACH B GENERATE resume;
DUMP C;

Output:

("Resume contains special char like ;,$#$@^")
("Resume;$# contains ;@^ special char like ;,$#$@^")

Note:
This approach works no matter how many semicolons or other special characters appear in the resume column, as long as every character in that column is covered by the character class in the pattern and the other columns contain only word characters. The script prints only the resume column; if you need other columns, add them to relation C.
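The pattern can be checked outside Pig before running the script. The sketch below applies the same regular expression in Python, assuming that REGEX_EXTRACT_ALL requires the whole line to match (which is why `fullmatch` is used). The third line, with spaces in the address, is a made-up input added to show where the pattern stops matching.

```python
import re

# Same pattern as in the Pig script, written as a Python raw string
# (so the backslashes are not doubled here).
PATTERN = re.compile(r'("\w+");("\w+");("\w+");("[\w+\s;$,#@^]+");("\w+")')

def extract_fields(line):
    """Return the five captured fields, or None when the whole line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None

lines = [
    '"Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"',
    '"Name1";"Age1";"Address1";"Resume;$# contains ;@^ special char like ;,$#$@^";"Rating"',
    '"Ann";"30";"12 Main St";"ok";"5"',  # hypothetical line: address contains spaces
]

for line in lines:
    fields = extract_fields(line)
    print(fields[3] if fields else None)
```

The first two lines yield the full resume field, quotes included. The third yields None because `"\w+"` does not allow spaces; widening the character classes for the other columns would be one way to handle such data.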

