
How to read a CSV file separated by semicolons in Pig

How can I read a CSV file separated by semicolons in Pig? The data itself can also contain semicolons.

Example input line: "Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"

Output: each of these fields should be loaded into its own column. In particular, the "Resume" column should contain "Resume contains special char like ;,$#$@^".

Note: I have tried PigStorage and CSVLoader, but I still can't make it work because the delimiter can also appear in the data.

You can use piggybank.jar to read such files.

First, register piggybank.jar in your Pig script; you can then use its functions in the script. The following is a code snippet (I haven't tested this, but I'm sure it will do the trick):

REGISTER 'piggybank-0.12.0.jar';

-- Pass the delimiter and multiline option in the DEFINE so that the alias carries them.
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage(';', 'YES_MULTILINE');

input_lines = LOAD 'PATH/TO/FILES' USING CSVExcelStorage() AS (name:chararray, age:int, address:chararray, details:chararray);

For more details, refer to this and this.
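As a minimal sketch under the same assumptions (the jar name and the PATH/TO/FILES placeholder come from the snippet above; the relation names `rows` and `resumes` are made up here), the loader can also be referenced by its fully qualified class name. All five columns of the sample line are declared as chararray so that a header row such as "Age" is not cast to a null integer. Like the snippet above, this has not been run against real data.

```pig
-- Sketch only: the jar version and the input path are placeholders from the answer above.
REGISTER 'piggybank-0.12.0.jar';

-- Semicolon as the field delimiter; YES_MULTILINE allows quoted fields to span lines.
rows = LOAD 'PATH/TO/FILES'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(';', 'YES_MULTILINE')
       AS (name:chararray, age:chararray, address:chararray, resume:chararray, rating:chararray);

-- Project only the column that contains embedded semicolons to check how it was parsed.
resumes = FOREACH rows GENERATE resume;
DUMP resumes;
```

If the age column is needed as a number, it can be cast in a later FOREACH once any header row has been filtered out.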

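Try this solution.

-- Split on every semicolon, then stitch the two halves of the resume field back together.
-- This assumes the resume field contains exactly one semicolon, so each line splits into six parts.
A = load 'pigconcat' using PigStorage(';') as (a:chararray,b:chararray,c:chararray,d:chararray,e:chararray,f:chararray);
B = foreach A GENERATE a,b,c,CONCAT(CONCAT(d,';'),e) as (resume:chararray),f;
C = foreach B GENERATE resume;
dump C;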

If the delimiter is also present in the input data, my suggestion is to use a regex instead of a loader-based technique (PigStorage, CSVStorage). This gives you more flexibility and control over your input. I agree that many people avoid regexes because of their complexity, but this kind of problem can be solved easily with one.

Example

Input:

"Name";"Age";"Address";"Resume contains special char like ;,$#$@^";"Rating"
"Name1";"Age1";"Address1";"Resume;$# contains ;@^ special char like ;,$#$@^";"Rating"

Pig script:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'("\\w+");("\\w+");("\\w+");("[\\w+\\s;$,#@^]+");("\\w+")')) AS(name,age,address,resume,rating);
C = FOREACH B GENERATE resume;
DUMP C;

Output:

("Resume contains special char like ;,$#$@^")
("Resume;$# contains ;@^ special char like ;,$#$@^")

Note:
This solution is fairly generic: it works regardless of how many special characters appear in the resume column, as long as they are among those listed in the regex character class. The script prints only the resume column; if you need other columns, include them in relation C.
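A looser variant of the script above, offered as a sketch rather than a tested solution: instead of enumerating the allowed special characters, each group matches any run of characters other than a double quote. The quotes sit outside the capture groups, so they are dropped from the extracted values. It assumes that every field is double-quoted and that no field contains an embedded double quote. The relation names are illustrative.

```pig
-- Sketch only: assumes five double-quoted fields per line and no embedded double quotes.
A = LOAD 'input' AS (line:chararray);

-- [^"]* matches anything except a double quote, so semicolons inside a field are kept.
-- REGEX_EXTRACT_ALL must match the whole line; lines that do not match yield nulls.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '"([^"]*)";"([^"]*)";"([^"]*)";"([^"]*)";"([^"]*)"')) AS (name, age, address, resume, rating);

-- Keep several columns this time, not just resume.
C = FOREACH B GENERATE name, resume, rating;
DUMP C;
```

The trade-off is that this pattern is less strict about field contents than the original, so malformed lines are more likely to be accepted.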
