
Apache Pig outputs null values when loading with the int datatype

I am working with pig-0.16.0 and trying to join two tab-delimited files (.tsv) using a Pig script. Some of the columns hold integers, so I am trying to load them as int. But whichever columns I declare as int are not loaded with data and show up as empty. My join was not outputting any result, so I took a step back and found that the problem occurs at the loading step. Here is part of my Pig script:

REGISTER /usr/local/pig/lib/piggybank.jar;
-- $0 = streaminputs/forum_node.tsv
-- $1 = streaminputs/forum_users.tsv
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray);

LUFN = LIMIT u_f_n 10;

STORE LUFN INTO 'pigout/LN';

u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray);

LUFUU = LIMIT u_f_u 10;

STORE LUFUU INTO 'pigout/LU';

I tried using long, but the issue remains; only chararray seems to work here. So, what could be the problem?

Snippets from the two input .tsv files:

forum_nodes.tsv:

"id"    "title" "tagnames"  "author_id" "body"  "node_type" "parent_id" "abs_parent_id" "added_at"  "score" "state_string"  "last_edited_id"    "last_activity_by_id"   "last_activity_at"  "active_revision_id"    "extra" "extra_ref_id"  "extra_count"   "marked"
"5339"  "Whether pdf of Unit and Homework is available?"    "cs101 pdf" "100000458" ""  "question"  "\N"    "\N"    "2012-02-25 08:09:06.787181+00" "1" ""  "\N"    "100000921" "2012-02-25 08:11:01.623548+00" "6922"  "\N"    "\N"    "204"   "f"

forum_users.tsv:

"user_ptr_id"   "reputation"    "gold"  "silver"    "bronze"
"100006402" "18"    "0" "0" "0"
"100022094" "6354"  "4" "12"    "50"
"100018705" "76"    "0" "3" "4"
"100021176" "213"   "0" "1" "5"
"100045508" "505"   "0" "1" "5"

You need to strip the double quotes so that Pig can parse the field as an int; otherwise the cast fails and the field comes out blank (null). You should use CSVLoader or CSVExcelStorage; see my tests:

Sample file:

"1","test"

Test 1 - Using CSVLoader:

You can use CSVLoader or CSVExcelStorage if you want to ignore the quotes.

Pig commands:

register '/usr/lib/pig/piggybank.jar' ;
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray);

Output:

(1,test)
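
The answer names CSVExcelStorage but only shows CSVLoader. A minimal sketch of the same test with CSVExcelStorage, assuming the piggybank jar sits at the path used in Test 1 and using the fully qualified class name instead of a DEFINE alias:

```pig
register '/usr/lib/pig/piggybank.jar';
-- CSVExcelStorage strips the enclosing double quotes while parsing,
-- so "1" reaches the schema cast as 1 and can be read as an int.
file1 = load 'file1.txt'
        using org.apache.pig.piggybank.storage.CSVExcelStorage(',')
        as (f1:int, f2:chararray);
dump file1;
```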

Test 2 - Replacing double quotes:

Pig commands:

file1 = load 'file1.txt' using PigStorage(',');
data  = foreach file1 generate REPLACE($0,'\\"','') as (f1:int) ,$1 as (f2:chararray);

Output:

(1,"test")
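
Note that Test 2 leaves the quotes around the second field. A sketch that strips them from both columns and casts the first one explicitly, assuming the same file1.txt; the (chararray) and (int) casts and the plain '"' pattern are additions here, not part of the original test:

```pig
file1 = load 'file1.txt' using PigStorage(',');
-- Without a schema the fields are bytearrays, so cast to chararray first,
-- remove every double quote, then cast the cleaned first field to int.
data  = foreach file1 generate
            (int) REPLACE((chararray) $0, '"', '') as f1,
            REPLACE((chararray) $1, '"', '') as f2;
dump data;
```

This only suits fields that never contain the delimiter or a literal quote; for anything less regular, a CSV-aware loader is the safer choice.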

Test 3 - Using the data as it is:

Pig commands:

file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray);

Output:

(,"test")
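
Applied to the tab-delimited files from the question, the CSVExcelStorage route might look like the sketch below. It assumes the piggybank path from the question's script and the four-argument constructor (delimiter, multiline handling, end-of-line style, header handling). Declaring reputation, gold, silver and bronze as int is an illustrative change; the question loads them as chararray.

```pig
REGISTER /usr/local/pig/lib/piggybank.jar;

-- Arguments: field delimiter, multiline handling, end-of-line style, header handling.
-- SKIP_INPUT_HEADER drops the first line, which holds the quoted column names.
u_f_u = LOAD '$file2'
        USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
        AS (author_id: long, reputation: int, gold: int, silver: int, bronze: int);

LUFUU = LIMIT u_f_u 10;
DUMP LUFUU;
```

The same loader should work for forum_nodes.tsv with its own schema. Values such as "\N" there would presumably still load as null in numeric columns, since they are not numbers, and 'YES_MULTILINE' may be needed if the body field contains line breaks.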
