Read files with a pattern in Pig

I have a scenario where I am loading 40 files with different patterns from a directory into Hive tables using HCatStorer.

Directory : opt/inputfolder/ 
Input Files Pattern :

inp1*.log,
inp2*.log,
    .....
inp39*.log,
inp40*.log.

I have written a Pig script which reads all the files matching the 40 patterns.

But my problem is that these 40 files are mandatory, yet I may not receive some of them. In that case, I get an exception stating:

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
           Input Pattern opt/ip_files/inp16*.log matches 0 files

Is there any way to handle this exception?

I want to read the remaining 39 file patterns even though this file is not present.

What if my source file names are strings (i.e. banana_2014012.log, orange_2014012.log, apple_2014012.log)?

The following is my approach for loading data from these files into Hive tables using HCatStorer.

*** Pseudo code ***
banana_src = LOAD 'banana_*.log' using PigStorage;
......
Store banana_src into BANANA using HCatStorer;

apple_src = LOAD 'apple_*.log' using PigStorage;
......
Store apple_src into APPLE using HCatStorer;

orange_src = LOAD 'orange_*.log' using PigStorage;
......
Store orange_src into ORANGE using HCatStorer;

If any of the sources has no files, this Pig script throws an error saying the input pattern matches 0 files, and the Pig script ends in the FAILED state. Even if one source file is not available, I want my script to load the other tables without failing the job.

Thanks.

If you load inp1*.log, it also matches inp16*.log (if the file is present), so why are you loading inp16*.log again separately?

Based on the above input, I feel the condition below is sufficient for you:
        LOAD 'opt/ip_files/inp[1-9]*.log'

Please let me know if you are trying something different.
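
To see which names a bracket glob like this picks up, here is a small local sketch. It assumes bash and a scratch directory with made-up file names (inp1_a.log and the others are not from the question). Hadoop's glob rules for `*` and `[1-9]` are similar to the shell's, but this only exercises the shell, so treat it as an approximation.

```shell
#!/bin/bash
# Sketch: list the file names that the glob inp[1-9]*.log matches.
# The scratch directory and the file names are invented for illustration.
demo_dir=$(mktemp -d)
touch "$demo_dir/inp1_a.log" "$demo_dir/inp16_a.log" \
      "$demo_dir/inp40_a.log" "$demo_dir/other.log"

matched=""
for f in "$demo_dir"/inp[1-9]*.log; do
    # When nothing matches, the glob stays literal, so skip non-existent paths.
    [ -e "$f" ] || continue
    matched="$matched ${f##*/}"   # keep only the file name
done
echo "Matched:$matched"

rm -rf "$demo_dir"
```

The single pattern covers inp1 through inp40 style names because `[1-9]` only constrains the first character after `inp`; `other.log` is left out.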

UPDATE:
I have one more option, but I am not sure whether it works for you.
1. Split your Pig script into three parts, say banana.pig, apple.pig and orange.pig; each script has its own logic.
2. Write a shell script to check for the existence of each file pattern.
3. If the files are present, call the corresponding Pig script using the pig -param option; otherwise don't call it.
   With this option, if the files are not present, that particular Pig script will not be triggered.

Shellscript: test.sh
#!/bin/bash

BANANA_FILES="opt/ip_files/banana_*.log"
APPLE_FILES="opt/ip_files/apple_*.log"
ORANGE_FILES="opt/ip_files/orange_*.log"

if ls $BANANA_FILES > /dev/null 2>&1
then
    echo "Banana File Found"
    pig -x local -param PIG_BANANA_INPUT_FILES="$BANANA_FILES" -f banana.pig
else
    echo "No Banana files found"
fi

if ls $APPLE_FILES > /dev/null 2>&1
then
    echo "Apple File Found"
    pig -x local -param PIG_APPLE_INPUT_FILES="$APPLE_FILES" -f apple.pig
else
    echo "No APPLE files found"
fi

if ls $ORANGE_FILES > /dev/null 2>&1
then
    echo "Orange File Found"
    pig -x local -param PIG_ORANGE_INPUT_FILES="$ORANGE_FILES" -f orange.pig
else
    echo "No Orange files found"
fi
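
The `ls` checks above look at the local file system, which fits `pig -x local`. If the input lives on HDFS, a similar check could go through `hadoop fs -ls`, which, as far as I know, exits with a non-zero status when the glob matches nothing. A minimal sketch under that assumption; `hdfs_has_match` and `HDFS_LS` are names made up for illustration:

```shell
#!/bin/bash
# Sketch: existence check against HDFS instead of the local disk.
# HDFS_LS is a hypothetical override hook; by default it runs `hadoop fs -ls`.
HDFS_LS="${HDFS_LS:-hadoop fs -ls}"

hdfs_has_match() {
    # $1 is quoted so the local shell passes the glob through unexpanded
    # and the listing command resolves it itself.
    $HDFS_LS "$1" > /dev/null 2>&1
}

# Example use (commented out because it needs a Hadoop client on the PATH):
# if hdfs_has_match "opt/ip_files/banana_*.log"; then
#     pig -param PIG_BANANA_INPUT_FILES="opt/ip_files/banana_*.log" -f banana.pig
# fi
```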


PigScript: banana.pig
banana_src = LOAD '$PIG_BANANA_INPUT_FILES' using PigStorage;
DUMP banana_src;

PigScript: apple.pig
apple_src = LOAD '$PIG_APPLE_INPUT_FILES' using PigStorage;
DUMP apple_src;

PigScript: orange.pig
orange_src = LOAD '$PIG_ORANGE_INPUT_FILES' using PigStorage;
DUMP orange_src;

Output1: All the three files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
Apple File Found
(test1,test2)
Orange File Found
(13,4,5)

Output2: Only banana files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
No APPLE files found
No Orange files found
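
With 40 patterns, one if-block per source gets long. The same check can be sketched as a loop. `INPUT_DIR`, `SOURCES`, `RUN_PIG`, `has_match` and `run_sources` are names invented here, and the sketch assumes one <name>.pig script per source that takes a PIG_<NAME>_INPUT_FILES parameter, as in the scripts above.

```shell
#!/bin/bash
# Sketch: one loop instead of one if-block per source.
# INPUT_DIR, SOURCES and RUN_PIG are illustrative names, not from test.sh.
INPUT_DIR="${INPUT_DIR:-opt/ip_files}"
SOURCES="${SOURCES:-banana apple orange}"
RUN_PIG="${RUN_PIG:-pig}"   # set RUN_PIG=echo for a dry run

has_match() {
    # Succeeds if at least one existing local file matches the glob in $1.
    # $1 is left unquoted on purpose so the shell expands it
    # (this assumes the path contains no spaces).
    for f in $1; do
        [ -e "$f" ] && return 0
    done
    return 1
}

run_sources() {
    for name in $SOURCES; do
        pattern="$INPUT_DIR/${name}_*.log"
        upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
        if has_match "$pattern"; then
            echo "$name files found"
            # The pattern is passed unexpanded; Pig resolves the glob itself.
            $RUN_PIG -x local -param "PIG_${upper}_INPUT_FILES=$pattern" -f "$name.pig"
        else
            echo "No $name files found"
        fi
    done
}

run_sources
```

Setting RUN_PIG=echo prints the pig command lines instead of running them, which can help when checking the parameter names before a real run.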
