
Read files with pattern in PIG

I have a situation where I load 40 files with different patterns from a directory into Hive tables using HCatStorer.

Directory : opt/inputfolder/ 
Input Files Pattern :

inp1*.log,
inp2*.log,
    .....
inp39*.log,
inp40*.log.

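I wrote a Pig script that reads all the files for the 40 patterns.

My problem is that these 40 files are not mandatory, and I may not receive some of them. In that case I get an exception stating: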

Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
           Input Pattern opt/ip_files/inp16*.log matches 0 files

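Is there any way to handle this exception?

I want to read the remaining 39 file patterns even if this one file is not present.

What if my source files are named with strings (for example banana_2014012.log, orange_2014012.log, apple_2014012.log)?

Below is my approach to loading the data from these files into Hive tables using HCatStorer.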

*** Pseudo code ****
banana_src = LOAD 'banana_*.log' using PigStorage;
......
Store banana_src into 'BANANA' using HCatStorer;

apple_src = LOAD 'apple_*.log' using PigStorage;
......
Store apple_src into 'APPLE' using HCatStorer;

orange_src = LOAD 'orange_*.log' using PigStorage;
......
Store orange_src into 'ORANGE' using HCatStorer;

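If any of the sources has no files, this Pig script throws an error saying the pattern matches 0 files, and the script ends up in FAILED state. Even when one source file is unavailable, I want my script to load the other tables without failing the job.

Thanks.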

 If you load inp1*.log, it also matches inp16*.log (if the file is present), so why are you
 loading inp16*.log again separately?

 Based on the input above, I think the following pattern is sufficient for you:
        LOAD 'opt/ip_files/inp[1-9]*.log'

Please let me know if you are trying to do something different.
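
To see how these patterns overlap, the matching can be tried on a local directory. This is only a minimal sketch: it assumes that plain shell globbing behaves like Hadoop's path globbing for `*` and `[1-9]` (the two are similar but not identical), and `count_matches` is a hypothetical helper, not part of Pig or Hadoop.

```shell
# count_matches DIR PATTERN
# Prints how many entries in DIR match the glob PATTERN (local filesystem only).
count_matches() {
    (
        cd "$1" || exit 1
        n=0
        # $2 is left unquoted on purpose so the shell expands the glob.
        for f in $2; do
            # An unmatched glob stays as the literal pattern, so confirm the path exists.
            [ -e "$f" ] && n=$((n + 1))
        done
        echo "$n"
    )
}
```

For example, in a directory holding inp1_a.log and inp16_b.log, `count_matches . 'inp1*.log'` counts both files, which is why loading inp16*.log separately reads the same data twice.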

UPDATE:
I have one more option, but I am not sure whether it works for you.
1. Split your Pig script into three parts, say banana.pig, apple.pig and orange.pig, so that each script has its own logic.
2. Write a shell script that checks whether any file exists for each file pattern.
3. If the files are present, call the corresponding Pig script using the pig -param option; otherwise don't call it.
   With this option, if the files are not present, that particular Pig script will not be triggered.

Shellscript: test.sh
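#!/bin/bash
# Run each Pig script only if its input pattern matches at least one file.
# Note: ls checks the local filesystem, which fits the "pig -x local" runs below.

BANANA_FILES="opt/ip_files/banana_*.log"
APPLE_FILES="opt/ip_files/apple_*.log"
ORANGE_FILES="opt/ip_files/orange_*.log"

# The variable is left unquoted so the shell expands the glob; ls fails if nothing matches.
if ls $BANANA_FILES > /dev/null 2>&1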
then
    echo "Banana File Found"
    pig -x local -param PIG_BANANA_INPUT_FILES="$BANANA_FILES" -f banana.pig
else
    echo "No Banana files found"
fi

if ls $APPLE_FILES > /dev/null 2>&1
then
    echo "Apple File Found"
    pig -x local -param PIG_APPLE_INPUT_FILES="$APPLE_FILES" -f apple.pig
else
    echo "No APPLE files found"
fi

if ls $ORANGE_FILES > /dev/null 2>&1
then
    echo "Orange File Found"
    pig -x local -param PIG_ORANGE_INPUT_FILES="$ORANGE_FILES" -f orange.pig
else
    echo "No Orange files found"
fi


PigScript: banana.pig
banana_src = LOAD '$PIG_BANANA_INPUT_FILES' using PigStorage;
DUMP banana_src;

PigScript: apple.pig
apple_src = LOAD '$PIG_APPLE_INPUT_FILES' using PigStorage;
DUMP apple_src;

PigScript: orange.pig
orange_src = LOAD '$PIG_ORANGE_INPUT_FILES' using PigStorage;
DUMP orange_src;

Output1: All the three files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
Apple File Found
(test1,test2)
Orange File Found
(13,4,5)

Output2: Only banana files are present
$ ./test.sh 
Banana File Found
(1,2,3,4,5)
(a,b,c,d,e)
No APPLE files found
No Orange files found
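
With 40 patterns, repeating the if/else block for each one gets long. The same check could be folded into a function, roughly as sketched below. The names `has_files`, `run_if_present` and the `PIG_CMD` override are hypothetical and not part of the script above; the sketch also assumes every source follows the <name>_*.log naming and has a matching <name>.pig script, and that paths contain no spaces.

```shell
#!/bin/bash
# Hypothetical override: set PIG_CMD=echo to print the pig command instead of running it.
PIG_CMD=${PIG_CMD:-pig}

# has_files PATTERN
# Succeeds if the glob matches at least one file on the local filesystem.
has_files() {
    ls $1 > /dev/null 2>&1   # unquoted on purpose so the glob expands
}

# run_if_present NAME DIR
# Runs NAME.pig with PIG_<NAME>_INPUT_FILES set, but only if DIR/NAME_*.log matches.
run_if_present() {
    local name=$1 dir=$2
    local pattern="$dir/${name}_*.log"
    local param
    param="PIG_$(echo "$name" | tr '[:lower:]' '[:upper:]')_INPUT_FILES"
    if has_files "$pattern"; then
        echo "$name files found"
        # The pattern is passed quoted so that Pig, not the shell, expands it.
        $PIG_CMD -x local -param "$param=$pattern" -f "$name.pig"
    else
        echo "No $name files found"
    fi
}
```

It could then be driven by a loop such as `for src in banana apple orange; do run_if_present "$src" opt/ip_files; done`. As in test.sh, the check looks at the local filesystem only; when the inputs live in HDFS the existence check would have to query HDFS instead (for instance through `hadoop fs -ls`), which this sketch does not cover.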
