
How to execute a Pig script against multiple files within a directory

Seems simple and I can easily do this in another language but I'm struggling to find the answer using Pig. I can have one or more log files dumped to a directory daily. I want to iteratively call each file and pass it to a Pig script for processing. I'm aware of parameter substitution but how do I loop through the files and pass/call my Pig script? Thanks.

You can use a wildcard directly in Pig:

a = load '/user/hadoop/test/*.csv' using PigStorage(',');
dump a;

If your files are stored on your local drive, you need to upload them to HDFS first:

hadoop fs -put /localdir/*.csv /user/hadoop/test

If you need to pass a specific list of files to Pig, you can write a bash script that builds the list and then calls the pig command. For example, to run your Pig script against the 5 latest files named "httpd-.log", create a .sh file:

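#!/bin/bash
set -e
HDFS_PATH=/user/hadoop/test
# Sort by modification date and time (columns 6-7), keep the 5 newest paths
# (column 8), and join them with commas without leaving a trailing comma.
FILES=$(hadoop fs -ls "$HDFS_PATH" | grep "httpd" | sort -k6,7 | tail -n 5 | awk '{ print $8 }' | paste -sd, -)
pig -f <pig_script> -p files="$FILES"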

and in your Pig script:

a = load '$files' using PigStorage(',');
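
The list-building step can be tried without a cluster by feeding it a saved listing. The sketch below assumes the usual eight-column `hadoop fs -ls` layout (date and time in columns 6 and 7, path in column 8). The function name `latest_files` and the sample listing are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read `hadoop fs -ls`-style lines on stdin and print
# the N most recent paths matching a pattern as one comma-separated string.
# Assumes columns 6-7 hold date and time and column 8 holds the path.
latest_files() {
  local pattern=$1 count=$2
  grep -- "$pattern" | sort -k6,7 | tail -n "$count" | awk '{ print $8 }' | paste -sd, -
}

# Made-up sample listing standing in for real `hadoop fs -ls` output.
sample_listing() {
  cat <<'EOF'
Found 4 items
-rw-r--r--   3 hadoop supergroup   1024 2013-01-03 10:00 /user/hadoop/test/httpd-3.log
-rw-r--r--   3 hadoop supergroup   2048 2013-01-01 09:30 /user/hadoop/test/httpd-1.log
-rw-r--r--   3 hadoop supergroup    512 2013-01-02 23:15 /user/hadoop/test/httpd-2.log
-rw-r--r--   3 hadoop supergroup    100 2013-01-04 08:00 /user/hadoop/test/other.csv
EOF
}

FILES=$(sample_listing | latest_files httpd 2)
echo "$FILES"
```

With this sample, the two newest httpd files (httpd-2.log and httpd-3.log) come out joined by a single comma, which is the form the `$files` parameter expects. On a real cluster you would pipe `hadoop fs -ls "$HDFS_PATH"` into the function instead of the sample listing.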

Romain

I am not sure whether it can be done directly in Pig, but I usually use a combination of hadoop and awk to do this:

hadoop fs -ls /testing/*.data  | awk '{print $8}' | xargs -I {} pig -param input={} test.pig

The example above assumes you want to process every .data file under the /testing folder in HDFS (use a plain * if you want files of every type). It lists the files, uses awk to keep only the file path column, and then uses xargs to pass each path to your Pig script, which should take input as a parameter.

The Pig file should contain something like this:

a = load '$input' using PigStorage(',');
dump a;
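
If xargs feels opaque, the same per-file dispatch can be written as an explicit loop. This is a sketch rather than a drop-in script: the function name `run_per_file` and the `PIG_BIN` variable are hypothetical, and `PIG_BIN` defaults to `echo pig`, so the commands are only printed (a dry run) rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: read one HDFS path per line on stdin and invoke the
# Pig script once per path, passing the path as the `input` parameter.
# PIG_BIN is an assumed variable; its default only prints the command.
run_per_file() {
  local script=$1 path
  while IFS= read -r path; do
    [ -n "$path" ] || continue   # skip blank lines
    # Redirect stdin so the command cannot swallow the remaining paths.
    ${PIG_BIN:-echo pig} -param "input=$path" "$script" < /dev/null
  done
}

# Dry run with made-up paths; on a cluster the list would come from
# hadoop fs -ls /testing/*.data | awk '{print $8}'
printf '%s\n' /testing/a.data /testing/b.data | run_per_file test.pig
```

Setting `PIG_BIN=pig` before the call would run the real command once per file. Each file then becomes a separate Pig job, which keeps per-file results apart but costs one job startup per file.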

I just read the whole directory in and don't attempt to deal with individual files.

a= load '/user/Hadoop/test/' using PigStorage(',');

You will get a single relation with the contents of all the CSV files.
