This seems simple, and I could easily do it in another language, but I'm struggling to find the answer using Pig. One or more log files can be dumped to a directory daily. I want to iterate over the files and pass each one to a Pig script for processing. I'm aware of parameter substitution, but how do I loop through the files and call my Pig script for each one? Thanks.
You can use a wildcard directly in Pig:
a= load '/user/hadoop/test/*.csv' using PigStorage(',');
dump a;
If your files are stored on your local drive, you need to upload them to HDFS first:
hadoop fs -put /localdir/*.csv /user/hadoop/test
If you need to pass a specific list of files to Pig, you can write a bash script that builds the list and then calls the pig command. For example, if you want to launch your Pig script on the 5 latest files named "httpd-.log", create a .sh file:
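#!/bin/bash
set -e
HDFS_PATH=/user/hadoop/test
# Keep the httpd entries, sort by modification date and time (columns 6 and 7),
# take the 5 newest paths (column 8) and join them with commas.
# paste avoids the trailing comma that tr '\n' ',' would leave behind.
FILES=$(hadoop fs -ls "$HDFS_PATH" | grep "httpd" | sort -k6,7 | tail -5 | awk '{ print $8 }' | paste -sd, -)
pig -f <pig_script> -p files="$FILES"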
and in your Pig script (load accepts a comma-separated list of paths):
a= load '$files' using PigStorage(',');
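The list-building pipeline in the script above can be pulled out into a function so that it can be checked against sample text without a cluster. This is only a sketch: the function name latest_files is hypothetical, and it assumes the usual `hadoop fs -ls` column layout (permissions, replication, owner, group, size, date, time, path).

```shell
#!/bin/bash
# Sketch: read `hadoop fs -ls`-style lines on stdin and print the N most
# recent paths matching a pattern as one comma-separated string.
# Assumed columns: perms, replication, owner, group, size, date, time, path.
latest_files() {
    local count="$1" pattern="$2"
    grep -- "$pattern" \
        | sort -k6,7 \
        | tail -n "$count" \
        | awk '{ print $8 }' \
        | paste -sd, -
}

# Hypothetical usage against a real cluster:
#   FILES=$(hadoop fs -ls /user/hadoop/test | latest_files 5 httpd)
#   pig -f <pig_script> -p files="$FILES"
```

Sorting on the date and time columns works because both are printed in a format that sorts chronologically as plain text.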
Romain
I am not sure whether it can be done directly in Pig, but I usually use a combination of hadoop and awk to do this:
hadoop fs -ls /testing/*.data | awk '{print $8}' | xargs -I {} pig -param input={} test.pig
The example above assumes you have to process all the .data files under the testing folder in HDFS (if you want every file type, just use a *). You list the files, use awk to keep only the file path column, and then use xargs to pass each path to your Pig script. Your Pig script should use input as a parameter.
The Pig file should contain something like this:
a= load '$input' using PigStorage(',');
dump a;
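If you prefer an explicit loop to xargs, the same idea can be sketched as a bash function that runs one Pig job per path. The function name run_pig_per_file and the PIG_BIN override are hypothetical, added only so the loop can be tried with a stand-in command.

```shell
#!/bin/bash
# Sketch: run one Pig job per path read from stdin.
# PIG_BIN is a hypothetical override for the pig executable (defaults to pig).
run_pig_per_file() {
    local script="$1" path
    while IFS= read -r path; do
        # Skip blank lines, e.g. the empty column 8 of a "Found N items" header.
        [ -n "$path" ] || continue
        # Redirect stdin so the job cannot consume the remaining paths.
        "${PIG_BIN:-pig}" -param "input=$path" "$script" < /dev/null || return 1
    done
}

# Hypothetical usage:
#   hadoop fs -ls /testing/*.data | awk '{print $8}' | run_pig_per_file test.pig
```

The loop stops at the first failing job, which xargs does not do by default, so a bad file is noticed before the remaining files are processed.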
I just read the whole directory in and don't attempt to deal with individual files.
a= load '/user/Hadoop/test/' using PigStorage(',');
You will get a single relation with the contents of all the CSV files.