
How to use regex to include/exclude some input files in sc.textFile?

I am trying to filter input files by date when loading them into an RDD with Apache Spark's file-to-RDD function, sc.textFile() .

I have attempted to do the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

Looking at the accepted answer, the pattern appears to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat .

Searching reveals that paths supplied to FileInputFormat 's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Presumably, SparkContext also uses those APIs to set the path.

The syntax of the glob includes:

  • * (matches zero or more characters)
  • ? (matches a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)

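The tokens above can be approximated locally with a small glob-to-regex translator. This is only an illustrative sketch of the syntax table, not Hadoop's actual glob implementation, and it assumes no nested braces:

```python
import re

def glob_to_regex(glob: str) -> str:
    """Translate a Hadoop-style glob into an anchored regex (illustrative only)."""
    out, i = [], 0
    while i < len(glob):
        c = glob[i]
        if c == "*":                      # * matches zero or more characters
            out.append(".*")
        elif c == "?":                    # ? matches a single character
            out.append(".")
        elif c == "[":                    # [ab], [^ab], [a-b] map directly to regex
            j = glob.index("]", i + 1)
            out.append(glob[i:j + 1])
            i = j
        elif c == "{":                    # {a,b} becomes a non-capturing alternation
            j = glob.index("}", i + 1)    # assumes no nested braces
            alts = glob[i + 1:j].split(",")
            out.append("(?:" + "|".join(glob_to_regex(a)[1:-1] for a in alts) + ")")
            i = j
        elif c == "\\":                   # \c escapes the next character
            i += 1
            out.append(re.escape(glob[i]))
        else:
            out.append(re.escape(c))
        i += 1
    return "^" + "".join(out) + "$"

pattern = re.compile(glob_to_regex("201507{2[7-9],3[0-1]}*"))
print(bool(pattern.match("201507270010033.gz")))  # True  (day 27)
print(bool(pattern.match("201507300060052.gz")))  # True  (day 30)
print(bool(pattern.match("201507260000000.gz")))  # False (day 26)
```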
Following the example in the accepted answer, it is possible to write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")

It's not clear how alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
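To sanity-check this pattern without a Hadoop cluster, the brace syntax can be hand-translated into an equivalent regular expression. This is an approximation of Hadoop's glob matching for the sample paths above, not the real code path Spark takes:

```python
import re

# /user/Orders/201507{2[7-9],3[0-1]}* rewritten as a plain regex
pattern = re.compile(r"/user/Orders/201507(?:2[7-9]|3[0-1]).*")

paths = [
    "/user/Orders/201507270010033.gz",  # day 27 -> should match
    "/user/Orders/201507300060052.gz",  # day 30 -> should match
    "/user/Orders/201507260000000.gz",  # day 26 -> should not match
]
for p in paths:
    print(p, bool(pattern.fullmatch(p)))
```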
