How to use regex to include/exclude some input files in sc.textFile?
I am attempting to filter on dates for specific files using Apache Spark inside the file-to-RDD function sc.textFile().

I have attempted to do the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?
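The intent of the pattern in the question, written as an ordinary regular expression rather than a glob, can be checked locally against the example filenames. A minimal sketch in Python (the third filename is a hypothetical non-matching example added for contrast):

```python
import re

# Day 27-29 or 30-31 of July 2015, followed by the rest of the file name.
pattern = re.compile(r"/user/Orders/201507(2[7-9]|3[0-1]).*")

files = [
    "/user/Orders/201507270010033.gz",  # day 27 -> should match
    "/user/Orders/201507300060052.gz",  # day 30 -> should match
    "/user/Orders/201507260010033.gz",  # day 26 -> should not match
]

matches = [f for f in files if pattern.fullmatch(f)]
print(matches)
```

This confirms the regex is sound; the problem is that sc.textFile() expects Hadoop glob syntax, not regex, so the `(…|…)` group is not interpreted as alternation.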
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The syntax of the glob includes:

* (match 0 or more characters)
? (match a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)

Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
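The two globs in that comma-delimited list can be checked locally before submitting a job. A sketch using Python's fnmatch, which shares the *, ?, and [a-b] forms with Hadoop's glob syntax (assumption: fnmatch does not support the {a,b} alternation form, so each glob is tested separately; the day-26 filename is a hypothetical non-matching example):

```python
from fnmatch import fnmatch

# The comma-delimited Hadoop path list, split into its two globs.
globs = ["/user/Orders/2015072[7-9]*", "/user/Orders/2015073[0-1]*"]

files = [
    "/user/Orders/201507270010033.gz",  # day 27, matched by the first glob
    "/user/Orders/201507300060052.gz",  # day 30, matched by the second glob
    "/user/Orders/201507260010033.gz",  # day 26, outside both ranges
]

# A file is selected if any glob in the list matches it.
matched = [f for f in files if any(fnmatch(f, g) for g in globs)]
print(matched)
```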
It's not clear how the alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")