How to use regex to include/exclude some input files in sc.textFile?
I am attempting to filter on dates for specific files using Apache Spark inside the file-to-RDD function sc.textFile().

I have attempted to do the following:
sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")
This should match the following:
/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz
Any idea how to achieve this?
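The intent of the pattern in the question, written as an ordinary regular expression rather than a glob, can be checked locally against the example filenames. A minimal sketch in Python (the third filename is a hypothetical non-matching example added for contrast):

```python
import re

# Day 27-29 or 30-31 of July 2015, followed by the rest of the file name.
pattern = re.compile(r"/user/Orders/201507(2[7-9]|3[0-1]).*")

files = [
    "/user/Orders/201507270010033.gz",  # day 27 -> should match
    "/user/Orders/201507300060052.gz",  # day 30 -> should match
    "/user/Orders/201507260010033.gz",  # day 26 -> should not match
]

matches = [f for f in files if pattern.fullmatch(f)]
print(matches)
```

This confirms the regex is sound; the problem is that sc.textFile() expects Hadoop glob syntax, not regex, so the `(…|…)` group is not interpreted as alternation.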
Looking at the accepted answer, it seems to use some form of glob syntax. It also reveals that the API is an exposure of Hadoop's FileInputFormat.
Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPath "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.
The syntax of the glob includes:

* (match 0 or more characters)
? (match a single character)
[ab] (character class)
[^ab] (negated character class)
[a-b] (character range)
{a,b} (alternation)
\c (escape character)

Following the example in the accepted answer, it is possible to write your path as:
sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")
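The two globs in that comma-delimited list can be checked locally before submitting a job. A sketch using Python's fnmatch, which shares the *, ?, and [a-b] forms with Hadoop's glob syntax (assumption: fnmatch does not support the {a,b} alternation form, so each glob is tested separately; the day-26 filename is a hypothetical non-matching example):

```python
from fnmatch import fnmatch

# The comma-delimited Hadoop path list, split into its two globs.
globs = ["/user/Orders/2015072[7-9]*", "/user/Orders/2015073[0-1]*"]

files = [
    "/user/Orders/201507270010033.gz",  # day 27, matched by the first glob
    "/user/Orders/201507300060052.gz",  # day 30, matched by the second glob
    "/user/Orders/201507260010033.gz",  # day 26, outside both ranges
]

# A file is selected if any glob in the list matches it.
matched = [f for f in files if any(fnmatch(f, g) for g in globs)]
print(matched)
```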
It's not clear how the alternation syntax can be used here, since the comma is used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:
sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")