
How to use regex to include/exclude some input files in sc.textFile?

I am trying to restrict the input to files for specific dates when loading them into an RDD with Apache Spark's sc.textFile().

I have attempted the following:

sc.textFile("/user/Orders/201507(2[7-9]{1}|3[0-1]{1})*")

This should match the following:

/user/Orders/201507270010033.gz
/user/Orders/201507300060052.gz

Any idea how to achieve this?

Looking at the accepted answer, sc.textFile seems to accept some form of glob syntax rather than a regular expression. The answer also reveals that the API is an exposure of Hadoop's FileInputFormat.

Searching reveals that paths supplied to FileInputFormat's addInputPath or setInputPaths "may represent a file, a directory, or, by using glob, a collection of files and directories". Perhaps SparkContext also uses those APIs to set the path.

The glob syntax includes:

  • * (match 0 or more characters)
  • ? (match a single character)
  • [ab] (character class)
  • [^ab] (negated character class)
  • [a-b] (character range)
  • {a,b} (alternation)
  • \c (escape character)
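
To get a feel for how these patterns behave without a cluster, here is a minimal local sketch in Python. It uses the standard-library fnmatch module, which is Python's own glob matcher and not Hadoop's: it agrees with the list above on *, ? and [...] classes and ranges, but it does not implement [^ab] or {a,b}. The third file name is a made-up counterexample, not one from the question.

```python
from fnmatch import fnmatchcase

# The first two names come from the question; the third is a hypothetical
# file dated 26 July, added only to show a non-match.
names = [
    "201507270010033.gz",
    "201507300060052.gz",
    "201507260010033.gz",
]

# '*' matches any run of characters, '[7-9]' is a character range.
pattern = "2015072[7-9]*"
matched = [n for n in names if fnmatchcase(n, pattern)]

# '?' matches exactly one character: seven of them cover the digits
# that follow the date prefix 20150727.
single_char_match = fnmatchcase("201507270010033.gz", "20150727???????.gz")

print(matched)
print(single_char_match)
```

Only the 27 July file passes the first pattern, which is why a second pattern (or alternation) is needed to cover the 30th and 31st.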

Following the example in the accepted answer, you can write your path as:

sc.textFile("/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*")

It's not clear how the alternation syntax can be used here, since a comma is also used to delimit a list of paths (as shown above). According to zero323's comment, no escaping is necessary:

sc.textFile("/user/Orders/201507{2[7-9],3[0-1]}*")
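
One plausible explanation for why no escaping is needed is that the path string is split only on commas that sit outside curly braces. The following Python sketch mimics that assumed behavior and then expands the braces by hand, so both spellings of the path can be checked against the file names from the question. The helper names split_paths, expand_braces and matches are hypothetical, the brace expansion handles only non-nested groups, and fnmatch stands in for Hadoop's matcher; none of this is Spark or Hadoop code.

```python
from fnmatch import fnmatchcase


def split_paths(spec):
    """Split a path list on commas that are not inside curly braces."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(spec):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(spec[start:i])
            start = i + 1
    parts.append(spec[start:])
    return parts


def expand_braces(pattern):
    """Expand non-nested {a,b} groups into one pattern per alternative."""
    open_idx = pattern.find("{")
    if open_idx == -1:
        return [pattern]
    close_idx = pattern.index("}", open_idx)
    head, tail = pattern[:open_idx], pattern[close_idx + 1:]
    options = pattern[open_idx + 1:close_idx].split(",")
    return [p for opt in options for p in expand_braces(head + opt + tail)]


def matches(path, spec):
    """True if path matches any glob obtained from the path list."""
    return any(
        fnmatchcase(path, glob)
        for part in split_paths(spec)
        for glob in expand_braces(part)
    )


files = [
    "/user/Orders/201507270010033.gz",
    "/user/Orders/201507300060052.gz",
]
comma_form = "/user/Orders/2015072[7-9]*,/user/Orders/2015073[0-1]*"
brace_form = "/user/Orders/201507{2[7-9],3[0-1]}*"

for spec in (comma_form, brace_form):
    print(spec, [matches(f, spec) for f in files])
```

Under this reading, the brace form stays a single path after splitting and expands to the same two globs as the comma-separated form.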
