How can I efficiently join a dataframe in spark with a directory of small files?

I have a dataframe (df1) with the columns id, date, type. I need to join it with a file /descriptions/{id}.txt for each id. The result of the join should be a dataframe with id, date, type, description for all entries in df1.

The goal is to use this preprocessed dataframe for further analysis so I don't have to deal with the small files anymore.

Worth noting: there are a lot more small description files than I need (x1000), so I think some kind of "lazy join" would be more efficient than reading all of the small files upfront and joining them.

How would you build this in Spark? I currently use Scala, but if you have a Python example I think I could work with that as well.

Option 1, reading all the files:

You could read the descriptions with wholeTextFiles. It returns an RDD mapping each file path to its content; you can then join the result with your dataframe.

import spark.implicits._ // needed for .toDF outside of spark-shell

val descriptions = sc
    .wholeTextFiles(".../descriptions")
    // We need to extract the id from the file path
    .map{ case (path, desc) =>
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> desc
    }
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))

Option 2, reading only the files you need

For that, you could use binaryFiles. It creates an RDD that maps each file path to a PortableDataStream, so the files are not read right away. You can then select all the distinct ids from df1, join them with the RDD, and only read the content of the files you need. The code would look like this:

val idRDD = df1
    .select("id").distinct
    .rdd.map(_.getAs[Long]("id") -> true)

val descriptions = sc.binaryFiles(".../descriptions")
    // same as before, but the description is not read yet
    .map{ case (path, descFile) => {
        val fileName = path.split("/").last
        val id = "[0-9]+".r.findFirstIn(fileName)
        id.get.toLong -> descFile
    }} // inner join with the ids we are interested in
    .join(idRDD)
    .map{ case(id, (file, _)) => id -> file}
    // reading the files
    .mapValues(file => {
         val reader = scala.io.Source.fromInputStream(file.open)
         val desc = reader.getLines.mkString("\n")
         reader.close
         desc
    })
    .toDF("id", "description")

val result = df1.join(descriptions, Seq("id"))
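
Since the stated goal is to reuse the preprocessed dataframe for further analysis, you would typically persist the joined result once. A minimal sketch, assuming Parquet as the output format and an illustrative output path (both are my assumptions, not part of the question):

// Write the joined dataframe out once so later analyses can skip the small files.
// The output path and the Parquet format are illustrative assumptions.
result
    .write
    .mode("overwrite")
    .parquet(".../descriptions_joined")

// Later analyses can then read the consolidated data directly:
val preprocessed = spark.read.parquet(".../descriptions_joined")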
