How can I efficiently compute a WordCount for each file?

Question

I have tens of thousands of files in the directory demotxt, for example:

demotxt/
    aa.txt
            this is aaa1
            this is aaa2
            this is aaa3
    bb.txt
            this is bbb1
            this is bbb2
            this is bbb3
            this is bbb4
    cc.txt
            this is ccc1
            this is ccc2

I want to use Spark 2.4 (Scala or Python) to efficiently create a WordCount for each .txt file in this directory.

# target result is:
aa.txt:  (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1)
bb.txt:  (this,4), (is,4), (bbb1,1), (bbb2,1), (bbb3,1), (bbb4,1)
cc.txt:  (this,2), (is,2), (ccc1,1), (ccc2,1)

The code might look something like this?

def dealWithOneFile(path2File):
  res = wordcountFor(path2File)
  saveResultToDB(res)
sc.wholeTextFiles(rootDir).map(dealWithOneFile)

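It seems that with sc.textFile(".../demotxt/") Spark will load all the files, which may cause memory problems, and it will treat all the files as a single one, which is not what I want.

So I am wondering how I should do this. Thanks a lot!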

Answer 1

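Here is one approach. It can be done with either a DataFrame or an RDD. I show the RDD version here, on Databricks, since you did not state which one you are using. It is in Scala.

It is hard to explain, but it works. Try it with some input.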

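%scala
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._  // already in scope in a Databricks notebook
val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
// Read every line together with the path of the file it came from
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name(), $"value").as[(String, String)].rdd
// Emit ((file, word), 1) for every word on every line
val rdd2 = rdd.flatMap(x => x._2.split("\\s+").map(y => ((x._1, y), 1)))
// Sum the counts per (file, word), then re-key by file: (file, (word, count))
val rdd3 = rdd2.reduceByKey(_ + _).map { case (x, y) => (x._1, (x._2, y)) }
rdd3.collect
// Gather all (word, count) pairs belonging to the same file
val rdd4 = rdd3.groupByKey()
rdd4.collect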
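
Since the pipeline above is hard to describe in words, here is a rough plain-Python walk-through of the same three stages (flatMap, reduceByKey, then re-key and groupByKey). It does not use Spark at all and is only meant to show how the data changes shape at each step. The `records` list and the `per_file_wordcount` helper are made up for illustration; the file contents are taken from the question.

```python
from collections import defaultdict

# Hypothetical (file, line) records, standing in for what input_file_name
# plus the "value" column produce in the Scala snippet.
records = [
    ("aa.txt", "this is aaa1"),
    ("aa.txt", "this is aaa2"),
    ("aa.txt", "this is aaa3"),
    ("cc.txt", "this is ccc1"),
    ("cc.txt", "this is ccc2"),
]


def per_file_wordcount(records):
    # Stages 1 and 2 (flatMap + reduceByKey): count every (file, word) pair.
    pair_counts = defaultdict(int)
    for file_name, line in records:
        for word in line.split():
            pair_counts[(file_name, word)] += 1

    # Stage 3 (map + groupByKey): re-key by file and collect (word, count).
    by_file = defaultdict(list)
    for (file_name, word), count in pair_counts.items():
        by_file[file_name].append((word, count))
    return dict(by_file)


result = per_file_wordcount(records)
for file_name, counts in sorted(result.items()):
    print(file_name, counts)
```

The key point is that the word is counted under the composite key (file, word), so counts from different files never get merged. One difference to be aware of: Python's `line.split()` drops empty tokens, whereas `split("\\s+")` in the Scala version yields a leading empty string for lines that start with whitespace, so you may want to trim lines first.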
