How to efficiently compute a WordCount for each file?

Question

I have tens of thousands of files in the directory demotxt, for example:

demotxt/
    aa.txt
            this is aaa1
            this is aaa2
            this is aaa3
    bb.txt
            this is bbb1
            this is bbb2
            this is bbb3
            this is bbb4
    cc.txt
            this is ccc1
            this is ccc2

I want to use Spark 2.4 (Scala or Python) to efficiently produce a WordCount for each .txt file in this directory.

# target result is:
aa.txt:  (this,3), (is,3), (aaa1,1), (aaa2,1), (aaa3,1) 
bb.txt:  (this,4), (is,4), (bbb1,1), (bbb2,1), (bbb3,1), (bbb4,1)
cc.txt:  (this,2), (is,2), (ccc1,1), (ccc2,1)

Might the code look something like this?

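def dealWithOneFile(pathAndContent):
  # wholeTextFiles yields one (path, content) pair per file
  res = wordcountFor(pathAndContent)
  saveResultToDB(res)
# foreach is an action; map alone is lazy and would never run
sc.wholeTextFiles(rootDir).foreach(dealWithOneFile)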

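It seems that with sc.textFile(".../demotxt/") Spark loads all the files, which may cause memory problems, and it also treats all the files as a single input, which is not what I want.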

So I would like to know how I should do this. Many thanks!

Answer 1

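Here is one approach. It works with either a DataFrame or an RDD. I show the RDD version here, run on Databricks, since you did not state a preference. It is written in Scala.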

It is hard to explain, but it works. Try it with some input.

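%scala
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._  // for $"value" and .as[...]; already in scope in Databricks notebooks

val paths = Seq("/FileStore/tables/fff_1.txt", "/FileStore/tables/fff_2.txt")
// One record per line of text, tagged with the file it came from
val rdd = spark.read.format("text").load(paths: _*).select(input_file_name(), $"value").as[(String, String)].rdd
// Emit ((file, word), 1) for every word of every line
val rdd2 = rdd.flatMap(x => x._2.split("\\s+").map(y => ((x._1, y), 1)))
// Sum the counts per (file, word), then re-key by file: (file, (word, count))
val rdd3 = rdd2.reduceByKey(_ + _).map { case (x, y) => (x._1, (x._2, y)) }
rdd3.collect
// Gather all (word, count) pairs of each file into one record
val rdd4 = rdd3.groupByKey()
rdd4.collect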
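
Since the steps are hard to explain in words, here is a rough plain-Python trace of the same keying idea, with no Spark involved. It is only an illustration under my own assumptions: the dictionary `files` is the sample data from the question typed in by hand, and the names `records`, `pair_counts` and `per_file` are made up for this sketch. Note that `str.split()` drops leading whitespace, whereas `split("\\s+")` in the Scala code would produce an empty first token for an indented line.

```python
from collections import defaultdict

# Sample data copied by hand from the question: file name -> lines of text.
files = {
    "aa.txt": ["this is aaa1", "this is aaa2", "this is aaa3"],
    "bb.txt": ["this is bbb1", "this is bbb2", "this is bbb3", "this is bbb4"],
    "cc.txt": ["this is ccc1", "this is ccc2"],
}

# Like rdd: one (file, line) record per line of text.
records = [(name, line) for name, lines in files.items() for line in lines]

# Like rdd2 followed by reduceByKey: count every (file, word) pair.
pair_counts = defaultdict(int)
for name, line in records:
    for word in line.split():
        pair_counts[(name, word)] += 1

# Like rdd3 followed by groupByKey: re-key by file and gather its (word, count) pairs.
per_file = defaultdict(dict)
for (name, word), n in pair_counts.items():
    per_file[name][word] = n

for name, counts in per_file.items():
    print(name, sorted(counts.items()))
```

The point of keying on (file, word) rather than on the word alone is that counts from different files never get merged, so the whole directory can be read in one pass and still yield one word count per file.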

Answer 1: accepted, 2022-12-05 21:30:41
