Load a .csv file from HDFS in Scala
So I basically have the following code to read a .csv file and store it in an Array[Array[String]]:
def load(filepath: String): Array[Array[String]] = {
  var data = Array[Array[String]]()
  val bufferedSource = io.Source.fromFile(filepath)
  for (line <- bufferedSource.getLines) {
    data = data :+ line.split(",").map(_.trim)
  }
  bufferedSource.close
  return data.slice(1, data.length - 1) // skip header
}
This works fine for files that are not stored on HDFS. However, when I try the same thing on HDFS, I get:

No such file or directory
When writing a file to HDFS I also had to change my original code and add some FileSystem and Path arguments to the PrintWriter, but this time I have no idea how to do it.
This is as far as I got:
def load(filepath: String, sc: SparkContext): Array[Array[String]] = {
  var data = Array[Array[String]]()
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val stream = fs.open(new Path(filepath))
  var line = ""
  while ((line = stream.readLine()) != null) {
    data = data :+ line.split(",").map(_.trim)
  }
  return data.slice(1, data.length - 1) // skip header
}
This should work, but I get a NullPointerException when comparing the line with null, or when checking that its length is greater than 0.
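(A likely cause, worth noting: in Scala an assignment expression has type `Unit`, not the assigned value as in Java, so `(line = stream.readLine()) != null` compares `Unit` to `null` and is always true; the loop keeps running after `readLine` returns null, and `line.split` then throws. A minimal demonstration of this language quirk:)

```scala
object UnitAssignmentDemo extends App {
  var s: String = "x"
  // The assignment (s = null) evaluates to (), i.e. Unit,
  // so the comparison below is always true -- even though
  // s itself is now null:
  val cond = (s = null) != null
  println(cond) // prints "true"; s.length here would throw NullPointerException
}
```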
This code will read a .csv file from HDFS:
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def read(filepath: String, sc: SparkContext): ArrayBuffer[Array[String]] = {
  val data = ArrayBuffer[Array[String]]()
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val stream = fs.open(new Path(filepath))
  // read the first line before the loop, so the null check
  // happens on a String rather than on an assignment expression
  var line = stream.readLine()
  while (line != null) {
    val row = line.split(",").map(_.trim)
    data += row
    line = stream.readLine()
  }
  stream.close()
  data // or data.drop(1) to skip the header
}
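Note that `readLine` on the HDFS stream comes from `java.io.DataInputStream` and is deprecated (it does not handle character encodings properly). A sketch of the same loop wrapping the stream in a `BufferedReader` instead, assuming UTF-8 data (`readBuffered` is just an illustrative name):

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

def readBuffered(filepath: String, sc: SparkContext): ArrayBuffer[Array[String]] = {
  val data = ArrayBuffer[Array[String]]()
  val fs = FileSystem.get(sc.hadoopConfiguration)
  // decode the raw HDFS byte stream as UTF-8 text
  val reader = new BufferedReader(
    new InputStreamReader(fs.open(new Path(filepath)), StandardCharsets.UTF_8))
  try {
    var line = reader.readLine()
    while (line != null) {
      data += line.split(",").map(_.trim)
      line = reader.readLine()
    }
  } finally {
    reader.close() // also closes the underlying HDFS stream
  }
  data
}
```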
Read this article on reading CSV files by Alvin Alexander, author of the Scala Cookbook:
object CSVDemo extends App {
  println("Month, Income, Expenses, Profit")
  val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    // do whatever you want with the columns here
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
  bufferedSource.close
}
You just need to get an InputStream from HDFS and substitute it into this snippet.
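For example, a sketch of that substitution, swapping `io.Source.fromFile` for `io.Source.fromInputStream` over an opened HDFS path (the `/tmp/finance.csv` path is just the example from the snippet above):

```scala
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CSVFromHdfsDemo extends App {
  val fs = FileSystem.get(new Configuration())
  val stream = fs.open(new Path("/tmp/finance.csv"))
  // Source.fromInputStream gives the same getLines interface as fromFile
  val bufferedSource = Source.fromInputStream(stream)
  for (line <- bufferedSource.getLines) {
    val cols = line.split(",").map(_.trim)
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
  }
  bufferedSource.close() // closes the wrapped HDFS stream as well
}
```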