AWS EMR Spark: fetching a CSV and using it with the Spark SQL API

Question

// download the CSV file
ByteArrayOutputStream downloadedFile = downloadFile();

// save the file as a CSV in the temp folder
java.io.File tmpCsvFile = save(downloadedFile);

// read it with Spark SQL
Dataset<Row> ds = session
        .read()
        .option("header", "true")
        .csv(tmpCsvFile.getAbsolutePath());

tmpCsvFile is saved at the following path:

/mnt/yarn/usercache/hadoop/appcache/application_1511379756333_0001/container_1511379756333_0001_02_000001/tmp/1OkYaovxMsmR7iPoPnb8mx45MWvwr6k1y9xIdh8g7K0Q3118887242212394029.csv

Exception when reading:

org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-33-33-33-33.ec2.internal:8020/mnt/yarn/usercache/hadoop/appcache/application_1511379756333_0001/container_1511379756333_0001_02_000001/tmp/1OkYaovxMsmR7iPoPnb8mx45MWvwr6k1y9xIdh8g7K0Q3118887242212394029.csv;

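I think the problem is that the file is saved on the local disk, so it cannot be found when I try to read it through the spark-sql API, which looks for it on HDFS. I have already tried sparkContext.addFile(), but it does not work.

Is there any way to do this?

Thanks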

Answer 1

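Spark supports a large number of file systems for reading and writing:

Local/regular (file://)
S3 (s3://)
HDFS (hdfs://)

As standard behavior, if no URI scheme is specified, spark-sql uses hdfs://driver_address:port/path.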
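
The default behavior described above can be illustrated with a small, standard-library-only sketch. It is not Hadoop's or Spark's actual resolution code; the class name DefaultFsSketch, the method qualify, and the shortened path /mnt/yarn/tmp/data.csv are hypothetical and exist only for illustration. It assumes the default file system is the HDFS address from the exception in the question and handles absolute paths only.

```java
import java.net.URI;

// Illustration only: mimics how an absolute path without a scheme ends up on
// the default file system. This is NOT Hadoop's implementation.
final class DefaultFsSketch {

    // Returns the path unchanged if it already names a scheme (file://, s3://,
    // hdfs://); otherwise prefixes it with the assumed default file system.
    static String qualify(String path, String defaultFs) {
        if (URI.create(path).getScheme() != null) {
            return path;
        }
        if (!path.startsWith("/")) {
            // Relative paths depend on a working directory, which this sketch ignores.
            throw new IllegalArgumentException("absolute path expected: " + path);
        }
        String base = defaultFs.endsWith("/")
                ? defaultFs.substring(0, defaultFs.length() - 1)
                : defaultFs;
        return base + path;
    }

    public static void main(String[] args) {
        // Assumed default file system, taken from the exception message.
        String defaultFs = "hdfs://ip-33-33-33-33.ec2.internal:8020";
        System.out.println(qualify("/mnt/yarn/tmp/data.csv", defaultFs));
        System.out.println(qualify("file:///mnt/yarn/tmp/data.csv", defaultFs));
    }
}
```

This matches the exception in the question: the local temp path had no scheme, so it was looked up on HDFS, where the file does not exist.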

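The solution of adding file:/// to the path only works in client mode, not in my case (cluster mode). When the driver creates the task that reads the file, the task is handed to an executor on one of the nodes, and that node does not have the file.

What can we do? Write the file to Hadoop.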

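   // needs: java.io.*, java.net.URI, org.apache.hadoop.conf.Configuration,
   // org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.io.IOUtils
   Configuration conf = new Configuration();
   ByteArrayOutputStream downloadedFile = downloadFile();
   // convert the output stream into an input stream
   InputStream is = new ByteArrayInputStream(downloadedFile.toByteArray());
   // destination; with no scheme it resolves against the default file system (HDFS here)
   String myfile = "miofile.csv";
   // acquire the file system
   FileSystem fs = FileSystem.get(URI.create(myfile), conf);
   // open an output stream to Hadoop
   OutputStream outf = fs.create(new Path(myfile));
   // write the file; the final 'true' closes both streams when the copy ends
   IOUtils.copyBytes(is, outf, 4096, true);
   // read the file back through Spark SQL
   Dataset<Row> ds = session
       .read()
       .option("header", "true")
       .csv(myfile);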
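
The conversion from the downloaded ByteArrayOutputStream to an InputStream can be isolated in a small helper. The following is a minimal sketch using only the standard library; the class name StreamBridge, the method toInputStream, and the sample CSV content are hypothetical and not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns an in-memory output stream into a readable stream.
final class StreamBridge {

    // Copies the bytes written so far into a new ByteArrayInputStream.
    // Later writes to 'out' are not visible through the returned stream.
    static ByteArrayInputStream toInputStream(ByteArrayOutputStream out) {
        return new ByteArrayInputStream(out.toByteArray());
    }

    public static void main(String[] args) {
        // Stand-in for the downloaded file; the content is made up.
        byte[] csv = "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream downloaded = new ByteArrayOutputStream();
        downloaded.write(csv, 0, csv.length);

        ByteArrayInputStream in = toInputStream(downloaded);
        byte[] buf = new byte[downloaded.size()];
        int n = in.read(buf, 0, buf.length);
        System.out.print(new String(buf, 0, n, StandardCharsets.UTF_8));
    }
}
```

Note that toByteArray() copies the whole buffer, so this approach assumes the downloaded file fits comfortably in the driver's memory.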

Thanks; any better solution is welcome.

Answer 1: accepted, 2017-11-23 07:36:24 (score 2)
