
Unable to read file in Java Spark

I am trying to run a Spark program in Java using Eclipse. It runs if I simply print something to the console, but I am not able to read any file using the textFile function. I have read somewhere that reading a file can only be done through HDFS, but I cannot get that working on my local system. Please let me know how to access/read a file, and if HDFS is required, how to install HDFS on my local system so that I can read the text file.

Here is the code I am testing with. The program otherwise runs fine, but it cannot read the file and fails with "Input path does not exist".

package spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[])
    {
        String[] jars = {"D:\\customJars\\spark.jar"};
        System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark")
                .setMaster("spark://10.1.50.165:7077")
                .setJars(jars);

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "./forecaster.txt" ;
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {

            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
              public Boolean call(String s) { return s.contains("b"); }
            }).count();

         System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }

}

My file structure: (screenshot of project structure)

Update: the file name does not have a .txt extension, but you are using one in your application. You should use it as String inputFileName = "forecaster" ;

If the file is in the same folder as the Java class TestSpark ( $APP_HOME ):

String inputFileName = "forecaster.txt" ;

If the file is in a Data directory under your Spark project:

String inputFileName = "Data\\forecaster.txt" ;

Or use the fully qualified path; the log from the test run below shows:

16/08/03 08:25:25 INFO HadoopRDD: Input split: file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
String inputFileName = "file:/C:/Users/user123/worksapce/spark-java/forecaster.txt" ;

For example, I copied your code and ran it in my local environment:

This is how my project is set up, and I run it with:

 String inputFileName = "forecaster.txt" ;

Test file:

this is test file
aaa
bbb
ddddaaee
ewwww
aaaa
a
a
aaaa
bb


Code that I used:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[])
    {
       // String[] jars = {"D:\\customJars\\spark.jar"};
       // System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark").setMaster("local");
                //.setMaster("spark://10.1.50.165:7077")
                //.setJars(jars);

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        //SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "forecaster.txt" ;
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {

            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {

            public Boolean call(String s) { return s.contains("b"); }
            }).count();

         System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }

}
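With the test file above, the run should print something like Lines with a: 6, lines with b: 2, since six of the lines contain an "a" and two contain a "b".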

Spark needs a URI scheme and a proper path in order to understand how to read the file. So if you are reading from HDFS, you should use:

jsc.textFile("hdfs://host/path/to/hdfs/file/input.txt");

If you are reading a local file (local to the worker nodes, not the machine the driver is running on), you should use:

jsc.textFile("file://path/to/hdfs/file/input.txt");

For reading a Hadoop Archive File (HAR), you should use:

jsc.textFile("har://archive/path/to/hdfs/file/input.txt");

And so on.
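A minimal sketch of how this can be used (the class name, argument handling and local[*] master are just assumptions for illustration, not part of the original code) is to take the fully qualified URI as a program argument, so the same job can read from hdfs://, file:// or har:// paths:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountLines {

    public static void main(String[] args) {
        // Expect the fully qualified input URI as the first argument,
        // e.g. hdfs://host/path/input.txt or file:///C:/data/input.txt
        String inputUri = args[0];

        SparkConf conf = new SparkConf().setAppName("CountLines").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // textFile picks the right file system implementation from the URI scheme
        JavaRDD<String> lines = jsc.textFile(inputUri);
        System.out.println("Number of lines: " + lines.count());

        jsc.stop();
        jsc.close();
    }
}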
