
Apache Spark cannot find class CSVReader

My code to parse a simple CSV file looks like this:

import java.io.StringReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.opencsv.CSVReader;

SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");

JavaRDD<String[]> parsed = csv.map(x -> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(x));

However, the Spark job ends with a ClassNotFoundException saying that CSVReader cannot be found. My pom.xml looks like this:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

How do I fix this?

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
Source: http://spark.apache.org/docs/latest/submitting-applications.html
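For reference, once such an assembly jar is built (e.g. with mvn package), it is typically run with something like spark-submit --class com.example.Main --master local target/your-app.jar; the main class and jar name here are placeholders, not taken from the question.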

Maven does not ship dependency JARs when it packages the project into a JAR. To ship the dependency JARs along, I added the Maven Shade plugin.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>${project.artifactId}-${project.version}</finalName>
    </configuration>
</plugin>  
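One caveat worth noting: by default the shade plugin only bundles compile- and runtime-scope dependencies, so with opencsv declared as provided (as in the pom above) it would still be left out of the uber jar. A minimal sketch of that dependency with the provided scope removed:

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>3.8</version>
</dependency>

With that change, mvn package should produce a jar under target/ that actually contains the CSVReader class, and that jar can then be submitted to Spark.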

also see: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?
