
Apache Spark cannot find class CSVReader

My code to parse a simple CSV file looks like this:

import java.io.StringReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.opencsv.CSVReader;

SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");

JavaRDD<String[]> parsed = csv.map(x -> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(x));

However, the Spark job ends with a ClassNotFoundException saying that CSVReader cannot be found. My pom.xml looks like this:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

How do I fix this?

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
Source: http://spark.apache.org/docs/latest/submitting-applications.html
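For reference, once such an assembly jar is built (e.g. with mvn package), it is typically run with something like spark-submit --class com.example.Main --master local target/your-app.jar; the main class and jar name here are placeholders, not taken from the question.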

Maven does not ship dependency JARs when it packages the project into a JAR. To ship the dependency JARs along, I added the Maven Shade plugin.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>${project.artifactId}-${project.version}</finalName>
    </configuration>
</plugin>  
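One caveat worth noting: by default the shade plugin only bundles compile- and runtime-scope dependencies, so with opencsv declared as provided (as in the pom above) it would still be left out of the uber jar. A minimal sketch of that dependency with the provided scope removed:

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>3.8</version>
</dependency>

With that change, mvn package should produce a jar under target/ that actually contains the CSVReader class, and that jar can then be submitted to Spark.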

also see: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?
