
Apache Spark cannot find class CSVReader

My code to try and parse a simple csv file looks like this:

import java.io.StringReader;
import java.util.Arrays;
import com.opencsv.CSVReader;

SparkConf conf = new SparkConf().setMaster("local").setAppName("word_count");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> csv = sc.textFile("/home/user/data.csv");

JavaRDD<String[]> parsed = csv.map(x -> new CSVReader(new StringReader(x)).readNext());
parsed.foreach(x -> System.out.println(Arrays.toString(x)));

However, the Spark job fails with a ClassNotFoundException saying that CSVReader cannot be found. My pom.xml looks like this:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

How do I fix this?

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
Source: http://spark.apache.org/docs/latest/submitting-applications.html
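
Applied to the pom above, that means only spark-core should keep the provided scope. opencsv, however, must use the default compile scope, or no packaging plugin will bundle it. A corrected dependency section (same coordinates as in the question):

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>1.1.0</version>
        <!-- provided: the cluster supplies Spark's classes at runtime -->
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>3.8</version>
        <!-- no <scope> element: defaults to compile, so it is bundled into the uber JAR -->
    </dependency>
</dependencies>
```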

Maven does not bundle dependency JARs when it packages the project, and on top of that opencsv is declared with provided scope in the pom above, so it never reaches the executors' classpath at all. Two changes fix this: remove the provided scope from the opencsv dependency (only Spark and Hadoop should be provided, since the cluster supplies them at runtime), and add the Maven Shade plugin so the remaining dependencies are packaged into an uber JAR.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <filters>
            <!-- Strip JAR signature files; stale signatures copied into the
                 uber JAR cause a SecurityException at runtime -->
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
        <finalName>${project.artifactId}-${project.version}</finalName>
    </configuration>
</plugin>  
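
With that in place, mvn package produces the uber JAR under target/, which you can hand to spark-submit. The class and JAR names below are placeholders; substitute your own main class and the artifactId/version from your pom:

```shell
# Build the uber JAR (the shade goal runs during the package phase)
mvn clean package

# Submit it; com.example.WordCount and word_count-1.0.jar are placeholder names
spark-submit \
  --class com.example.WordCount \
  --master local \
  target/word_count-1.0.jar
```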

See also: How to make it easier to deploy my Jar to Spark Cluster in standalone mode?
