
Read csv file from Hadoop using Spark

I'm using spark-shell to read csv files from HDFS. I can read the csv file using the following command in bash:

bin/hadoop fs -cat /input/housing.csv | tail -5

so this suggests that housing.csv is indeed in HDFS right now. How can I read it using spark-shell? Thanks in advance.

sc.textFile("hdfs://input/housing.csv").first()

I tried this, but it failed.
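The call fails because in hdfs://input/housing.csv the segment "input" is parsed as the NameNode host, not as a directory. A minimal sketch of the corrected RDD call, assuming fs.defaultFS in core-site.xml points at your cluster:

// With the scheme omitted, the path resolves against fs.defaultFS
sc.textFile("/input/housing.csv").first()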

Include the csv package in the shell and run:

val df = spark.read.format("csv").option("header", "true").load("hdfs://x.x.x.x:8020/folder/file.csv")

8020 is the default HDFS NameNode port.
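Once it loads, you can sanity-check the result; a minimal usage sketch with the df from above:

df.printSchema()  // with option("header", "true") the column names come from the first row
df.show(5)        // print the first five rows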

Thanks, Ash

You can read this easily with Spark using the csv method or by specifying format("csv"). In your case, either leave out the hdfs:// scheme or specify the complete path, hdfs://localhost:8020/input/housing.csv.

Here is a snippet of code that can read the csv file:

val df = spark.
        read.
        schema(dataSchema).   // dataSchema: a StructType describing the file's columns
        csv("/input/housing.csv")
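Note that dataSchema must be defined before this runs. A minimal sketch, with hypothetical field names (replace them with the actual columns of housing.csv):

import org.apache.spark.sql.types._

// hypothetical fields, for illustration only
val dataSchema = StructType(Seq(
  StructField("median_age", DoubleType, nullable = true),
  StructField("median_house_value", DoubleType, nullable = true)
))

If you would rather not spell out a schema, Spark can also infer one: spark.read.option("header", "true").option("inferSchema", "true").csv("/input/housing.csv").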
