
Spark & Scala: Read in CSV file as DataFrame / Dataset

Coming from the R world, I want to import a .csv file into Spark (v1.6.1) using the Scala shell (./spark-shell).

My .csv has a header and looks like this:

"col1","col2","col3"
1.4,"abc",91
1.3,"def",105
1.35,"gh1",104

Thanks.

Spark 2.0+

Since databricks/spark-csv has been integrated into Spark, reading .csv files is straightforward using the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
   .master("local")
   .appName("Word Count")
   .getOrCreate()
val df = spark.read.option("header", true).csv(path)
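Schema inference works the same way in Spark 2.0+ as it did with spark-csv. A minimal sketch, assuming a local file containing the sample data above (the path and app name are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a local session; master/appName values are illustrative
val spark = SparkSession.builder()
  .master("local")
  .appName("csv-example")
  .getOrCreate()

// "inferSchema" makes Spark scan the file and guess column types,
// so col1 becomes double and col3 integer instead of plain strings
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/mydata.csv") // hypothetical path

df.printSchema()
```

Without inferSchema, every column is read as a string; with it, Spark makes an extra pass over the data to pick the narrowest matching type.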

Older versions

After restarting my spark-shell I figured it out myself; this may help others:

After installing as described here and starting the spark-shell with ./spark-shell --packages com.databricks:spark-csv_2.11:1.4.0:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val df = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/home/vb/opt/spark/data/mllib/mydata.csv")
scala> df.printSchema()
root
 |-- col1: double (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: integer (nullable = true)
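Since the question also asks for a Dataset: once the schema matches the one printed above, the DataFrame can be converted to a typed Dataset with a case class. A sketch, assuming Spark 2.0+ (the case class name and file path are placeholders):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Field names and types must match the inferred schema shown above
case class Row3(col1: Double, col2: String, col3: Int)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._ // brings in the encoders needed by .as[...]

val ds: Dataset[Row3] = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/mydata.csv") // hypothetical path
  .as[Row3]
```

The .as[Row3] call matches columns to case-class fields by name, so a typo in a field name fails at conversion time rather than silently producing nulls.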
