
Spark - load CSV file as DataFrame?

I would like to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name")

I have tried:

scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")

The error I got:

java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
    at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
    at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
    at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

What is the right command to load a CSV file as a DataFrame in Apache Spark?

spark-csv functionality is part of core Spark as of Spark 2.x and doesn't require a separate library. So you could just do, for example:

df = spark.read.format("csv").option("header", "true").load("csvfile.csv")

In Scala (this works for any delimiter: mention "," for CSV, "\t" for TSV, etc.):

val df = sqlContext.read.format("com.databricks.spark.csv")
    .option("delimiter", ",")
    .load("csvfile.csv")

Parse CSV and load as DataFrame/DataSet with Spark 2.x

First, initialize a SparkSession object; by default it will be available in shells as spark.

val spark = org.apache.spark.sql.SparkSession.builder
        .master("local") // Change it as per your cluster
        .appName("Spark CSV Reader")
        .getOrCreate()

Use any one of the following ways to load CSV as DataFrame/DataSet

1. Do it in a programmatic way

 val df = spark.read
         .format("csv")
         .option("header", "true") //first line in file has headers
         .option("mode", "DROPMALFORMED")
         .load("hdfs:///csv/file/dir/file.csv")

Update: Adding all options from here in case the link is broken in the future (a combined example follows the list):

  • path : location of files. Like Spark, it can accept standard Hadoop globbing expressions.
  • header : when set to true, the first line of the files will be used to name columns and will not be included in the data. All types will be assumed to be string. The default value is false.
  • delimiter : by default columns are delimited using a comma (,), but the delimiter can be set to any character.
  • quote : by default the quote character is ", but it can be set to any character. Delimiters inside quotes are ignored.
  • escape : by default the escape character is \, but it can be set to any character. Escaped quote characters are ignored.
  • parserLib : by default it is "commons", but it can be set to "univocity" to use that library for CSV parsing.
  • mode : determines the parsing mode. By default it is PERMISSIVE. Possible values are:
    • PERMISSIVE : tries to parse all lines: nulls are inserted for missing tokens and extra tokens are ignored.
    • DROPMALFORMED : drops lines which have fewer or more tokens than expected, or tokens which do not match the schema.
    • FAILFAST : aborts with a RuntimeException if it encounters any malformed line.
  • charset : defaults to 'UTF-8', but can be set to other valid charset names.
  • inferSchema : automatically infers column types. It requires one extra pass over the data and is false by default.
  • comment : skip lines beginning with this character. Default is "#". Disable comments by setting this to null.
  • nullValue : specifies a string that indicates a null value; any fields matching this string will be set as nulls in the DataFrame.
  • dateFormat : specifies a string that indicates the date format to use when reading dates or timestamps. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to both DateType and TimestampType. By default it is null, which means dates and times are parsed by java.sql.Timestamp.valueOf() and java.sql.Date.valueOf().
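
A minimal sketch combining several of these options (assuming a Spark 2.x SparkSession named spark; the path, delimiter, nullValue and dateFormat values below are illustrative, not taken from the question):

 val df = spark.read
         .format("csv")
         .option("header", "true")          // first line holds the column names
         .option("delimiter", ";")          // assumed semicolon-separated file
         .option("quote", "\"")             // default quote character, shown explicitly
         .option("escape", "\\")            // default escape character, shown explicitly
         .option("mode", "DROPMALFORMED")   // silently drop malformed lines
         .option("nullValue", "NA")         // treat the string "NA" as null
         .option("dateFormat", "yyyy-MM-dd")
         .option("inferSchema", "true")     // one extra pass over the data
         .load("hdfs:///csv/file/dir/file.csv")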

2. You can do it the SQL way as well

 val df = spark.sql("SELECT * FROM csv.`hdfs:///csv/file/dir/file.csv`")

Dependencies:

 "org.apache.spark" % "spark-core_2.11" % 2.0.0,
 "org.apache.spark" % "spark-sql_2.11" % 2.0.0,

Spark version < 2.0

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") 
    .option("mode", "DROPMALFORMED")
    .load("csv/file/path"); 

Dependencies:

"org.apache.spark" % "spark-sql_2.10" % 1.6.0,
"com.databricks" % "spark-csv_2.10" % 1.6.0,
"com.univocity" % "univocity-parsers" % LATEST,

This is for Hadoop 2.6 and Spark 1.6, without the "databricks" package.

import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType};
import org.apache.spark.sql.Row;

val csv = sc.textFile("/path/to/file.csv")
val rows = csv.map(line => line.split(",").map(_.trim))   // split each line on commas
val header = rows.first
val data = rows.filter(_(0) != header(0))                 // drop the header row
val rdd = data.map(row => Row(row(0), row(1).toInt))

val schema = new StructType()
    .add(StructField("id", StringType, true))
    .add(StructField("val", IntegerType, true))

val df = sqlContext.createDataFrame(rdd, schema)
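
To get back to the original goal of registering the result as a temp table, you could then do, for example (Spark 1.x API; the table name is just the one from the question):

df.registerTempTable("table_name")
sqlContext.sql("SELECT id, val FROM table_name").show()   // columns come from the schema above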

With Spark 2.0, the following is how you can read a CSV:

val conf = new SparkConf().setMaster("local[2]").setAppName("my app")
val sc = new SparkContext(conf)
val sparkSession = SparkSession.builder
  .config(conf = conf)
  .appName("spark session example")
  .getOrCreate()

val path = "/Users/xxx/Downloads/usermsg.csv"
val base_df = sparkSession.read.option("header","true").
  csv(path)

In Java 1.8, this code snippet works perfectly for reading CSV files.

POM.xml

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.11</artifactId>
    <version>1.4.0</version>
</dependency>

Java

SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local");
// create Spark Context
SparkContext context = new SparkContext(conf);
// create spark Session
SparkSession sparkSession = new SparkSession(context);

Dataset<Row> df = sparkSession.read().format("com.databricks.spark.csv").option("header", true).option("inferSchema", true).load("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");

        //("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");
System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print title ==============");
df.select("title").show();

There are a lot of challenges to parsing a CSV file, and they keep adding up as the file gets bigger or when there are non-English/escape/separator/other characters in the column values; these can cause parsing errors.

The magic then is in the options that are used. The ones that worked for me, and that I hope cover most of the edge cases, are in the code below:

### Create a Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("Classify Urls").getOrCreate()

### Note the options that are used. You may have to tweak these in case of error
html_df = spark.read.csv(html_csv_file_path, 
                         header=True, 
                         multiLine=True, 
                         ignoreLeadingWhiteSpace=True, 
                         ignoreTrailingWhiteSpace=True, 
                         encoding="UTF-8",
                         sep=',',
                         quote='"', 
                         escape='"',
                         maxColumns=2,
                         inferSchema=True)

Hope that helps. For more, refer to: Using PySpark 2 to read CSV having HTML source code

Note: The code above is from the Spark 2 API, where the CSV file reading API comes bundled with the built-in packages of the Spark installable.

Note: PySpark is a Python wrapper for Spark and shares the same API as Scala/Java.

Penny's Spark 2 example is the way to do it in Spark 2. There's one more trick: have that header generated for you by doing an initial scan of the data, by setting the option inferSchema to true.

Here, then, assuming that spark is a Spark session you have set up, is the operation to load the CSV index file of all the Landsat images which Amazon hosts on S3.

  /*
   * Licensed to the Apache Software Foundation (ASF) under one or more
   * contributor license agreements.  See the NOTICE file distributed with
   * this work for additional information regarding copyright ownership.
   * The ASF licenses this file to You under the Apache License, Version 2.0
   * (the "License"); you may not use this file except in compliance with
   * the License.  You may obtain a copy of the License at
   *
   *    http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */

val csvdata = spark.read.options(Map(
    "header" -> "true",
    "ignoreLeadingWhiteSpace" -> "true",
    "ignoreTrailingWhiteSpace" -> "true",
    "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
    "inferSchema" -> "true",
    "mode" -> "FAILFAST"))
  .csv("s3a://landsat-pds/scene_list.gz")

The bad news is: this triggers a scan through the file; for something large like this 20+ MB zipped CSV file, that can take 30s over a long-haul connection. Bear that in mind: you are better off manually coding up the schema once you've got it coming in (a sketch of that follows below).

(Code snippet licensed under the Apache Software License 2.0 to avoid all ambiguity; something I've done as a demo/integration test of S3 integration.)
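
A minimal sketch of that manual-schema approach, assuming the same spark session (the column names and types here are hypothetical placeholders, not the actual scene_list.gz schema):

import org.apache.spark.sql.types._

// Hand-coded schema: avoids the extra pass over the file that inferSchema needs.
// The column names and types below are illustrative placeholders only.
val sceneSchema = StructType(Seq(
  StructField("entityId", StringType, nullable = true),
  StructField("acquisitionDate", TimestampType, nullable = true),
  StructField("cloudCover", DoubleType, nullable = true)
))

val csvWithSchema = spark.read
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(sceneSchema)
  .csv("s3a://landsat-pds/scene_list.gz")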

In case you are building a jar with Scala 2.11 and Apache Spark 2.0 or higher:

There is no need to create a sqlContext or sparkContext object. A single SparkSession object suffices for all needs.

Following is my code, which works fine:

import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}
import org.apache.log4j.{Level, LogManager, Logger}

object driver {

  def main(args: Array[String]) {

    val log = LogManager.getRootLogger

    log.info("**********JAR EXECUTION STARTED**********")

    val spark = SparkSession.builder().master("local").appName("ValidationFrameWork").getOrCreate()
    val df = spark.read.format("csv")
      .option("header", "true")
      .option("delimiter","|")
      .option("inferSchema","true")
      .load("d:/small_projects/spark/test.pos")
    df.show()
  }
}

In case you are running in a cluster, just change .master("local") to .master("yarn") while defining the SparkSession builder object.

The Spark docs cover this: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html

Add the following Spark dependencies to the POM file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
</dependency>

Spark configuration:

val spark = SparkSession.builder().master("local").appName("Sample App").getOrCreate()

Read the CSV file:

val df = spark.read.option("header", "true").csv("FILE_PATH")

Display output:

df.show()

With Spark 2.4+, if you want to load a CSV from a local directory, you can use two sessions and load it into Hive. The first session should be created with the master() config set to "local[*]" and the second session with "yarn" and Hive support enabled.

The one below worked for me.

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.sql._

object testCSV { 

  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark_local = SparkSession.builder().appName("CSV local files reader").master("local[*]").getOrCreate()

    import spark_local.implicits._
    spark_local.sql("SET").show(100,false)
    val local_path="/tmp/data/spend_diversity.csv"  // Local file
    val df_local = spark_local.read.format("csv").option("inferSchema","true").load("file://"+local_path) // "file://" is mandatory
    df_local.show(false)

    val spark = SparkSession.builder().appName("CSV HDFS").config("spark.sql.warehouse.dir", "/apps/hive/warehouse").enableHiveSupport().getOrCreate()

    import spark.implicits._
    spark.sql("SET").show(100,false)
    val df = df_local
    df.createOrReplaceTempView("lcsv")
    spark.sql(" drop table if exists work.local_csv ")
    spark.sql(" create table work.local_csv as select * from lcsv ")

  }
}

When run with spark2-submit --master "yarn" --conf spark.ui.enabled=false testCSV.jar, it went fine and created the table in Hive.

To read from a relative path on the system, use the System.getProperty method to get the current directory, and then load the file using the relative path:

scala> val path = System.getProperty("user.dir").concat("/../2015-summary.csv")
scala> val csvDf = spark.read.option("inferSchema","true").option("header", "true").csv(path)
scala> csvDf.take(3)

spark: 2.4.4, scala: 2.11.12

The default file format for spark.read is Parquet, while the file you are reading is CSV, which is why you are getting the exception. Specify the CSV format with the API you are trying to use:

Loads a CSV file and returns the result as a DataFrame.

val df = sparkSession.read.option("header", "true").csv("file_name.csv")

The reader then treats the file as CSV format.

Try this if using Spark 2.0+:

For a non-HDFS file:
df = spark.read.csv("file:///csvfile.csv")

For an HDFS file:
df = spark.read.csv("hdfs:///csvfile.csv")

For an HDFS file (with a different delimiter than comma):
df = spark.read.option("delimiter", "|").csv("hdfs:///csvfile.csv")

Note: this works for any delimited file. Just use option("delimiter", ...) to change the value.

Hope this is helpful.

With the built-in Spark CSV reader, you can get it done easily with the new SparkSession object for Spark > 2.0.

val df = spark.
        read.
        option("inferSchema", "false").
        option("header","true").
        option("mode","DROPMALFORMED").
        option("delimiter", ";").
        schema(dataSchema).
        csv("/csv/file/dir/file.csv")
df.show()
df.printSchema()

There are various options you can set.

  • header : whether your file includes a header line at the top
  • inferSchema : whether you want to infer the schema automatically or not. Default is false. I always prefer to provide a schema to ensure proper datatypes (see the sketch after this list).
  • mode : parsing mode: PERMISSIVE, DROPMALFORMED or FAILFAST
  • delimiter : to specify the delimiter; default is comma (',')
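
The schema(dataSchema) call in the snippet above assumes you have defined dataSchema yourself; a minimal sketch of such a definition (with made-up column names) might be:

import org.apache.spark.sql.types._

// Hypothetical dataSchema for a semicolon-delimited file with three columns;
// adjust the names and types to match your actual file.
val dataSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))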
