
RDD[Array[String]] to Dataframe

I am new to Spark and Hive and my goal is to load a delimited file (let's say CSV) into a Hive table. After a bit of reading I found out that the path to load the data into Hive is csv -> dataframe -> Hive. (Please correct me if I am wrong.)

CSV:
1,Alex,70000,Columbus
2,Ryan,80000,New York
3,Johny,90000,Banglore
4,Cook, 65000,Glasgow
5,Starc, 70000,Aus

I read the CSV file using the command below:

val csv =sc.textFile("employee_data.txt").map(line => line.split(",").map(elem => elem.trim))
csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[29] at map at <console>:39

Now I am trying to convert this RDD to a DataFrame using the code below:

scala> val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
df: org.apache.spark.sql.DataFrame = [eid: string, name: string, salary: string, destination: string]

employee is a case class and I am using it as the schema definition.

case class employee(eid: String, name: String, salary: String, destination: String)

However, when I run df.show I get the error below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 22, user.hostname): scala.MatchError: [Ljava.lang.String;@88ba3cb (of class [Ljava.lang.String;)

I was expecting a DataFrame as the output. I think I might be getting this error because the values in the RDD are stored in [Ljava.lang.String;@88ba3cb format and I need to use mkString to get the actual values, but I am not able to find out how to do it. I appreciate your time.

If you fix your case class then it should work:

scala> case class employee(eid: String, name: String, salary: String, destination: String)
defined class employee

scala> val txtRDD = sc.textFile("data.txt").map(line => line.split(",").map(_.trim))
txtRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:24

scala> txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3)}.toDF.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
|  1| Alex| 70000|   Columbus|
|  2| Ryan| 80000|   New York|
|  3|Johny| 90000|   Banglore|
|  4| Cook| 65000|    Glasgow|
|  5|Starc| 70000|        Aus|
+---+-----+------+-----------+

Otherwise you could convert the String to an Int:

scala> case class employee(eid: Int, name: String, salary: String, destination: String)
defined class employee

scala> val df = txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2, s3)}.toDF
df: org.apache.spark.sql.DataFrame = [eid: int, name: string ... 2 more fields]

scala> df.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
|  1| Alex| 70000|   Columbus|
|  2| Ryan| 80000|   New York|
|  3|Johny| 90000|   Banglore|
|  4| Cook| 65000|    Glasgow|
|  5|Starc| 70000|        Aus|
+---+-----+------+-----------+

However, the best solution would be to use spark-csv (which would treat the salary as an Int as well).
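For reference, the same result (salary typed as an Int as well) can also be reached with the RDD approach above; this is just a minimal sketch and not part of the original answer:

case class employee(eid: Int, name: String, salary: Int, destination: String)

// Convert both numeric fields before constructing the case class.
val dfTyped = txtRDD.map { case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2.toInt, s3) }.toDF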

Also note that the error was thrown when you ran df.show because everything was being lazily evaluated up until that point. df.show is an action which will cause all of the queued transformations to execute (see this article for more).
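As a small illustration of that laziness (a hypothetical snippet, not from the original answer): defining the transformation succeeds, and the error only appears once an action runs.

// map is a lazy transformation: nothing executes when this line is evaluated.
val mapped = sc.parallelize(Seq("1,Alex", "oops"))
  .map(s => s.split(",") match { case Array(a, b) => (a, b) })

// collect is an action: only now does the map run, and the MatchError for "oops" surfaces here.
mapped.collect()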

Use map on the array elements, not on the array:

val csv = sc.textFile("employee_data.txt")
    .map(line => line
                     .split(",")
                     .map(e => e.trim)   // trim each field (a String), not each character
     )
val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()

But why are you reading the CSV and then converting the RDD to a DataFrame? Spark 1.5 can already read CSV via the spark-csv package:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") 
    .option("inferSchema", "true") 
    .option("delimiter", ";") 
    .load("employee_data.txt")

As you said in your comment, your case class employee (which, by convention, should be named Employee) receives an Int as the first argument of its constructor, but you are passing a String. Thus, you should either convert it to an Int before instantiating the case class, or modify the case class to define eid as a String.
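A minimal sketch of that suggestion (the capitalized name and the toInt conversion; the variable names are illustrative):

case class Employee(eid: Int, name: String, salary: String, destination: String)

// Convert eid to Int before constructing the case class.
val employees = csv.map { case Array(s0, s1, s2, s3) => Employee(s0.toInt, s1, s2, s3) }.toDF()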
