
Convert RDD[List[AnyRef]] to RDD[List[String, Date, String, String]]

I want to set the return type of my RDD, but it is RDD[List[AnyRef]], so I am not able to specify anything directly. For example:

val rdd2 = rdd1.filter(! _.isEmpty).filter(x => x(0) != null)

This should return an RDD of type RDD[List[String, Date, String, String]], but it is still RDD[List[AnyRef]].

EDIT

rdd1:
List(Sun Jul 31 10:21:53 PDT 2016, pm1, 11, ri1)
List(Mon Aug 01 12:57:09 PDT 2016, pm3, 5, ri1)
List(Mon Aug 01 01:11:16 PDT 2016, pm1, 1, ri2)

This rdd1 is of type RDD[List[AnyRef]].

Now I want rdd2 to be of this type:

RDD[List[Date, String, Long, String]]

The reason is that I am facing issues with the date while converting the RDD to a DataFrame using a schema. To deal with that, I first have to fix the RDD type. That problem's solution is: Spark rdd correct date format in scala?

Here is a small example which leads to the same problem (I omitted Date and replaced it with String; that's not the point):

val myRdd = sc.makeRDD(List(
  List[AnyRef]("date 1", "blah2", (11: java.lang.Integer), "baz1"),
  List[AnyRef]("date 2", "blah3", (5: java.lang.Integer),  "baz2"),
  List[AnyRef]("date 3", "blah4", (1: java.lang.Integer),  "baz3") 
))

// myRdd: org.apache.spark.rdd.RDD[List[AnyRef]] = ParallelCollectionRDD[0]

Here is how you can recover the types:

val unmessTypes = myRdd.map{
  case List(a: String, b: String, c: java.lang.Integer, d: String) => (a, b, (c: Int), d)
}

// unmessTypes: org.apache.spark.rdd.RDD[(String, String, Int, String)] = MapPartitionsRDD[1]

You simply apply a partial function that matches lists of length 4 with elements of the specified types, and constructs a tuple of the expected type out of them. If your RDD indeed contains only lists of length 4 with the expected types, the partial function will never fail.
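If some rows might not match, a minimal sketch on plain Scala lists (no Spark needed; `RDD` also has a `collect(PartialFunction)` overload that behaves analogously) shows the difference between `map`, which would throw a MatchError on a bad row, and `collect`, which silently drops it. The malformed row here is an assumption for illustration:

```scala
// Plain-Scala sketch: the same partial function applied with `collect`,
// which keeps only the rows that match the pattern.
val rows: List[List[AnyRef]] = List(
  List[AnyRef]("date 1", "blah2", (11: java.lang.Integer), "baz1"),
  List[AnyRef]("oops"), // hypothetical malformed row: wrong length
  List[AnyRef]("date 2", "blah3", (5: java.lang.Integer), "baz2")
)

val typed: List[(String, String, Int, String)] = rows.collect {
  case List(a: String, b: String, c: java.lang.Integer, d: String) =>
    (a, b, c.intValue, d)
}

println(typed)
// List((date 1,blah2,11,baz1), (date 2,blah3,5,baz2))
```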

Looking at your Spark rdd correct date format in scala?, it seems that you are having an issue converting your rdd to a dataframe. Tzach has already answered it correctly: convert the java.util.Date to java.sql.Date, and that should solve your issue.

First of all, a List cannot have a separate dataType for each element in the list, as we do have for Tuples. A List has only one dataType, and if mixed dataTypes are used, the dataType of the list is inferred as Any or AnyRef.
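A short sketch of that difference (the dummy values are assumptions, not your data): with a List every element is statically AnyRef and needs a cast to recover its type, while a tuple keeps one static type per position.

```scala
// A mixed List erases per-element types to AnyRef:
val row: List[AnyRef] =
  List(new java.util.Date(0L), "pm1", (11L: java.lang.Long), "ri1")
// Getting the Date back requires a cast (or a pattern match):
val d: java.util.Date = row.head.asInstanceOf[java.util.Date]

// A tuple keeps a distinct static type for each position:
val tup: (java.util.Date, String, Long, String) =
  (new java.util.Date(0L), "pm1", 11L, "ri1")
val d2: java.util.Date = tup._1 // no cast needed
```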

I guess you must have created the data as below:

import java.text.SimpleDateFormat
import java.util.Locale

val list = List(
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Sun Jul 31 10:21:53 PDT 2016"), "pm1", 11L: java.lang.Long, "ri1"),
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Mon Aug 01 12:57:09 PDT 2016"), "pm3", 5L: java.lang.Long, "ri1"),
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Mon Aug 01 01:11:16 PDT 2016"), "pm1", 1L: java.lang.Long, "ri2")
)

val rdd1 = spark.sparkContext.parallelize(list)

which would give

rdd1: org.apache.spark.rdd.RDD[List[AnyRef]]

but in fact its real datatypes are [java.util.Date, String, java.lang.Long, String].
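You can confirm this by inspecting the runtime classes hidden behind AnyRef; a small sketch, assuming the row was built with SimpleDateFormat as above:

```scala
import java.text.SimpleDateFormat
import java.util.Locale

// Build one row the same way as the data above, then look at the
// runtime class of each element.
val fmt = new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH)
val row = List[AnyRef](fmt.parse("Sun Jul 31 10:21:53 PDT 2016"),
                       "pm1", (11L: java.lang.Long), "ri1")

println(row.map(_.getClass.getSimpleName))
// List(Date, String, Long, String)
```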

And looking at your other question, you must be having a problem converting the rdd to a dataframe with the following schema:

import org.apache.spark.sql.types._

val schema =
  StructType(
    StructField("lotStartDate", DateType, false) ::
      StructField("pm", StringType, false) ::
      StructField("wc", LongType, false) ::
      StructField("ri", StringType, false) :: Nil)

What you can do is utilize the java.sql.Date api, as answered in your other question, and then create the dataframe as:

import org.apache.spark.sql.Row

val rdd1 = sc.parallelize(list).map(lis => Row.fromSeq(new java.sql.Date((lis.head.asInstanceOf[java.util.Date]).getTime) :: lis.tail))
val df = sqlContext.createDataFrame(rdd1, schema)

which should give you

+------------+---+---+---+
|lotStartDate|pm |wc |ri |
+------------+---+---+---+
|2016-07-31  |pm1|11 |ri1|
|2016-08-02  |pm3|5  |ri1|
|2016-08-01  |pm1|1  |ri2|
+------------+---+---+---+

I hope the answer is helpful.
