
Conversion from string to Date in Spark Scala

The question has been reframed with more details.

I have a DataFrame "dailyshow". Its schema is:

scala> dailyshow.printSchema
root
 |-- year: integer (nullable = true)
 |-- occupation: string (nullable = true)
 |-- showdate: string (nullable = true)
 |-- group: string (nullable = true)
 |-- guest: string (nullable = true)

Sample Data is:

scala> dailyshow.show(5)
+----+------------------+---------+------+----------------+
|year|        occupation| showdate| group|           guest|
+----+------------------+---------+------+----------------+
|1999|             actor|1/11/1999|Acting|  Michael J. Fox|
|1999|          Comedian|1/12/1999|Comedy| Sandra Bernhard|
|1999|television actress|1/13/1999|Acting|   Tracey Ullman|
|1999|      film actress|1/14/1999|Acting|Gillian Anderson|
|1999|             actor|1/18/1999|Acting|David Alan Grier|
+----+------------------+---------+------+----------------+

The code below transforms the data and returns the top 5 occupations in the period "01/11/1999" to "06/11/1999":

scala> dailyshow.
    withColumn("showdate",to_date(unix_timestamp(col("showdate"),"MM/dd/yyyy").
    cast("timestamp"))).
    where((col("showdate") >= "1999-01-11") and (col("showdate") <= "1999-06-11")).
    groupBy(col("occupation")).agg(count("*").alias("count")).
    orderBy(desc("count")).
    limit(5).show
        +------------------+-----+                                                      
        |        occupation|count|
        +------------------+-----+
        |             actor|   29|
        |           actress|   20|
        |          comedian|    4|
        |television actress|    3|
        | stand-up comedian|    2|
        +------------------+-----+

My question is: how do I write this and get the same result using RDDs?

scala> dailyshow.first
res12: org.apache.spark.sql.Row = [1999,actor,1/11/1999,Acting,Michael J. Fox]

I used SimpleDateFormat to parse the string to a date in the DataFrame.

Below is the code:

val format = new java.text.SimpleDateFormat("MM/dd/yyyy")

dailyshow.
  map(x => x.mkString(",")).
  map(x => x.split(",")).
  map(x => format.parse(x(2))).first // returns Mon Jan 11 00:00:00 PST 1999
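
As a side note, the mkString/split round trip is not needed; the same parse can read the field straight from the Row. A small sketch, assuming the same "dailyshow" DataFrame as above (showdate is the third column):

val format = new java.text.SimpleDateFormat("MM/dd/yyyy")

dailyshow.rdd.
  map(row => format.parse(row.getString(2))).
  first // returns Mon Jan 11 00:00:00 PST 1999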

If I were you, I would use Spark's built-in date functions, as defined in org.apache.spark.sql.functions, instead of doing it manually with SimpleDateFormat and mapping. Using DataFrame functions is simpler, more idiomatic, less error prone, and performs much better.

Let's assume you have a DataFrame df with a column called dateString that contains date strings in the format MM/dd/yyyy.

Let's also assume you want to convert it to a date in order to extract the year, and then display it in the format yyyy.MMMMM.dd.

What you can do is:

// parse the MM/dd/yyyy strings the same way the question does
val dfWithDate = df.withColumn("date", to_date(unix_timestamp($"dateString", "MM/dd/yyyy").cast("timestamp")))
val dfWithYear = dfWithDate.withColumn("year", year($"date"))
val dfWithOutput = dfWithYear.withColumn("dateOutput", date_format($"date", "yyyy.MMMMM.dd"))

Now the year column contains the year, and the dateOutput column contains the string representation in your format.
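
For completeness, here is a minimal, self-contained sketch of the same three steps; the local SparkSession and the sample dates are purely illustrative, not part of the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative local session; in a real job you would reuse the one you already have.
val spark = SparkSession.builder().master("local[*]").appName("date-demo").getOrCreate()
import spark.implicits._

// Two made-up MM/dd/yyyy strings, just to exercise the pipeline.
val df = Seq("01/11/1999", "06/11/1999").toDF("dateString")

val dfWithOutput = df.
  withColumn("date", to_date(unix_timestamp($"dateString", "MM/dd/yyyy").cast("timestamp"))).
  withColumn("year", year($"date")).
  withColumn("dateOutput", date_format($"date", "yyyy.MMMMM.dd"))

// date is a proper DateType column, year an integer, and dateOutput a formatted string.
dfWithOutput.show(false)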

Got a lot of deprecation warnings while writing this :D

So we have this data in an RDD:

val rdd = sc.parallelize(Array(
     Array("1999","actor","1/11/1999","Acting","  Michael J. Fox"),
     Array("1999","Comedian","1/12/1999","Comedy"," Sandra Bernhard"),
     Array("1999","television actress","1/13/1999","Acting","Tracey Ullman"),
     Array("1999","film actress","1/14/1999","Acting","Gillian Anderson"),
     Array("1999","actor","1/18/1999","Acting","David Alan Grier")))

Then, as per your question, we filter on the date:

// "format" is the SimpleDateFormat("MM/dd/yyyy") defined earlier
val filtered = rdd.filter { x =>
    format.parse(x(2)).after(new java.util.Date("01/10/1999")) &&
    format.parse(x(2)).before(new java.util.Date("01/14/1999"))
}

Then we get this:

Array[Array[String]] = Array(
Array(1999, actor, 1/11/1999, Acting, "  Michael J. Fox"), 
Array(1999, Comedian, 1/12/1999, Comedy, " Sandra Bernhard"), 
Array(1999, television actress, 1/13/1999, Acting, Tracey Ullman))
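
The deprecated java.util.Date(String) constructor is what produces the warnings mentioned above; as a sketch (the names lower, upper, and filteredAlt are illustrative), the bounds can be parsed with the same SimpleDateFormat instead:

val lower = format.parse("01/10/1999")
val upper = format.parse("01/14/1999")

val filteredAlt = rdd.filter { x =>
  val d = format.parse(x(2))
  d.after(lower) && d.before(upper)
}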

Then we key them by the second element (the occupation) and count the number of occurrences:

filtered.keyBy(x => x(1)).mapValues(_ => 1).reduceByKey(_ + _).collect

If everything goes right, you should get:

Array[(String, Int)] = Array((television actress,1), (Comedian,1), (actor,1))
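
To mirror the DataFrame result (the top 5 occupations by count) on the full dataset, one possible continuation of this RDD pipeline, shown here as a sketch rather than part of the original answer, sorts by count in descending order and takes the first five:

// Count per occupation, sort by count descending, keep the top 5.
val top5 = filtered.
  keyBy(x => x(1)).
  mapValues(_ => 1).
  reduceByKey(_ + _).
  sortBy(_._2, ascending = false).
  take(5)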
