Conversion from string to Date in Spark Scala
The question has been reframed to give more details.
I have a DataFrame "dailyshow". Its schema is:
scala> dailyshow.printSchema
root
|-- year: integer (nullable = true)
|-- occupation: string (nullable = true)
|-- showdate: string (nullable = true)
|-- group: string (nullable = true)
|-- guest: string (nullable = true)
Sample data:
scala> dailyshow.show(5)
+----+------------------+---------+------+----------------+
|year| occupation| showdate| group| guest|
+----+------------------+---------+------+----------------+
|1999| actor|1/11/1999|Acting| Michael J. Fox|
|1999| Comedian|1/12/1999|Comedy| Sandra Bernhard|
|1999|television actress|1/13/1999|Acting| Tracey Ullman|
|1999| film actress|1/14/1999|Acting|Gillian Anderson|
|1999| actor|1/18/1999|Acting|David Alan Grier|
+----+------------------+---------+------+----------------+
The code below transforms the data and returns the top 5 occupations for the period between "01/11/1999" and "06/11/1999":
scala> dailyshow.
withColumn("showdate",to_date(unix_timestamp(col("showdate"),"MM/dd/yyyy").
cast("timestamp"))).
where((col("showdate") >= "1999-01-11") and (col("showdate") <= "1999-06-11")).
groupBy(col("occupation")).agg(count("*").alias("count")).
orderBy(desc("count")).
limit(5).show
+------------------+-----+
| occupation|count|
+------------------+-----+
| actor| 29|
| actress| 20|
| comedian| 4|
|television actress| 3|
| stand-up comedian| 2|
+------------------+-----+
My question is: how do I code this and get the same result when using an RDD?
scala> dailyshow.first
res12: org.apache.spark.sql.Row = [1999,actor,1/11/1999,Acting,Michael J. Fox]
I used SimpleDateFormat to parse the string to a date in the DataFrame.
Below is the code:
val format = new java.text.SimpleDateFormat("MM/dd/yyyy")
dailyshow.
map(x => x.mkString(",")).
map(x => x.split(",")).
map(x => format.parse(x(2))).first // returns Mon Jan 11 00:00:00 PST 1999
If I were you, I would use Spark's built-in date functions defined in org.apache.spark.sql.functions instead of doing it manually with SimpleDateFormat and map. DataFrame functions are simpler, more idiomatic, less error-prone, and perform better.
Let's assume you have a DataFrame df with a column called dateString that contains a date string in the format MM/dd/yyyy.
Let's also assume you want to convert it to a date in order to extract the year, and then display it in the format yyyy.MMMMM.dd.
What you can do is:
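import org.apache.spark.sql.functions.{date_format, to_date, unix_timestamp, year}
// Parse MM/dd/yyyy explicitly; to_date without a format expects ISO-style (yyyy-MM-dd) strings
val dfWithDate = df.withColumn("date", to_date(unix_timestamp($"dateString", "MM/dd/yyyy").cast("timestamp")))
val dfWithYear = dfWithDate.withColumn("year", year($"date"))
val dfWithOutput = dfWithYear.withColumn("dateOutput", date_format($"date", "yyyy.MMMMM.dd"))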
Now the year column contains the year, and the dateOutput column contains the string representation in your format.
I got a lot of deprecation warnings while writing this :D
So we have this data in an RDD:
val rdd = sc.parallelize(Array(
Array("1999","actor","1/11/1999","Acting"," Michael J. Fox"),
Array("1999","Comedian","1/12/1999","Comedy"," Sandra Bernhard"),
Array("1999","television actress","1/13/1999","Acting","Tracey Ullman"),
Array("1999","film actress","1/14/1999","Acting","Gillian Anderson"),
Array("1999","actor","1/18/1999","Acting","David Alan Grier")))
Then, as per your question, we filter on date (using the format defined in the question):
val filtered = rdd.filter{ x =>
format.parse(x(2)).after( new java.util.Date("01/10/1999")) &&
format.parse(x(2)).before(new java.util.Date("01/14/1999"))
}
Then we get this:
Array[Array[String]] = Array(
Array(1999, actor, 1/11/1999, Acting, " Michael J. Fox"),
Array(1999, Comedian, 1/12/1999, Comedy, " Sandra Bernhard"),
Array(1999, television actress, 1/13/1999, Acting, Tracey Ullman))
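The filter step above relies on the java.util.Date(String) constructor, which is deprecated and is a likely source of the warnings. A minimal sketch of the same check with java.time follows. The names ShowDates, parse and inRange are hypothetical helpers, not part of Spark or of the original answer, and the range here is inclusive on both ends, unlike the strict after/before comparison above.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper object (not from the original answer).
object ShowDates {
  // "M/d/yyyy" accepts both 1/11/1999 and 01/11/1999.
  // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  // Parse a showdate string such as "1/11/1999" into a LocalDate.
  def parse(s: String): LocalDate = LocalDate.parse(s.trim, fmt)

  // True when the parsed date lies in [from, to], both ends inclusive.
  def inRange(s: String, from: LocalDate, to: LocalDate): Boolean = {
    val d = parse(s)
    !d.isBefore(from) && !d.isAfter(to)
  }
}
```

Assuming the rdd defined earlier, the filter could then be written as rdd.filter(x => ShowDates.inRange(x(2), LocalDate.of(1999, 1, 11), LocalDate.of(1999, 1, 13))).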
Then we group them with the second element as the key and count the number of occurrences:
// Pair each row's occupation with 1, then sum the 1s per occupation
filtered.map(x => (x(1), 1)).reduceByKey(_ + _).collect()
If everything goes right, you should get:
Array[(String, Int)] = Array((television actress,1), (Comedian,1), (actor,1))
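To get all the way to the "top 5 occupations" asked for in the question, the counts still need to be ranked. The sketch below shows the whole filter, count and rank logic on a local Seq so that it runs without a cluster. It is an illustration under assumptions, not the answerer's code: TopOccupations and its top method are hypothetical names, rows are assumed to have the layout of the sample data, and ties are broken alphabetically by occupation. On an RDD, the groupBy and size step would correspond to map(x => (x(1), 1)).reduceByKey(_ + _), and the sort and take step to something like top(5)(Ordering.by(_._2)).

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper (not from the original answer).
object TopOccupations {
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  // rows: Array(year, occupation, showdate, group, guest), as in the sample data.
  // Returns the n most frequent occupations whose showdate lies in [from, to].
  def top(rows: Seq[Array[String]], from: LocalDate, to: LocalDate, n: Int): Seq[(String, Int)] =
    rows
      .map(r => (r(1), LocalDate.parse(r(2).trim, fmt)))             // (occupation, date)
      .filter { case (_, d) => !d.isBefore(from) && !d.isAfter(to) } // inclusive range
      .groupBy(_._1)                                                 // occupation -> matching rows
      .map { case (occ, hits) => (occ, hits.size) }                  // occupation -> count
      .toSeq
      .sortBy { case (occ, cnt) => (-cnt, occ) }                     // count descending, then name
      .take(n)
}
```

For the question's period this would be called as TopOccupations.top(rows, LocalDate.of(1999, 1, 11), LocalDate.of(1999, 6, 11), 5). Note that occupations are compared case-sensitively, so "Comedian" and "comedian" are counted separately unless the key is normalized first.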