Scala - transform and map data
The raw data file I will be reading is tab-delimited, and one of the fields is a timestamp:
timestamp userId keyword
1405377264 A google
1405378945 B yahoo
1405377264 C facebook
I have a case class defined as:
case class Event(date: String, userId: String, keyword: String) // userId is a String: the sample IDs are letters
How do I go about transforming the timestamp to a date format and then mapping it to the Event case class? I have the logic to convert the timestamp to a date:
import java.text.SimpleDateFormat
import java.util.Date
val df = new SimpleDateFormat("yyyy-MM-dd")
val dt = new Date(timestamp*1000L)
val date = df.format(dt)
What is the right way to convert the raw data and map it to the case class?
Thanks!
I don't know if I'd say this is the right way, but one way to read each line would be to use regex extraction. Assuming you already have the data as a string, with each line tab-delimited and lines separated by a line feed (\n):
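val data: String = ...
// Timestamp digits, a single-letter user ID, and a keyword, separated by tabs.
// [A-Za-z] is used because [A-z] would also match punctuation such as '[' and '_'.
val regex = "(\\d+)\t([A-Za-z])\t([A-Za-z]+)".r
data.split('\n').map { line =>
  // Throws a MatchError if the line does not match the pattern (e.g. the header row).
  val regex(timestamp, userId, keyword) = line
  // df is the SimpleDateFormat from the question; this assumes Event.userId is a String.
  Event(df.format(new Date(timestamp.toLong * 1000L)), userId, keyword)
}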
As is, this is not fault tolerant if there are any deviations from the regex (which would have to be tweaked to your needs; I only followed the example above to the letter). If, for example, you wanted to discard the lines that didn't conform, you could use Try and collect:
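import scala.util.{Success, Try}

data.split('\n').map { line =>
  Try {
    // same as above
  }
}.collect {
  // Keep only the lines that parsed; failures are silently dropped.
  case Success(event) => event
}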
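To make the Try and collect idea concrete, here is a minimal self-contained sketch. The sample input, the `RegexParse` object, and the `parse` function name are all made up for illustration, `userId` is assumed to be a String because the sample IDs are letters, and the formatter is pinned to UTC (my assumption) so the output does not depend on the machine's time zone.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import scala.util.{Success, Try}

// Assumption: userId is a String, since the sample IDs are letters.
case class Event(date: String, userId: String, keyword: String)

object RegexParse {
  // Hypothetical sample input: a header row followed by three tab-separated records.
  val data: String =
    "timestamp\tuserId\tkeyword\n" +
      "1405377264\tA\tgoogle\n" +
      "1405378945\tB\tyahoo\n" +
      "1405377264\tC\tfacebook"

  val regex = "(\\d+)\t([A-Za-z])\t([A-Za-z]+)".r

  def parse(input: String): Seq[Event] = {
    val df = new SimpleDateFormat("yyyy-MM-dd")
    df.setTimeZone(TimeZone.getTimeZone("UTC")) // fixed zone for reproducible dates
    input.split('\n').toSeq.map { line =>
      Try {
        // A non-matching line throws a MatchError, which Try captures as a Failure.
        line match {
          case regex(timestamp, userId, keyword) =>
            Event(df.format(new Date(timestamp.toLong * 1000L)), userId, keyword)
        }
      }
    }.collect { case Success(event) => event } // drop the failures
  }
}
```

With this input, `RegexParse.parse(RegexParse.data)` skips the header row (it does not match the pattern) and returns the three events. Wrapping the body in Try also covers a digit string too long for `toLong`, which the regex alone would not reject.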
How about reading the CSV file using scala.io.Source.fromFile("myFile.csv").getLines? This should return an Iterator[String], which is a lazy collection!
You can map over each line to create an Event. But what you want is to convert the timestamp to a java.util.Date as a first step, before you create your Event objects.
I would suggest something along these lines (this may not compile, but it should give you the basic idea):
scala.io.Source.fromFile("myFile.csv").getLines().flatMap { line =>
  splitAtDelimiter(line).toList match {
    case ts :: id :: kw :: Nil =>
      // A timestamp that fails to convert yields None, so the line is dropped.
      val date: Option[String] = scala.util.Try(convertToDateString(ts)).toOption
      // Assumes Event.userId is a String, as in the sample data.
      date.map(Event(_, id, kw)) // returns an Option[Event]
    case _ => None // wrong number of fields
  }
}
where your convertToDateString would be a function which takes the timestamp value, converts it to a java.util.Date, and then does a toString on it (looking at what you need for the date type in the Event case class), and splitAtDelimiter is an imaginary CSV parser function!
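For illustration, the two imaginary helpers could be sketched roughly as follows. Both are hypothetical implementations, not part of any library: the split simply assumes tab-delimited fields with no quoting, and the conversion reuses the SimpleDateFormat pattern from the question rather than Date's own toString, with the time zone pinned to UTC as my own assumption.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

object Helpers {
  // Hypothetical: split one line on tabs; the -1 limit keeps trailing empty fields.
  def splitAtDelimiter(line: String): Array[String] = line.split("\t", -1)

  // Hypothetical: epoch seconds as text -> "yyyy-MM-dd".
  // Throws NumberFormatException for non-numeric input such as the header row.
  def convertToDateString(ts: String): String = {
    // A new formatter per call, because SimpleDateFormat is not thread-safe.
    val df = new SimpleDateFormat("yyyy-MM-dd")
    df.setTimeZone(TimeZone.getTimeZone("UTC")) // assumption: dates are wanted in UTC
    df.format(new Date(ts.trim.toLong * 1000L))
  }
}
```

Because convertToDateString throws on bad input, the caller still needs the try/Option wrapping shown above to skip the header row and malformed lines.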