
Scala: transform and map data

The raw data file I will be reading is tab-delimited, and one of the fields is a timestamp:

timestamp  userId  keyword
1405377264  A      google
1405378945  B      yahoo
1405377264  C      facebook

I have a case class defined as:

case class Event(date: String, userId: String, keyword: String)

How do I go about transforming the timestamp to a date format and then mapping each row to the Event case class? I have the logic to convert the timestamp to a date:

import java.text.SimpleDateFormat
import java.util.Date

val df = new SimpleDateFormat("yyyy-MM-dd")
val dt = new Date(timestamp*1000L)
val date = df.format(dt) 

What is the right way to convert the raw data and map it to the case class?

Thanks!

I don't know if I'd say this is the right way, but one way to read each line is to use regex extraction. Assuming you already have the data as a string, with tab-delimited fields and lines separated by a line feed (\n):

val data: String = ...
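val regex = "(\\d+)\t([A-Za-z])\t([A-Za-z]+)".r

data.split('\n').map { line =>
    // Extract the three tab-separated fields; throws MatchError if the line does not match
    val regex(timestamp, userId, keyword) = line
    // Assumes Event.userId is a String, since the sample IDs are letters
    Event(df.format(new Date(timestamp.toLong * 1000L)), userId, keyword)
}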

As is, this is not fault tolerant: any line that deviates from the regex throws a MatchError (the regex would have to be tweaked to your needs; I only followed the example above to the letter). If, for example, you wanted to discard the lines that don't conform, you could use Try and collect:

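import scala.util.{Success, Try}

data.split('\n').map { line =>
    Try {
        // A failed pattern match or a bad number is captured as a Failure
        val regex(timestamp, userId, keyword) = line
        Event(df.format(new Date(timestamp.toLong * 1000L)), userId, keyword)
    }
}.collect {
    // Keep only the lines that parsed successfully
    case Success(event) => event
}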
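For reference, here is a self-contained sketch of the Try and collect approach that can be pasted into the REPL or run as a script. It makes several assumptions that are not in the question: userId is typed as a String (the sample IDs are letters), dates are rendered in UTC, java.time is swapped in for SimpleDateFormat, and the names parseEvents, rowPattern and dayFormat are made up for illustration.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.{Success, Try}

// Assumption: userId is a String, because the sample IDs are letters
case class Event(date: String, userId: String, keyword: String)

// Formats an instant as yyyy-MM-dd; UTC is an assumption, pick your own zone
val dayFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

// Three tab-separated fields: digits, letters, letters
val rowPattern = "(\\d+)\t([A-Za-z]+)\t([A-Za-z]+)".r

// Hypothetical helper: parses every line and silently drops the ones that fail
def parseEvents(data: String): Seq[Event] =
  data.split('\n').toSeq.map { row =>
    Try {
      val rowPattern(timestamp, userId, keyword) = row
      Event(dayFormat.format(Instant.ofEpochSecond(timestamp.toLong)), userId, keyword)
    }
  }.collect { case Success(event) => event }

val sample =
  "timestamp\tuserId\tkeyword\n" +
  "1405377264\tA\tgoogle\n" +
  "1405378945\tB\tyahoo\n" +
  "not a valid row\n" +
  "1405377264\tC\tfacebook"

val events = parseEvents(sample)
events.foreach(println)
```

With this sample, the header line and the malformed row do not match the pattern, so only the three data rows become events. Dropping bad lines silently is a design choice; if you need to know what was rejected, keep the Failure values instead of collecting only the Success ones.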

How about reading the file using scala.io.Source.fromFile("myFile.csv").getLines? This returns an Iterator[String], which is evaluated lazily!

You can map over each line to create an Event. But as a first step, before you create your Event objects, you need to convert the timestamp to a java.util.Date.

I would suggest something along these lines (this may not compile, but it should give you the basic idea):

scala.io.Source.fromFile("myFile.csv").getLines().flatMap { line =>
  splitAtDelimiter(line).toList match {
    case ts :: id :: kw :: Nil =>
      // A timestamp that fails to convert yields None, so the line is skipped
      val date: Option[String] = scala.util.Try(convertToDateString(ts)).toOption
      // Assumes Event.userId is a String, since the sample IDs are letters
      date.map(Event(_, id, kw)) // an Option[Event]
    case _ => None // wrong number of columns: skip the line
  }
}

where convertToDateString would be a function that takes the timestamp value, converts it to a java.util.Date and then does a toString on it (check what you need for the date field in the Event case class), and splitAtDelimiter is an imaginary CSV parser function!
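To make the idea concrete, here is one possible sketch of those two helpers. Both are hypothetical: splitAtDelimiter simply splits on tabs rather than being a real CSV parser, and convertToDateString formats the date as yyyy-MM-dd in UTC (reusing the pattern from the question instead of a plain toString). The wrapper name toEvents and the String type for userId are also assumptions.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import scala.util.Try

// Assumption: userId is a String, because the sample IDs are letters
case class Event(date: String, userId: String, keyword: String)

// Hypothetical helper: splits on tabs; the -1 limit keeps trailing empty fields
def splitAtDelimiter(line: String): Array[String] = line.split("\t", -1)

// Hypothetical helper: epoch seconds to "yyyy-MM-dd".
// Throws NumberFormatException when the input is not a number.
def convertToDateString(ts: String): String = {
  // Created per call because SimpleDateFormat is not thread-safe
  val df = new SimpleDateFormat("yyyy-MM-dd")
  df.setTimeZone(TimeZone.getTimeZone("UTC")) // assumption: render dates in UTC
  df.format(new Date(ts.trim.toLong * 1000L))
}

// Turns raw lines into events, skipping lines that cannot be parsed
def toEvents(lines: Iterator[String]): Iterator[Event] =
  lines.flatMap { line =>
    splitAtDelimiter(line).toList match {
      case ts :: id :: kw :: Nil =>
        Try(convertToDateString(ts)).toOption.map(Event(_, id, kw))
      case _ => None
    }
  }

val lines = Iterator(
  "timestamp\tuserId\tkeyword", // header: timestamp is not numeric, so it is skipped
  "1405377264\tA\tgoogle",
  "1405378945\tB"               // only two columns, so it is skipped
)
val events = toEvents(lines).toList
```

In real use the iterator would come from scala.io.Source.fromFile("myFile.csv").getLines(); because the result is still a lazy Iterator, consume it before closing the source.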
