I'm new to Spark/Scala and have run into a problem. I need to do some data manipulation on a full month of data. To achieve this I defined a case class:
scala> case class zahiro(request_datetime: String, ip: String, host: String, request_uri: String, referer: String, useragent: String, uuid: String, country: String)
scala> val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*").map(_.split(",")).map(p => zahiro(p(0),p(1),p(2),p(3),p(4),p(5),p(6),p(7))).toDF()
However, the useragent field may or may not contain "," and is enclosed in double quotes.
What I would like to achieve is to conditionally replace "," with ";" when the comma is enclosed in double quotes.
In Pig I can use: xxx = FOREACH xxx GENERATE request_datetime, ip, host, uuid, country, impression_id, impression_datetime, REPLACE(useragent,',',';');
directly after defining the schema, but I would really like to do this in Scala and not via some pre-processing regex work.
Any help would be appreciated...
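To illustrate the problem (a minimal, Spark-free sketch; the sample log line below is made up), a plain split(",") breaks the quoted useragent into extra fields:

```scala
object SplitProblemDemo extends App {
  // Hypothetical log line: the quoted useragent itself contains commas.
  val line = """2015-05-01 00:00:00,1.2.3.4,example.com,/index,ref,"Mozilla/5.0 (X11, Linux, x86_64)",uuid-1,US"""

  // A naive split treats the commas inside the quotes as field delimiters.
  val fields = line.split(",")

  // 10 fields come out instead of the expected 8.
  println(fields.length)
  assert(fields.length == 10)
}
```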
Add the following map before splitting. The regex replaces a comma only when it is followed by an odd number of double quotes, i.e. only when the comma sits inside a quoted field (this assumes the quotes on each line are balanced):
val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*")
  // replace commas inside double-quoted fields with ';' before splitting
  .map(line => line.replaceAll(""",(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)""", ";"))
  .map(_.split(","))
  .map(p => zahiro(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7)))
expanding https://stackoverflow.com/users/3187921/user52045 's comment - match each double-quoted field and rewrite only the commas inside it:
val quoted = "\"[^\"]*\"".r
val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*")
  // rewrite each quoted field, turning its embedded commas into ';'
  // (quoteReplacement keeps any '$' or '\' in the useragent literal)
  .map(line => quoted.replaceAllIn(line,
    m => scala.util.matching.Regex.quoteReplacement(m.matched.replace(",", ";"))))
  .map(_.split(","))
  .map(p => zahiro(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7)))
works perfectly.
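As a Spark-free sanity check (the sample line is made up), the quote-aware replacement can be verified on a single line before wiring it into the RDD pipeline:

```scala
object QuoteAwareReplaceDemo extends App {
  // Replace a comma only when an odd number of double quotes follows it,
  // i.e. when the comma sits inside a quoted field (assumes balanced quotes).
  val insideQuotes = """,(?=[^"]*"(?:[^"]*"[^"]*")*[^"]*$)"""

  val line = """2015-05-01 00:00:00,1.2.3.4,example.com,/index,ref,"Mozilla/5.0 (X11, Linux, x86_64)",uuid-1,US"""
  val cleaned = line.replaceAll(insideQuotes, ";")
  val fields = cleaned.split(",")

  // The delimiter commas survive; only the useragent's commas became ';'.
  assert(fields.length == 8)
  assert(fields(5) == "\"Mozilla/5.0 (X11; Linux; x86_64)\"")
  println(fields.mkString("|"))
}
```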