
How to conditionally change commas into semicolons with Spark/Scala map / split

I'm new to Spark / Scala and have encountered a problem. I need to perform some data manipulations on a full month of data. To achieve this I defined a class:

scala> case class zahiro(request_datetime: String, ip: String, host: String, request_uri: String, referer: String, useragent: String, uuid: String, country: String)

scala> val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*").map(_.split(",")).map(p => zahiro(p(0),p(1),p(2),p(3),p(4),p(5),p(6),p(7))).toDF()

However, the useragent field may or may not contain "," and is enclosed in double quotes.

What I would like to achieve is conditionally replacing "," with ";" whenever the comma is enclosed in double quotes.

In Pig, I can use the following directly after defining the schema:

xxx = FOREACH xxx GENERATE request_datetime, ip, host, uuid, country, impression_id, impression_datetime, REPLACE(useragent,',',';');

However, I really want to do it in Scala and not via some pre-processing regex work.

Any help would be appreciated...

Add the following map before splitting:

val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*")
  // add the map: turn the literal sequence "," into ";"
  .map(line => line.replaceAll("\",\"", ";"))
  .map(_.split(","))
  .map(p => zahiro(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7)))

Expanding on user52045's comment (https://stackoverflow.com/users/3187921/user52045):

val lines = sc.textFile("s3://{bucket}/whatever/2015/05/*.*")
  // add the map: replace commas only inside double-quoted spans,
  // so the later split(",") no longer breaks the quoted useragent field
  .map(line => "\"[^\"]*\"".r.replaceAllIn(line, m => m.matched.replace(',', ';')))
  .map(_.split(","))
  .map(p => zahiro(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7)))

This works perfectly.
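For reference, the quote-aware replacement can be exercised outside Spark. This is a minimal sketch; the sample log line and the `commasToSemicolons` helper name are illustrative assumptions, not from the original post:

```scala
import scala.util.matching.Regex

object QuoteAwareReplace {
  // Matches each double-quoted span; assumes no escaped quotes inside fields.
  private val quoted: Regex = "\"[^\"]*\"".r

  // Replace commas with semicolons only inside double-quoted spans,
  // leaving the unquoted field delimiters intact.
  def commasToSemicolons(line: String): String =
    quoted.replaceAllIn(line, m => m.matched.replace(',', ';'))

  def main(args: Array[String]): Unit = {
    // hypothetical sample line in the shape of the zahiro schema
    val line = """2015-05-01 10:00:00,1.2.3.4,example.com,/index,ref,"Mozilla/5.0 (X11, Linux)",uuid-1,US"""
    val cleaned = commasToSemicolons(line)
    println(cleaned)
    println(cleaned.split(",").length) // 8 fields, matching the zahiro case class
  }
}
```

Because only quoted commas are rewritten, the subsequent `split(",")` yields exactly one element per schema field.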
