
Apache Spark RDD Split “|”

I am trying to produce a formatted CSV file from a pipe-delimited ("|") file using Apache Spark. The input file contains:

apple|ball|cat

Blacktown| Bela vista| Greenacre

x|y|z

I am trying with:

val name = sc.textFile("input.txt")
val split=name.map(line=>line.split("|")).map( x => (x(0),x(2)) )
split.foreach(println)

Output:

(x,y)

(a,p)

(B,a)

My required output is:

(apple,cat)

(Blacktown, Greenacre)

(x,z)

The String argument of the split function is a regular expression, and the pipe is a regex metacharacter, so to split on a literal pipe it has to be escaped:

line.split("\\|")

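Otherwise it is interpreted as an alternation between two empty patterns. An empty pattern matches at every position in the string, so the line is cut into individual characters, which is why the output in the question contains pairs such as (a,p).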
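The difference can be checked without Spark, since split here is the ordinary java.lang.String method. The following is a minimal sketch; the object name PipeSplitDemo and the sample line are only illustrative, and the Pattern.quote form is an alternative not mentioned above that builds a literal pattern without manual escaping.

```scala
import java.util.regex.Pattern

object PipeSplitDemo {
  val line = "apple|ball|cat"

  // Unescaped: "|" is an alternation of two empty patterns,
  // so the string is cut between characters instead of at the pipes.
  val unescaped: Array[String] = line.split("|")

  // Escaped: the pipe is matched literally, giving the three fields.
  val escaped: Array[String] = line.split("\\|")

  // Alternative: let Pattern.quote produce the literal pattern.
  val quoted: Array[String] = line.split(Pattern.quote("|"))

  def main(args: Array[String]): Unit = {
    println(unescaped.mkString("[", ", ", "]"))
    println(escaped.mkString("[", ", ", "]"))
    println(quoted.mkString("[", ", ", "]"))
  }
}
```

With the escaped or quoted pattern, indices 0 and 2 of the resulting array are the first and third fields, which is what the question asks for.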

You can also use the variant that accepts a Char literal, which is not treated as a regular expression:

line.split('|')

or an Array of Char literals:

line.split(Array('|'))

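It is also better to validate the input, for example by keeping only the lines that split into exactly three fields:

// name is the RDD from the question; lines without exactly three fields are dropped
name.map(_.split("\\|")).collect {
  case Array(x, _, y) => (x, y)
}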
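The same validation can be tried on a plain Scala List, which offers the same map and collect(partial function) calls as an RDD. This is only a sketch under some assumptions: the object name PipeToCsv and the malformed line "broken|line" are made up for illustration, and trimming the fields is an optional choice that removes the leading space seen in the " Greenacre" field of the sample input.

```scala
object PipeToCsv {
  // Sample lines from the question plus one hypothetical malformed line.
  val lines: List[String] = List(
    "apple|ball|cat",
    "Blacktown| Bela vista| Greenacre",
    "x|y|z",
    "broken|line"
  )

  // Keep the first and third field of lines with exactly three fields;
  // any other line does not match the pattern and is dropped by collect.
  val pairs: List[(String, String)] =
    lines.map(_.split("\\|")).collect {
      case Array(first, _, third) => (first.trim, third.trim)
    }

  // Join each pair with a comma to form one CSV line.
  val csv: List[String] = pairs.map { case (first, third) => s"$first,$third" }

  def main(args: Array[String]): Unit = csv.foreach(println)
}
```

On an RDD the map and collect steps are written the same way, and the resulting strings could then be written out with saveAsTextFile.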
