
DataFrame from a text file in Spark

I'm trying to create a DataFrame from a text file. For the sample input (Input1) below, the code works fine.

Input1

1,5

2,6

3,7

4,8

Output1

+---+----+
| id|name|
+---+----+
|  1|   5|
|  2|   6|
|  3|   7|
|  4|   8|
+---+----+

However, when I changed the input (Input2), I got no output.

Input2

1,"a,b,c"

2,"d,e,f"

3,"a,b,c"

4,"a,d,f"

Output2

+---+----+
| id|name|
+---+----+
+---+----+

Code

    case class Record(id: Int, name: String)

    val input = sc.textFile(inputFile)
      .map(x => x.split(","))
      .collect { case Array(id, name) => Record(id.toInt, name) }
    input.toDF().show()

Expected output format for Input2

+---+-----+------+-----+
| id|name1| name2|name3|
+---+-----+------+-----+
|  1|    a|     b|    c|
|  2|    d|     e|    f|
|  3|    a|     b|    c|
|  4|    a|     d|    f|
+---+-----+------+-----+

I know I need to change both the code and the case class so that they match the data format of Input2, but I can't work out what changes to make. Please advise.

Assuming you are using Spark 2, you can simply do

val df = spark.read.csv(inputFile)

The reader honours quotes, so `"a,b,c"` stays a single field; you can then split that second column apart in subsequent steps.
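A minimal sketch of those steps, assuming a Spark 2 session named `spark` is in scope and that the quoted field always holds exactly three values:

```scala
import org.apache.spark.sql.functions.split

// read the CSV; the reader treats "a,b,c" as one field
val df = spark.read.csv(inputFile).toDF("id", "name")

// split the quoted field and promote each piece to its own column
val parts = split(df("name"), ",")
val wide = df.select(
  df("id"),
  parts.getItem(0).as("name1"),
  parts.getItem(1).as("name2"),
  parts.getItem(2).as("name3")
)
wide.show()
```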

At the moment, you're splitting an entire line that contains more than one comma, but matching only on an `Array` of exactly two elements, so every row is discarded.
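You can see the mismatch without Spark: splitting an Input2 line on every comma yields four tokens, so the two-element pattern never matches and `collect` silently drops the row:

```scala
val line = "1,\"a,b,c\""
val tokens = line.split(",")   // Array(1, "a, b, c") — four tokens

// collect keeps only rows matching the partial function; here, none do
val kept = Seq(tokens).collect { case Array(id, name) => (id, name) }

println(tokens.length)   // 4 — not the 2 the pattern expects
println(kept.isEmpty)    // true — hence the empty DataFrame
```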

You are trying to use the first field as the id column and the rest of the comma-separated characters inside the quotes as the name column. For that you only need to change your logic a little, as below:

val input = sc.textFile(inputFile).map(x=>x.split(",")).map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()

and of course the case class stays as you already have it:

case class Record(id: Int, name: String)

You should get the following DataFrame:

+---+-------+
| id|   name|
+---+-------+
|  1|"a,b,c"|
|  2|"d,e,f"|
|  3|"a,b,c"|
|  4|"a,d,f"|
+---+-------+

If you don't want the quotes, you can remove them with a `replace` call:

val input = sc.textFile(inputFile)
  .map(x => x.replace("\"", "").split(","))
  .map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()

which gives

+---+-----+
| id| name|
+---+-----+
|  1|a,b,c|
|  2|d,e,f|
|  3|a,b,c|
|  4|a,d,f|
+---+-----+
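The per-line logic above can be checked without a Spark session; note that `tail.mkString(",")` rejoins every piece after the first, which is what restores `a,b,c` as a single name:

```scala
case class Record(id: Int, name: String)

// strip quotes, split on commas, then rebuild the name from the tail
def parse(line: String): Record = {
  val fields = line.replace("\"", "").split(",")
  Record(fields.head.toInt, fields.tail.mkString(","))
}

println(parse("1,\"a,b,c\""))   // Record(1,a,b,c)
```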

I hope the answer is helpful.

By the way, it's better to use the DataFrame CSV reader for such files, since it treats commas inside quotes as part of the value rather than as delimiters.

sqlContext.read.csv(inputFile).toDF("id", "name").show(false)

which gives the same output DataFrame as above.

I tried the code below and got the output I needed.

    case class Record(id: Int, name: String, name1: String, name2: String)

    val input = sc.textFile(inputFile).map(x => x.replaceAll("\"", ""))
    val input1 = input.map(x => x.split(",")).collect {
      case Array(id, name, name1, name2) => Record(id.toInt, name, name1, name2)
    }
    input1.toDF().show()
