I'm trying to create a DataFrame from a text file. For the sample input (Input1) below, the code works fine.
Input1
1,5
2,6
3,7
4,8
Output1
+---+----+
| id|name|
+---+----+
| 1| 5|
| 2| 6|
| 3| 7|
| 4| 8|
+---+----+
However, when I change the input (Input2), I get no output.
Input2
1,"a,b,c"
2,"d,e,f"
3,"a,b,c"
4,"a,d,f"
Output2
+---+----+
| id|name|
+---+----+
+---+----+
Code
{
val input = sc.textFile(inputFile).map(x=>x.split(",")).collect {
case Array(id,name) => Record(id.toInt, name)
}
input.toDF().show()
}
case class Record(id: Int, name: String)
Expected output format for Input2
+---+-----+------+-----+
| id|name1| name2|name3|
+---+-----+------+-----+
| 1| a| b| c|
| 2| d| e| f|
| 3| a| b| c|
| 4| a| d| f|
+---+-----+------+-----+
I should change the code and the case class so that the compiler understands the data format of Input2, but I can't work out what changes are needed. Please advise.
Assuming you are using Spark 2, you can simply do
val df = spark.read.csv(inputFile)
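You can then split the second column into separate columns.
At the moment, you are splitting each line on every comma, so a line from Input2 yields more than two elements, while your pattern only matches an Array of exactly two elements. The non-matching lines are silently dropped by collect, which is why the output is empty.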
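To see why Input2 produces an empty result, here is a minimal sketch in plain Scala with no Spark involved. The names `SplitDemo`, `parse` and `fieldCounts` are made up for illustration, and an in-memory `Seq` stands in for the RDD; `collect` with a partial function behaves the same way on both.

```scala
object SplitDemo {
  // Stand-in for the case class in the question.
  case class Record(id: Int, name: String)

  // Same logic as the question: split on every comma, keep only 2-element arrays.
  def parse(lines: Seq[String]): Seq[Record] =
    lines.map(_.split(",")).collect {
      case Array(id, name) => Record(id.toInt, name)
    }

  // How many pieces each line breaks into after a naive split.
  def fieldCounts(lines: Seq[String]): Seq[Int] =
    lines.map(_.split(",").length)
}
```

A line such as `1,5` splits into two pieces and matches the pattern, whereas `1,"a,b,c"` splits into four pieces (`1`, `"a`, `b`, `c"`), so the pattern never matches and the line is dropped without any error.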
You are trying to make the first digit the id column and the rest of the comma-separated characters inside the double quotes the name column. For that you have to change your logic a little, as below:
val input = sc.textFile(inputFile).map(x=>x.split(",")).map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()
and of course the case class stays as you have it
case class Record(id: Int, name: String)
You should get the following DataFrame
+---+-------+
| id| name|
+---+-------+
| 1|"a,b,c"|
| 2|"d,e,f"|
| 3|"a,b,c"|
| 4|"a,d,f"|
+---+-------+
If you don't want the double quotes, you can add a replace call:
val input = sc.textFile(inputFile).map(x=>x.replace("\"", "").split(",")).map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()
You should then have
+---+-----+
| id| name|
+---+-----+
| 1|a,b,c|
| 2|d,e,f|
| 3|a,b,c|
| 4|a,d,f|
+---+-----+
I hope the answer is helpful.
By the way, it's better to use sqlContext to read such files when you want to ignore the commas inside double quotes.
sqlContext.read.format("csv").load(inputFile).toDF("id", "name").show(false)
You should get the same output DataFrame as above.
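For readers who want to see what "ignoring the commas inside double quotes" means without going through a CSV reader, here is a rough sketch in plain Scala. `QuotedCsv` and `splitLine` are hypothetical names, and the regex is a common idiom rather than a full CSV parser: it assumes quotes are balanced on each line and does not handle escaped quotes or fields that span several lines.

```scala
object QuotedCsv {
  // Match a comma only when an even number of double quotes follows it,
  // i.e. when the comma sits outside any quoted field.
  private val commaOutsideQuotes = """,(?=(?:[^"]*"[^"]*")*[^"]*$)"""

  // Split one line into fields and strip the surrounding quotes from each field.
  def splitLine(line: String): List[String] =
    line
      .split(commaOutsideQuotes, -1)
      .toList
      .map(_.trim.stripPrefix("\"").stripSuffix("\""))
}
```

With this, `QuotedCsv.splitLine("1,\"a,b,c\"")` gives two fields, `1` and `a,b,c`, so the two-element pattern match from the question would succeed again. For real data a CSV reader is still the safer choice.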
I tried the code below and got the output I needed.
{
val input = sc.textFile(inputFile).map(x=>x.replaceAll("\"",""))
val input1 = input.map(x=>x.split(",")).collect { case Array(id,name,name1, name2) => Record(id.toInt, name, name1, name2) }
input1.toDF().show() // display the resulting DataFrame
}
case class Record(id: Int, name: String, name1 : String, name2 : String)
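One caveat with the approach above is that any line that does not contain exactly four fields is dropped silently, and a non-numeric id throws. The sketch below is one possible way to surface such lines instead, in plain Scala; `ThreeNames`, `Record3` and `parseLine` are hypothetical names chosen to match the expected column names (name1, name2, name3), not part of the original code.

```scala
import scala.util.Try

object ThreeNames {
  // Hypothetical case class mirroring the expected output columns.
  case class Record3(id: Int, name1: String, name2: String, name3: String)

  // Right(record) for a well-formed line, Left(original line) otherwise.
  def parseLine(line: String): Either[String, Record3] =
    line.replace("\"", "").split(",").map(_.trim) match {
      case Array(id, n1, n2, n3) =>
        // A non-numeric id becomes a Left instead of throwing.
        Try(id.toInt).toOption.map(Record3(_, n1, n2, n3)).toRight(line)
      case _ =>
        Left(line)
    }
}
```

On an RDD this could be used as `input.map(ThreeNames.parseLine)`, keeping the `Right` values for the DataFrame and logging or counting the `Left` values, assuming that fits how the job reports bad input.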