
Dataframe for textfile Spark

I'm trying to create a data frame from a text file. For the sample input (Input1) below, the code works fine.

Input1

1,5

2,6

3,7

4,8

Output1

+---+----+
| id|name|
+---+----+
|  1|   5|
|  2|   6|
|  3|   7|
|  4|   8|
+---+----+

However, when I changed the input (Input2), I'm not getting any output.

Input2

1,"a,b,c"

2,"d,e,f"

3,"a,b,c"

4,"a,d,f"

Output2

+---+----+
| id|name|
+---+----+
+---+----+

Code

    case class Record(id: Int, name: String)

    val input = sc.textFile(inputFile)
      .map(_.split(","))
      .collect { case Array(id, name) => Record(id.toInt, name) }
    input.toDF().show()

Expected output format for Input2

+---+-----+-----+-----+
| id|name1|name2|name3|
+---+-----+-----+-----+
|  1|    a|    b|    c|
|  2|    d|    e|    f|
|  3|    a|    b|    c|
|  4|    a|    d|    f|
+---+-----+-----+-----+

I should make changes to the code and the case class as well, so that the compiler understands the data format for Input2, but I'm not sure what changes I need to make. Please advise.

Assuming you are using Spark 2, you can simply do

val df = spark.read.csv(inputFile)

This reader honors the quotes, so `"a,b,c"` arrives as a single field, and you can split the second column apart in a follow-up step.
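One way to do that follow-up split — a sketch, assuming the input file above and that the two columns read by `spark.read.csv` are renamed to `id` and `name` first:

```scala
import org.apache.spark.sql.functions.{col, split}

// csv reading respects the quotes, so "a,b,c" is one field.
val df = spark.read.csv(inputFile).toDF("id", "name")

// Split the quoted field on commas and pull each part into its own column.
val wide = df
  .withColumn("parts", split(col("name"), ","))
  .select(
    col("id"),
    col("parts").getItem(0).as("name1"),
    col("parts").getItem(1).as("name2"),
    col("parts").getItem(2).as("name3"))
wide.show()
```

`split` returns an array column, and `getItem(i)` indexes into it; rows with fewer than three parts would get nulls rather than fail.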

At the moment, you're trying to read an entire line containing more than one comma, but only matching on an Array of two elements.
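To see why that yields an empty result, note that splitting a line like `1,"a,b,c"` on every comma ignores the quotes and produces four elements, so the two-element pattern never matches. A quick sketch, using a plain Scala Seq as a stand-in for the RDD:

```scala
case class Record(id: Int, name: String)

// Two raw lines of Input2, standing in for sc.textFile(inputFile).
val lines = Seq("1,\"a,b,c\"", "2,\"d,e,f\"")

// split(",") cuts on every comma, quotes included: Array(1, "a, b, c") — 4 elements.
val split = lines.map(_.split(","))

// The partial function only matches two-element arrays,
// so collect keeps nothing and the resulting DataFrame is empty.
val records = split.collect { case Array(id, name) => Record(id.toInt, name) }
println(records.isEmpty)  // true
```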

You are trying to make the first field the id column and the rest of the comma-separated characters inside the quotes the name column. For that you have to change your logic a little, and you should be fine, as below:

val input = sc.textFile(inputFile).map(x=>x.split(",")).map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()
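The head/tail step can be checked in isolation with plain Scala on one line of Input2:

```scala
// One raw line of Input2.
val line = "1,\"a,b,c\""

// Strip the quotes, then split on every comma: Array(1, a, b, c).
val cleaned = line.replace("\"", "").split(",")

// First element becomes the id; the rest are re-joined into one name field.
val id = cleaned.head.toInt          // 1
val name = cleaned.tail.mkString(",") // "a,b,c"
```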

and of course the case class is as you already have it:

case class Record(id: Int, name: String)

You should get the following dataframe:

+---+-------+
| id|   name|
+---+-------+
|  1|"a,b,c"|
|  2|"d,e,f"|
|  3|"a,b,c"|
|  4|"a,d,f"|
+---+-------+

If you don't want the quotes, you can add a replace call:

val input = sc.textFile(inputFile)
  .map(_.replace("\"", "").split(","))
  .map(x => Record(x.head.toInt, x.tail.mkString(",")))
input.toDF().show()

and you should get:

+---+-----+
| id| name|
+---+-----+
|  1|a,b,c|
|  2|d,e,f|
|  3|a,b,c|
|  4|a,d,f|
+---+-----+

I hope the answer is helpful.

By the way, it's better to use sqlContext to read such files when you want to ignore the commas inside the quotes:

sqlContext.read.format("csv").load(inputFile).toDF("id", "name").show(false)

which should give you the output dataframe above.

I tried with the below code and got the output I needed.

    case class Record(id: Int, name1: String, name2: String, name3: String)

    val input = sc.textFile(inputFile).map(_.replaceAll("\"", ""))
    val input1 = input
      .map(_.split(","))
      .collect { case Array(id, name1, name2, name3) => Record(id.toInt, name1, name2, name3) }
    input1.toDF().show()
