
Reading ambiguous column names in a Spark SQL DataFrame using Scala

I have duplicate columns in a text file. When I load that text file with the Spark Scala code below, it loads into a DataFrame successfully and I can see the first 20 rows with df.show().

Full code:

val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
// number of fields, taken from the first line of the file
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
// header line, used to build the schema
val field = rdd.zipWithIndex.filter(_._2 == 0).map(_._1).first()
val fields = field.split("[|]").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes, fieldCount))

val df = hivesql.createDataFrame(rowRDD, schema)
df.registerTempTable("Sample_File")
df.show()
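(getARow is not defined in the question; presumably it normalizes each split line to fieldCount fields and wraps it in a Row. A minimal sketch, assuming that behaviour:)

import org.apache.spark.sql.Row

// Hypothetical helper, not shown in the question: pad short lines with nulls
// and truncate long ones so every Row has exactly fieldCount fields.
def getARow(attributes: Array[String], fieldCount: Int): Row =
  Row.fromSeq(attributes.padTo(fieldCount, null).take(fieldCount))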

Up to this point the code works fine, but as soon as I run the query below it gives me an error.

val results = hivesql.sql("Select id,sequence,sequence from Sample_File")

So I have two columns with the same name in the text file, i.e. sequence. How can I access those two columns? I tried sequence#2, but it is still not working.

Spark version: 1.6.0
Scala version: 2.10.5

Result of df.printSchema():
|-- id: string (nullable = true)
|-- sequence: string (nullable = true)
|-- sequence: string (nullable = true)

The code below might help you resolve your problem; I have tested it in Spark 1.6.3.

val sc = new SparkContext(conf)
val hivesql = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/...FilePath.../*")
val fieldCount = rdd.map(_.split("[|]")).map(x => x.size).first()
val field = rdd.zipWithIndex.filter(_._2==0).map(_._1).first()
val fields = field.split("[|]").map(fieldName =>StructField(fieldName, StringType, nullable=true))
val schema = StructType(fields)
val rowRDD = rdd.map(_.split("[|]")).map(attributes => getARow(attributes,fieldCount))

val df = hivesql.createDataFrame(rowRDD, schema)

// give every column a unique name; toDF takes exactly one name per column, in order
val colNames = Seq("id","sequence1","sequence2")
val df1 = df.toDF(colNames: _*)

df1.registerTempTable("Sample_File")

val results = hivesql.sql("select id,sequence1,sequence2 from Sample_File")
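Note that toDF expects exactly as many names as the DataFrame has columns. Renaming with withColumnRenamed would not help here: it renames every column whose name matches the old one, so both sequence columns would end up colliding again.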

I second @smart_coder's answer, though my approach is slightly different. Please find it below.

You need unique column names to run a query through hivesql.sql.

You can rename the columns dynamically with the code below.

Your code:

val df = hivesql.createDataFrame(rowRDD, schema)

After this point we need to remove the ambiguity; below is the solution:

var list = df.schema.map(_.name).toList

for(i <- 0 to list.size - 1){
    // how many times does this name occur in the list?
    val cont = list.count(_ == list(i))
    val col = list(i)

    if(cont != 1){
        // duplicate name: append this occurrence's index to make it unique
        list = list.take(i) ++ List(col + i) ++ list.drop(i + 1)
    }
}

val df1 = df.toDF(list: _*)

You would get the output below from df1.printSchema():

|-- id: string (nullable = true)
|-- sequence1: string (nullable = true)
|-- sequence: string (nullable = true)

So basically, we collect all the column names into a list and check whether any name occurs more than once. If a name is repeated, we append its index to that occurrence, and then we create a new DataFrame df1 from the list of renamed columns.

I have tested this in Spark 2.4, but it should work in 1.6 as well.
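For reference, the same idea can also be written without mutation. A minimal functional sketch (note that, unlike the loop above, this renames every occurrence of a duplicated name, so the result here would be sequence1 and sequence2):

// Append the position index to every occurrence of a duplicated name.
val names = df.schema.map(_.name)
val duplicated = names.groupBy(identity).collect { case (n, occ) if occ.size > 1 => n }.toSet
val uniqueNames = names.zipWithIndex.map { case (n, i) => if (duplicated(n)) n + i else n }
val df1 = df.toDF(uniqueNames: _*)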
