简体   繁体   中英

Using column after renaming it in Apache Spark

I am trying to understand why Spark is behaving differently in somewhat same scenario.I renamed two columns and tried to use both of them in some calculation but one statement is throwing en error with unable to find the renamed column.Below is the code

intermediateDF = intermediateDF.drop("GEO.id")
                                       .withColumnRenamed("GEO.id2", "id")
                                       .withColumnRenamed("GEO.display-label", "label")
                                       .withColumn("stateid", functions.expr("int(id/1000)"))
                                       .withColumn("countyId", functions.expr("id%1000"))
                                       //.withColumn("countyState", functions.split(intermediateDF.col("label"), ","))
                                       .withColumnRenamed("rescen42010", "real2010")
                                       .drop("resbase42010")
                                       .withColumnRenamed("respop72010", "est2010")
                                       .withColumnRenamed("respop72011", "est2011")
                                       .withColumnRenamed("respop72012", "est2012")
                                       .withColumnRenamed("respop72013", "est2013")
                                       .withColumnRenamed("respop72014", "est2014")
                                       .withColumnRenamed("respop72015", "est2015")
                                       .withColumnRenamed("respop72016", "est2016")
                                       .withColumnRenamed("respop72017", "est2017")

The line commented out is the one that is throwing below error

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "label" among (GEO.id, GEO.id2, GEO.display-label, rescen42010, resbase42010, respop72010, respop72011, respop72012, respop72013, respop72014, respop72015, respop72016, respop72017);

Can someone please help me out in understanding why Spark can find one renamed column(from GEO.id2 to id ), runs calculations on it but fails on other (from GEO.display-label to label). I am using Apache Spark 3 with Java.Thanks

Try this syntax:

 .withColumn("countyState", functions.split(col("label"), ","))

It should work just fine.

Check below code.

  intermediateDF.select( \
      col("GEO.id2").alias("id"), \
      functions.expr("int(id/1000)").alias("stateid"), \
      functions.expr("id%1000").alias("countyId"), \
      split(col("GEO.display-label"),",").alias("countyState"), \
      col("rescen42010").as("real2010"), \
      col("respop72010").alias("est2010"), \
      col("respop72011").alias("est2011"), \
      col("respop72012").alias("est2012"), \
      col("respop72013").alias("est2013"), \
      col("respop72014").alias("est2014"), \
      col("respop72015").alias("est2015"), \
      col("respop72016").alias("est2016"), \
      col("respop72017").alias("est2017"))


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM