
Using column after renaming it in Apache Spark

I am trying to understand why Spark behaves differently in what seem like identical scenarios. I renamed two columns and tried to use both of them in some calculations, but one statement throws an error saying it cannot find the renamed column. Below is the code:

intermediateDF = intermediateDF.drop("GEO.id")
                               .withColumnRenamed("GEO.id2", "id")
                               .withColumnRenamed("GEO.display-label", "label")
                               .withColumn("stateid", functions.expr("int(id/1000)"))
                               .withColumn("countyId", functions.expr("id%1000"))
                               //.withColumn("countyState", functions.split(intermediateDF.col("label"), ","))
                               .withColumnRenamed("rescen42010", "real2010")
                               .drop("resbase42010")
                               .withColumnRenamed("respop72010", "est2010")
                               .withColumnRenamed("respop72011", "est2011")
                               .withColumnRenamed("respop72012", "est2012")
                               .withColumnRenamed("respop72013", "est2013")
                               .withColumnRenamed("respop72014", "est2014")
                               .withColumnRenamed("respop72015", "est2015")
                               .withColumnRenamed("respop72016", "est2016")
                               .withColumnRenamed("respop72017", "est2017");

The commented-out line is the one that throws the error below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "label" among (GEO.id, GEO.id2, GEO.display-label, rescen42010, resbase42010, respop72010, respop72011, respop72012, respop72013, respop72014, respop72015, respop72016, respop72017);

Can someone please help me understand why Spark can find one renamed column (GEO.id2 to id) and run calculations on it, but fails on the other (GEO.display-label to label)? I am using Apache Spark 3 with Java. Thanks.

Try this syntax:

 .withColumn("countyState", functions.split(col("label"), ","))

It should work just fine. The unqualified functions.col("label") builds an unresolved column reference that Spark resolves lazily, against the plan built up to that point in the chain, where the rename to label has already taken effect. intermediateDF.col("label"), by contrast, is resolved eagerly against the schema the intermediateDF variable pointed to before the chain ran, and that schema still only contains GEO.display-label, hence the AnalysisException. The functions.expr("int(id/1000)") calls work for the same reason: expressions passed as strings are also resolved lazily.
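
A minimal sketch of the difference (the name df is mine, standing in for any DataFrame that still has the raw GEO.display-label column):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.functions;

    // Works: functions.col("label") stays unresolved until Spark analyzes
    // the new plan, in which the rename to "label" has already happened.
    Dataset<Row> ok = df.withColumnRenamed("GEO.display-label", "label")
                        .withColumn("countyState",
                                functions.split(functions.col("label"), ","));

    // Fails: df.col("label") is resolved immediately against df's schema,
    // which only has "GEO.display-label" -> AnalysisException.
    Dataset<Row> broken = df.withColumnRenamed("GEO.display-label", "label")
                            .withColumn("countyState",
                                    functions.split(df.col("label"), ","));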

Check the code below, which expresses the whole transformation as a single select over the original column names:

  from pyspark.sql.functions import col, expr, split

  intermediateDF.select(
      # Backticks escape the literal dots/dash in the raw column names;
      # without them Spark would read "GEO.id2" as field id2 of a struct GEO.
      col("`GEO.id2`").alias("id"),
      # Derive these from the original column: the "id" alias defined in
      # this same select is not visible to its sibling expressions.
      expr("int(`GEO.id2` / 1000)").alias("stateid"),
      expr("`GEO.id2` % 1000").alias("countyId"),
      split(col("`GEO.display-label`"), ",").alias("countyState"),
      col("rescen42010").alias("real2010"),
      col("respop72010").alias("est2010"),
      col("respop72011").alias("est2011"),
      col("respop72012").alias("est2012"),
      col("respop72013").alias("est2013"),
      col("respop72014").alias("est2014"),
      col("respop72015").alias("est2015"),
      col("respop72016").alias("est2016"),
      col("respop72017").alias("est2017"))

