
Using column after renaming it in Apache Spark

I am trying to understand why Spark behaves differently in what seem like identical scenarios. I renamed two columns and tried to use both of them in some calculations, but one statement throws an error saying it cannot find the renamed column. Below is the code:

intermediateDF = intermediateDF.drop("GEO.id")
                               .withColumnRenamed("GEO.id2", "id")
                               .withColumnRenamed("GEO.display-label", "label")
                               .withColumn("stateid", functions.expr("int(id/1000)"))
                               .withColumn("countyId", functions.expr("id%1000"))
                               //.withColumn("countyState", functions.split(intermediateDF.col("label"), ","))
                               .withColumnRenamed("rescen42010", "real2010")
                               .drop("resbase42010")
                               .withColumnRenamed("respop72010", "est2010")
                               .withColumnRenamed("respop72011", "est2011")
                               .withColumnRenamed("respop72012", "est2012")
                               .withColumnRenamed("respop72013", "est2013")
                               .withColumnRenamed("respop72014", "est2014")
                               .withColumnRenamed("respop72015", "est2015")
                               .withColumnRenamed("respop72016", "est2016")
                               .withColumnRenamed("respop72017", "est2017");

The commented-out line is the one that throws the error below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "label" among (GEO.id, GEO.id2, GEO.display-label, rescen42010, resbase42010, respop72010, respop72011, respop72012, respop72013, respop72014, respop72015, respop72016, respop72017);

Can someone please help me understand why Spark can find one renamed column (GEO.id2 to id) and run calculations on it, but fails on the other (GEO.display-label to label)? I am using Apache Spark 3 with Java. Thanks.

Try this syntax:

 .withColumn("countyState", functions.split(col("label"), ","))

It should work just fine. The unqualified functions.col("label") builds an unresolved column reference that Spark resolves lazily, against the plan built up to that point in the chain, where the rename to label has already taken effect. intermediateDF.col("label"), by contrast, is resolved eagerly against the schema the intermediateDF variable pointed to before the chain ran, and that schema still only contains GEO.display-label, hence the AnalysisException. The functions.expr("int(id/1000)") calls work for the same reason: expressions passed as strings are also resolved lazily.
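
A minimal sketch of the difference (the name df is mine, standing in for any DataFrame that still has the raw GEO.display-label column):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.functions;

    // Works: functions.col("label") stays unresolved until Spark analyzes
    // the new plan, in which the rename to "label" has already happened.
    Dataset<Row> ok = df.withColumnRenamed("GEO.display-label", "label")
                        .withColumn("countyState",
                                functions.split(functions.col("label"), ","));

    // Fails: df.col("label") is resolved immediately against df's schema,
    // which only has "GEO.display-label" -> AnalysisException.
    Dataset<Row> broken = df.withColumnRenamed("GEO.display-label", "label")
                            .withColumn("countyState",
                                    functions.split(df.col("label"), ","));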

Check the code below, which expresses the whole transformation as a single select over the original column names:

  from pyspark.sql.functions import col, expr, split

  intermediateDF.select(
      # Backticks escape the literal dots/dash in the raw column names;
      # without them Spark would read "GEO.id2" as field id2 of a struct GEO.
      col("`GEO.id2`").alias("id"),
      # Derive these from the original column: the "id" alias defined in
      # this same select is not visible to its sibling expressions.
      expr("int(`GEO.id2` / 1000)").alias("stateid"),
      expr("`GEO.id2` % 1000").alias("countyId"),
      split(col("`GEO.display-label`"), ",").alias("countyState"),
      col("rescen42010").alias("real2010"),
      col("respop72010").alias("est2010"),
      col("respop72011").alias("est2011"),
      col("respop72012").alias("est2012"),
      col("respop72013").alias("est2013"),
      col("respop72014").alias("est2014"),
      col("respop72015").alias("est2015"),
      col("respop72016").alias("est2016"),
      col("respop72017").alias("est2017"))

