
How to rename a column in a dataframe

I have a dataframe, df2, with 2 columns (DEST_COUNTRY_NAME, count).

I created a new dataframe as follows:

df3 = df2.groupBy("DEST_COUNTRY_NAME").sum('count')

I want to rename the "sum(count)" column to "destination_total":

df5 = df3.selectExpr("cast(DEST_COUNTRY_NAME as string) DEST_COUNTRY_NAME", "cast(sum(count) as int) destination_total")

However, I get the following error:

Py4JJavaError                             Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

4 frames
Py4JJavaError: An error occurred while calling o1195.selectExpr.
: org.apache.spark.sql.AnalysisException: cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;
'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]
+- AnalysisBarrier
      +- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]
         +- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]
            +- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:122)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:127)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:95)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3296)
    at org.apache.spark.sql.Dataset.select(Dataset.scala:1307)
    at org.apache.spark.sql.Dataset.selectExpr(Dataset.scala:1342)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
/content/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '`count`' given input columns: [DEST_COUNTRY_NAME, sum(count)]; line 1 pos 5;\n'Project [cast(DEST_COUNTRY_NAME#1110 as string) AS DEST_COUNTRY_NAME#1154, cast('count as int) AS count#1155]\n+- AnalysisBarrier\n      +- Aggregate [DEST_COUNTRY_NAME#1110], [DEST_COUNTRY_NAME#1110, sum(cast(count#1112 as bigint)) AS sum(count)#1120L]\n         +- Project [cast(DEST_COUNTRY_NAME#1090 as string) AS DEST_COUNTRY_NAME#1110, cast(ORIGIN_COUNTRY_NAME#1091 as string) AS ORIGIN_COUNTRY_NAME#1111, cast(count#1092 as int) AS count#1112]\n            +- Relation[DEST_COUNTRY_NAME#1090,ORIGIN_COUNTRY_NAME#1091,count#1092] csv\n"

I want to rename the "sum(count)" column to "destination_total". How do I fix this? I am not using pandas, I am using Spark.

df5 = df3.withColumnRenamed("sum(count)","destination_total")
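
For context, a minimal self-contained sketch of that fix; the toy rows and the SparkSession setup are assumptions for illustration, and only the column names come from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the question's df2 (DEST_COUNTRY_NAME, count).
df2 = spark.createDataFrame(
    [("United States", 15), ("United States", 1), ("Ireland", 344)],
    ["DEST_COUNTRY_NAME", "count"],
)

df3 = df2.groupBy("DEST_COUNTRY_NAME").sum("count")   # aggregate column is auto-named "sum(count)"
df5 = df3.withColumnRenamed("sum(count)", "destination_total")
df5.printSchema()   # confirms DEST_COUNTRY_NAME plus destination_total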

Or you can do the aggregation yourself instead of calling sum directly:

df3 = df2.groupBy("DEST_COUNTRY_NAME").agg(sum('count').alias('count'))
from pyspark.sql.functions import *
df3 = df2.groupBy("DEST_COUNTRY_NAME") \
         .agg(sum("count").alias("destination_total"))

Here are two methods you can use to rename it, assuming there are only two columns in your dataframe (toDF replaces every column name positionally, so all columns must be listed).

df3 = df2.groupBy("DEST_COUNTRY_NAME").sum('count').toDF(*['DEST_COUNTRY_NAME', 'destination_total'])

Or you can rename it by calling the alias function, as shown below:

df3.select("DEST_COUNTRY_NAME", col("sum(count)").alias("destination_total"))

PS: Don't forget to import col.

from pyspark.sql.functions import col
