Modifying column value of a dataframe in Pyspark
I have this dataframe, which stores the stock prices of a company:
+----------+----------+----------+----------+----------+----------+--------+
|       _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----------+----------+----------+----------+----------+----------+--------+
|2009-01-02|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009-01-05|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009-01-06|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009-01-07|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009-01-08|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009-01-09|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----------+----------+----------+----------+----------+----------+--------+
I want to modify the first column to keep only the year, so that I can use groupBy to compute the average stock price for each year, like this:
+----+----------+----------+----------+----------+----------+--------+
| _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----+----------+----------+----------+----------+----------+--------+
|2009|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----+----------+----------+----------+----------+----------+--------+
But I don't know how to do this in PySpark. Can someone help me? Thanks!
Use the year function:
import pyspark.sql.functions as F

# year() extracts the year as an integer; Spark casts the
# yyyy-MM-dd string in _c0 to a date automatically.
df = df.withColumn("_c0", F.year(F.col("_c0")))