
Modifying a column value of a DataFrame in PySpark

I have this DataFrame, which stores the stock prices of a company:

+----------+----------+----------+----------+----------+----------+--------+
|       _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----------+----------+----------+----------+----------+----------+--------+
|2009-01-02|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009-01-05|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009-01-06|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009-01-07|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009-01-08|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009-01-09|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----------+----------+----------+----------+----------+----------+--------+

I would like to modify the first column to keep just the year, so I can then calculate the average stock price per year using groupBy, like this:

+----+----------+----------+----------+----------+----------+--------+
| _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----+----------+----------+----------+----------+----------+--------+
|2009|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----+----------+----------+----------+----------+----------+--------+

But I have no idea how to do this in PySpark. Could someone help me? Thanks!

Use the year function from pyspark.sql.functions:

import pyspark.sql.functions as F

# Overwrite _c0 with just the year. year() works on date/timestamp
# columns and also parses ISO date strings such as "2009-01-02".
df = df.withColumn("_c0", F.year(F.col("_c0")))
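
From there, the yearly average the question asks about is a single groupBy/agg. A minimal sketch, assuming _c4 holds the closing price (the _c* names are just the CSV defaults, so swap in whichever price column you actually want to average):

avg_per_year = (
    df.groupBy("_c0")                        # one group per year
      .agg(F.avg("_c4").alias("avg_close"))  # average of the assumed close column
      .orderBy("_c0")
)
avg_per_year.show()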
