I have this dataframe, which stores the stock prices of a company:
+----------+----------+----------+----------+----------+----------+--------+
|       _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----------+----------+----------+----------+----------+----------+--------+
|2009-01-02|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009-01-05|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009-01-06|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009-01-07|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009-01-08|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009-01-09|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----------+----------+----------+----------+----------+----------+--------+
I would like to modify the first column so that it keeps just the year, so I can then calculate the average stock price per year using groupBy, like this:
+----+----------+----------+----------+----------+----------+--------+
| _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----+----------+----------+----------+----------+----------+--------+
|2009|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----+----------+----------+----------+----------+----------+--------+
But I have no idea how to do this in PySpark. Could someone help me? Thanks!
Use the year function:

import pyspark.sql.functions as F

# year() extracts the year from a date/timestamp column; Spark will
# implicitly cast an ISO-formatted date string like "2009-01-02"
df = df.withColumn("_c0", F.year(F.col("_c0")))