
Modifying a column value of a DataFrame in PySpark

I have this DataFrame, which stores the stock prices of a company:

+----------+----------+----------+----------+----------+----------+--------+
|       _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----------+----------+----------+----------+----------+----------+--------+
|2009-01-02|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009-01-05|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009-01-06|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009-01-07|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009-01-08|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009-01-09|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----------+----------+----------+----------+----------+----------+--------+

I would like to modify the first column to keep just the year, so I can then calculate the average stock price per year using groupBy, like this:

+----+----------+----------+----------+----------+----------+--------+
| _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----+----------+----------+----------+----------+----------+--------+
|2009|153.302917|159.870193|151.762924|159.621811|159.621811| 7267900|
|2009|159.462845|164.549759|156.482239|162.965073|162.965073| 9841400|
|2009|165.414139|169.298874|162.140427|165.950653|165.950653|12933900|
|2009|163.099197|164.385834|158.345123|159.964584|159.964584| 9047400|
|2009|158.111633|161.544312|157.644669|161.544312|161.544312| 7248100|
|2009|162.691849|162.691849|155.687408|156.517014|156.517014| 8737300|
+----+----------+----------+----------+----------+----------+--------+

But I have no idea how to do this in PySpark. Could someone help me? Thanks!

Use the year function from pyspark.sql.functions:

import pyspark.sql.functions as F

# Overwrite _c0 with just the year. year() works on date/timestamp
# columns and also parses ISO date strings such as "2009-01-02".
df = df.withColumn("_c0", F.year(F.col("_c0")))
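
From there, the yearly average the question asks about is a single groupBy/agg. A minimal sketch, assuming _c4 holds the closing price (the _c* names are just the CSV defaults, so swap in whichever price column you actually want to average):

avg_per_year = (
    df.groupBy("_c0")                        # one group per year
      .agg(F.avg("_c4").alias("avg_close"))  # average of the assumed close column
      .orderBy("_c0")
)
avg_per_year.show()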
