
How can I assign a row in a PySpark DataFrame?

Can you please convert the expression below from pandas to a PySpark DataFrame? I am trying to find the equivalent of loc in PySpark.

import pandas as pd

df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}

df3.loc[len(df3)] = new_entry
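
For reference, the same append can also be written with pd.concat, since DataFrame.append was removed in pandas 2.0; a minimal sketch:

# Same row append expressed with pd.concat
df3 = pd.concat([df3, pd.DataFrame([new_entry])], ignore_index=True)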

In PySpark you need to use union to add a new row to an existing DataFrame. But Spark DataFrames are unordered and have no index as in pandas, so there is no exact equivalent. For the example you gave, you can write it like this in PySpark:

from pyspark.sql.types import StructType, StructField, StringType

# months holds strings such as 'month1', so it should be
# StringType, not TimestampType
schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])

# Create an empty DataFrame with that schema
df = spark.createDataFrame(sc.emptyRDD(), schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])

# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")

df = df.union(new_row_df)

df.show()

#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
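
Note that union matches columns by position, not by name. If the column order of the new row might differ, unionByName (available since Spark 2.3) is safer. A minimal sketch, with a hypothetical new_row_swapped whose columns are deliberately reordered:

# Columns deliberately given in a different order (hypothetical example)
new_row_swapped = spark.createDataFrame([("month2", "device2")], ["months", "Devices"])

# A plain union() would put 'month2' under Devices;
# unionByName() aligns the columns by name instead
df = df.unionByName(new_row_swapped)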

If you want to assign a row at a specific position, you can create an index column, for example using the row_number function with a defined ordering, then filter out the row number you want to assign the new row to before doing the union:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Assign a row number to each row, based on an explicit ordering
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))

# Equivalent of df.loc[1] = ... in pandas: drop the row currently
# at position 1, then union the new row in its place
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
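
For completeness, here is a minimal end-to-end sketch of that positional assignment, assuming an active spark session; the sample rows (device2, device3) are made up for illustration. Note that row_number starts at 1, whereas the default pandas RangeIndex starts at 0:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Hypothetical existing data
df = spark.createDataFrame([("device2", "month2"), ("device3", "month3")], ["Devices", "months"])
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])

# Number the rows by the Devices ordering (row_number starts at 1)
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))

# Overwrite the row at position 1 with the new entry
df = df.filter("rn <> 1").drop("rn").union(new_row_df)

df.show()

#+-------+------+
#|Devices|months|
#+-------+------+
#|device3|month3|
#|device1|month1|
#+-------+------+
# (row order of show() is not guaranteed)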
