
How can I assign a row in a PySpark DataFrame?

Can you please convert the expression below from pandas to a PySpark DataFrame? I am trying to find the equivalent of loc in PySpark.

import pandas as pd

df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}

df3.loc[len(df3)] = new_entry
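
For reference, the same append can also be written with pd.concat, since DataFrame.append was removed in pandas 2.0; a minimal sketch:

# Same row append expressed with pd.concat
df3 = pd.concat([df3, pd.DataFrame([new_entry])], ignore_index=True)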

In PySpark you need to use union to add a new row to an existing DataFrame. But Spark DataFrames are unordered and have no index as in pandas, so there is no exact equivalent. For the example you gave, you can write it like this in PySpark:

from pyspark.sql.types import StructType, StructField, StringType

# months holds strings such as 'month1', so it should be
# StringType, not TimestampType
schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])

# Create an empty DataFrame with that schema
df = spark.createDataFrame(sc.emptyRDD(), schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])

# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")

df = df.union(new_row_df)

df.show()

#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
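
Note that union matches columns by position, not by name. If the column order of the new row might differ, unionByName (available since Spark 2.3) is safer. A minimal sketch, with a hypothetical new_row_swapped whose columns are deliberately reordered:

# Columns deliberately given in a different order (hypothetical example)
new_row_swapped = spark.createDataFrame([("month2", "device2")], ["months", "Devices"])

# A plain union() would put 'month2' under Devices;
# unionByName() aligns the columns by name instead
df = df.unionByName(new_row_swapped)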

If you want to assign a row at a specific position, you can create an index column, for example using the row_number function with a defined ordering, then filter out the row number you want to assign the new row to before doing the union:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Assign a row number to each row, based on an explicit ordering
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))

# Equivalent of df.loc[1] = ... in pandas: drop the row currently
# at position 1, then union the new row in its place
df = df.filter("rn <> 1").drop("rn").union(new_row_df)
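
For completeness, here is a minimal end-to-end sketch of that positional assignment, assuming an active spark session; the sample rows (device2, device3) are made up for illustration. Note that row_number starts at 1, whereas the default pandas RangeIndex starts at 0:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Hypothetical existing data
df = spark.createDataFrame([("device2", "month2"), ("device3", "month3")], ["Devices", "months"])
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])

# Number the rows by the Devices ordering (row_number starts at 1)
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))

# Overwrite the row at position 1 with the new entry
df = df.filter("rn <> 1").drop("rn").union(new_row_df)

df.show()

#+-------+------+
#|Devices|months|
#+-------+------+
#|device3|month3|
#|device1|month1|
#+-------+------+
# (row order of show() is not guaranteed)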
