I have a function named "inside" that I want to apply to a PySpark dataframe. For this purpose I call the "foreachPartition(inside)" method on the dataframe I create. The "inside" function needs the values of the dataframe.
The dataframe looks like this:
>>> small_df
DataFrame[lon: double, lat: double, t: bigint]
The code looks like this:
def inside(iterator):
    row=iterator
    x=row.lon
    y=row.lat
    i=row.t
    #do more stuff
small=pliades.iloc[0:20000,:] #take sample of rows from big dataset
small_df=sqlContext.createDataFrame(small) #create dataframe
test=small_df.foreachPartition(inside)
My question is: how can x, y, and i get the values of the first (lon), second (lat), and third (t) columns of the dataframe, respectively?
I also tried doing it with row.lon and row.select, and treating the row as a list, but couldn't get the result I needed.
foreachPartition operates on an RDD[Row], and each partition is an Iterator[Row]. If you want a list of all the values (not recommended, due to possible memory issues):
def inside(iterator):
    x, y, i = zip(*iterator)
    ...
    yield ...
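To see what zip(*iterator) does, here is a plain-Python sketch; the tuples stand in for Row objects with (lon, lat, t) fields, and the values are made up for illustration:

```python
# Each tuple stands in for a Row(lon, lat, t); the values are invented.
partition = iter([(23.7, 37.9, 100), (23.8, 38.0, 200), (23.9, 38.1, 300)])

# zip(*iterator) transposes rows into columns, pulling the whole
# partition into memory at once.
x, y, i = zip(*partition)

print(x)  # (23.7, 23.8, 23.9)
print(y)  # (37.9, 38.0, 38.1)
print(i)  # (100, 200, 300)
```

This is why the approach can blow up on large partitions: all three column tuples live in memory simultaneously.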
In general it is better to just iterate over the rows one by one, without keeping them all in memory:
def inside(iterator):
    for x, y, i in iterator:
        yield ...
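Applied to the question's columns, the per-row version could look like the sketch below. A pyspark.sql.Row behaves like a named tuple, so both tuple unpacking and attribute access (row.lon) work; here a namedtuple stands in for Row, and the yielded sum is a hypothetical placeholder for "do more stuff":

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row: Rows behave like named tuples,
# supporting both unpacking and attribute access.
Row = namedtuple("Row", ["lon", "lat", "t"])

def inside(iterator):
    for row in iterator:                  # one row at a time
        x, y, i = row.lon, row.lat, row.t
        # do more stuff
        yield x + y + i                   # hypothetical placeholder

partition = [Row(1.0, 2.0, 3), Row(4.0, 5.0, 6)]
print(list(inside(partition)))  # [6.0, 15.0]
```

Note that foreachPartition discards return values; if you want the results back, use mapPartitions with a generator function like this one instead.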
You can also consider using pandas_udf:
If the function returns the same number of values and only a single column, you can use the SCALAR type, which takes pandas.Series and returns pandas.Series:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.SCALAR)
def f(*cols: pandas.Series) -> pandas.Series:
    ...

df.select(f("col1", "col2", ...))
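The contract the SCALAR type enforces can be previewed with plain pandas, without a Spark session: the function receives whole columns as pandas.Series batches and must return a Series of the same length. The element-wise arithmetic here is a made-up example:

```python
import pandas as pd

# Body of a SCALAR pandas_udf: Series in, Series of the same length out.
# Spark would call this once per batch of rows.
def f(lon: pd.Series, lat: pd.Series) -> pd.Series:
    return lon + lat  # hypothetical element-wise computation

out = f(pd.Series([1.0, 2.0]), pd.Series([10.0, 20.0]))
print(out.tolist())  # [11.0, 22.0]
```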
The grouped variant (GROUPED_MAP) takes a pandas.DataFrame and returns a pandas.DataFrame, with the same or a different number of rows:
from pyspark.sql.functions import spark_partition_id

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df: pandas.DataFrame) -> pandas.DataFrame:
    ...

df.groupby(spark_partition_id()).apply(g)
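Again the semantics can be previewed with plain pandas: the function receives one sub-DataFrame per group and may return any number of rows. The grouping key and the one-row-per-group reduction below are invented for illustration:

```python
import pandas as pd

# Body of a GROUPED_MAP pandas_udf: one pandas.DataFrame in, one out,
# not necessarily with the same number of rows.
def g(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical reduction: a single summary row per group
    return pd.DataFrame({"t": [df["t"].sum()]})

df = pd.DataFrame({"part": [0, 0, 1], "t": [1, 2, 3]})
out = df.groupby("part", group_keys=False).apply(g)
print(out["t"].tolist())  # [3, 3]
```

In Spark, grouping by spark_partition_id() as shown above makes each physical partition one such group, which mirrors what foreachPartition iterates over.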