PySpark Access DataFrame columns at foreachPartition() custom function

I have a function named "inside" that I want to apply to a PySpark dataframe. To do this, I call the "foreachPartition(inside)" method on the dataframe I create. The "inside" function needs access to the values of the dataframe.

The dataframe looks like this:

>>> small_df
DataFrame[lon: double, lat: double, t: bigint]

The code looks like this:

def inside(iterator):
    row=iterator
    x=row.lon
    y=row.lat
    i=row.t 
    #do more stuff

small=pliades.iloc[0:20000,:] #take sample of rows from big dataset
small_df=sqlContext.createDataFrame(small) #create dataframe
test=small_df.foreachPartition(inside)

My question is: how can x, y, and i get the values of the first (lon), second (lat), and third (t) columns of the dataframe, respectively?

I also tried row.lon, row.select, and treating it as a list, but couldn't get the result I needed.

foreachPartition operates on the underlying RDD[Row], and each partition is passed to the function as an Iterator[Row]. If you want lists of all the values in each column (not recommended due to possible memory issues):

def inside(iterator):
    x, y, i = zip(*iterator)
    ...
    yield ...
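To see what zip(*iterator) produces, here is a minimal sketch that runs without Spark. It assumes a namedtuple as a stand-in for pyspark.sql.Row (which is also a tuple subclass); the helper name columns_of and the sample values are made up for illustration. Note that an empty partition would make the three-way unpacking fail, so the sketch guards against it.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: lets the sketch run without Spark).
Row = namedtuple("Row", ["lon", "lat", "t"])


def columns_of(iterator):
    """Transpose an iterator of rows into one tuple per column.

    This materializes the whole partition in memory.
    """
    rows = list(iterator)
    if not rows:
        # zip(*[]) yields nothing, so unpacking into x, y, i would raise ValueError.
        return (), (), ()
    x, y, i = zip(*rows)
    return x, y, i


# Made-up sample partition.
partition = iter([Row(23.7, 37.9, 1), Row(23.8, 38.0, 2)])
x, y, i = columns_of(partition)
```

Here x holds every lon value of the partition, y every lat value, and i every t value, which is why memory use grows with partition size.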

In general it is better to just iterate over rows one by one, without keeping all in memory:

def inside(iterator):
    for x, y, i in iterator:
        yield ...
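As a rough illustration of the row-by-row version, the sketch below again uses a namedtuple as a stand-in for pyspark.sql.Row so that it runs without Spark. The computation inside the loop is a hypothetical placeholder, and the sample values are made up.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (assumption: lets the sketch run without Spark).
Row = namedtuple("Row", ["lon", "lat", "t"])


def inside(iterator):
    """Process one partition lazily, one row at a time."""
    for row in iterator:
        # Fields can be read by name; tuple unpacking (x, y, i = row) also works.
        x, y, i = row.lon, row.lat, row.t
        # Hypothetical placeholder computation: replace with the real work.
        yield (x + y, i)


# Made-up sample partition.
partition = iter([Row(1.0, 2.0, 10), Row(3.0, 4.0, 20)])
results = list(inside(partition))
```

Keep in mind that foreachPartition returns None, so anything the function yields is discarded. If the yielded values are needed, one option is to pass the same function to small_df.rdd.mapPartitions(inside) instead.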

You can also consider using pandas_udf:

  • If the function returns the same number of values and only a single column, you can use the scalar type, which takes pandas.Series and returns pandas.Series:

     import pandas
     from pyspark.sql.functions import pandas_udf, PandasUDFType

     @pandas_udf(schema, PandasUDFType.SCALAR)
     def f(*cols: pandas.Series) -> pandas.Series:
         ...

     df.select(f("col1", "col2", ...))

  • The grouped variant, which takes a pandas.DataFrame and returns a pandas.DataFrame with the same or a different number of rows:

     from pyspark.sql.functions import spark_partition_id

     @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
     def g(df: pandas.DataFrame) -> pandas.DataFrame:
         ...

     df.groupby(spark_partition_id()).apply(g)
