
PySpark: Access DataFrame columns in a foreachPartition() custom function

I have a function named "inside" that I want to apply to a PySpark DataFrame. For this purpose I call the "foreachPartition(inside)" method on the DataFrame I create. The "inside" function needs the values of the DataFrame.

The DataFrame looks like this:

>>> small_df
DataFrame[lon: double, lat: double, t: bigint]

The code looks like this:

def inside(iterator):
    row=iterator
    x=row.lon
    y=row.lat
    i=row.t 
    #do more stuff

small=pliades.iloc[0:20000,:] #take sample of rows from big dataset
small_df=sqlContext.createDataFrame(small) #create dataframe
test=small_df.foreachPartition(inside)

My question is: how can x, y, and i get the values of the first (lon), second (lat), and third (t) columns of the DataFrame, respectively?

I also tried row.lon and row.select, and treating the row as a list, but couldn't get the result I needed.

foreachPartition operates on RDD[Row] and each partition is an Iterator[Row]. If you want a list of all values (not recommended due to possible memory issues):

def inside(iterator):
    x, y, i = zip(*iterator)
    ...
    yield ...
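
For example, a minimal sketch of that pattern, assuming the lon, lat and t columns from the question (the final aggregation is only a placeholder):

def inside(iterator):
    rows = list(iterator)            # materializes the whole partition in memory
    if not rows:                     # partitions can be empty
        return
    # Row objects behave like tuples, so zip(*rows) regroups them by column:
    # one tuple of all lon values, one of all lat values, one of all t values.
    x, y, i = zip(*rows)
    yield sum(x) / len(x), sum(y) / len(y), min(i)   # placeholder aggregation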

In general it is better to just iterate over the rows one by one, without keeping them all in memory:

def inside(iterator):
    for x, y, i in iterator:
        yield ...
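
Concretely, with the columns from the question, each element of the iterator is a Row whose fields can be read by name. Since foreachPartition discards whatever the function returns, use mapPartitions on the underlying RDD if you need the results back (a sketch; the per-row work is a placeholder):

def inside(iterator):
    for row in iterator:
        x, y, i = row.lon, row.lat, row.t   # access the columns by name
        yield x, y, i                       # replace with the real per-row work

result = small_df.rdd.mapPartitions(inside).collect()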

You can also consider using pandas_udf:

  • If the function returns the same number of values and only a single column, you can use the SCALAR type, which takes pandas.Series and returns pandas.Series (a concrete sketch follows this list):

     from pyspark.sql.functions import pandas_udf, PandasUDFType

     @pandas_udf(schema, PandasUDFType.SCALAR)
     def f(*cols: pandas.Series) -> pandas.Series:
         ...

     df.select(f("col1", "col2", ...))
  • A grouped variant (GROUPED_MAP), which takes a pandas.DataFrame and returns a pandas.DataFrame with the same or a different number of rows:

     from pyspark.sql.functions import spark_partition_id

     @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
     def g(df: pandas.DataFrame) -> pandas.DataFrame:
         ...

     df.groupby(spark_partition_id()).apply(g)
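
As a concrete illustration of the scalar variant, a minimal sketch using the question's lon and lat columns (the distance formula is only a placeholder, and PyArrow must be installed for Pandas UDFs):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# One value per input row: takes the lon/lat columns as pandas.Series
# and returns a pandas.Series of the same length.
@pandas_udf("double", PandasUDFType.SCALAR)
def dist_from_origin(lon: pd.Series, lat: pd.Series) -> pd.Series:
    return (lon ** 2 + lat ** 2) ** 0.5

small_df.select(dist_from_origin("lon", "lat").alias("dist")).show()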
