
PySpark: Access DataFrame columns in a foreachPartition() custom function

I have a function named "inside" that I want to apply to a PySpark DataFrame. For this purpose I call the "foreachPartition(inside)" method on the DataFrame I create. The "inside" function needs the values of the DataFrame.

The DataFrame looks like this:

>>> small_df
DataFrame[lon: double, lat: double, t: bigint]

The code looks like this:

def inside(iterator):
    row=iterator
    x=row.lon
    y=row.lat
    i=row.t 
    #do more stuff

small=pliades.iloc[0:20000,:] #take sample of rows from big dataset
small_df=sqlContext.createDataFrame(small) #create dataframe
test=small_df.foreachPartition(inside)

My question is: how can x, y, and i get the values of the first (lon), second (lat), and third (t) columns of the DataFrame, respectively?

I also tried row.lon and row.select, and treating the row as a list, but couldn't get the result I needed.

foreachPartition operates on RDD[Row] and each partition is an Iterator[Row]. If you want a list of all values (not recommended due to possible memory issues):

def inside(iterator):
    x, y, i = zip(*iterator)
    ...
    yield ...
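
For example, a minimal sketch of that pattern, assuming the lon, lat and t columns from the question (the final aggregation is only a placeholder):

def inside(iterator):
    rows = list(iterator)            # materializes the whole partition in memory
    if not rows:                     # partitions can be empty
        return
    # Row objects behave like tuples, so zip(*rows) regroups them by column:
    # one tuple of all lon values, one of all lat values, one of all t values.
    x, y, i = zip(*rows)
    yield sum(x) / len(x), sum(y) / len(y), min(i)   # placeholder aggregation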

In general it is better to just iterate over the rows one by one, without keeping them all in memory:

def inside(iterator):
    for x, y, i in iterator:
        yield ...
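
Concretely, with the columns from the question, each element of the iterator is a Row whose fields can be read by name. Since foreachPartition discards whatever the function returns, use mapPartitions on the underlying RDD if you need the results back (a sketch; the per-row work is a placeholder):

def inside(iterator):
    for row in iterator:
        x, y, i = row.lon, row.lat, row.t   # access the columns by name
        yield x, y, i                       # replace with the real per-row work

result = small_df.rdd.mapPartitions(inside).collect()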

You can also consider using pandas_udf:

  • If the function returns the same number of values and only a single column, you can use the SCALAR type, which takes pandas.Series and returns pandas.Series (a concrete sketch follows this list):

     from pyspark.sql.functions import pandas_udf, PandasUDFType

     @pandas_udf(schema, PandasUDFType.SCALAR)
     def f(*cols: pandas.Series) -> pandas.Series:
         ...

     df.select(f("col1", "col2", ...))
  • A grouped variant (GROUPED_MAP), which takes a pandas.DataFrame and returns a pandas.DataFrame with the same or a different number of rows:

     from pyspark.sql.functions import spark_partition_id

     @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
     def g(df: pandas.DataFrame) -> pandas.DataFrame:
         ...

     df.groupby(spark_partition_id()).apply(g)
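
As a concrete illustration of the scalar variant, a minimal sketch using the question's lon and lat columns (the distance formula is only a placeholder, and PyArrow must be installed for Pandas UDFs):

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# One value per input row: takes the lon/lat columns as pandas.Series
# and returns a pandas.Series of the same length.
@pandas_udf("double", PandasUDFType.SCALAR)
def dist_from_origin(lon: pd.Series, lat: pd.Series) -> pd.Series:
    return (lon ** 2 + lat ** 2) ** 0.5

small_df.select(dist_from_origin("lon", "lat").alias("dist")).show()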
