PySpark Access DataFrame columns at foreachPartition() custom function
I have a function named "inside" that I want to apply to a PySpark DataFrame. For this purpose I call the foreachPartition(inside) method on the DataFrame I create. The "inside" function needs the values of the DataFrame.
The DataFrame looks like this:
>>> small_df
DataFrame[lon: double, lat: double, t: bigint]
The code looks like this:
def inside(iterator):
    row = iterator
    x = row.lon
    y = row.lat
    i = row.t
    # do more stuff

small = pliades.iloc[0:20000, :]              # take a sample of rows from the big dataset
small_df = sqlContext.createDataFrame(small)  # create the DataFrame
test = small_df.foreachPartition(inside)
My question is: how can x, y and i get the values of the first (lon), second (lat) and third (t) columns of the DataFrame, respectively? I also tried row.lon and row.select, and treating it as a list, but couldn't get the result I needed.
foreachPartition operates on an RDD[Row], and each partition is an Iterator[Row]. If you want a list of all the values (not recommended, due to possible memory issues):
def inside(iterator):
    # transpose the partition: one tuple per column
    x, y, i = zip(*iterator)
    ...
    yield ...
In general it is better to iterate over the rows one by one, without keeping them all in memory:
def inside(iterator):
    for x, y, i in iterator:
        yield ...
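Note also that foreachPartition is executed for its side effects and returns None, so test = small_df.foreachPartition(inside) will always bind None; to get transformed rows back you would use mapPartitions on the underlying RDD instead. The per-row pattern can again be sketched without Spark; the body of inside below (summing the three fields) is a hypothetical computation for illustration:

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, using the question's schema
Row = namedtuple("Row", ["lon", "lat", "t"])

def inside(iterator):
    # process one row at a time; nothing is held in memory
    for x, y, i in iterator:
        yield x + y + i  # hypothetical per-row computation

partition = iter([Row(1.0, 2.0, 10), Row(3.0, 4.0, 20)])
print(list(inside(partition)))  # [13.0, 27.0]
```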
You can also consider using pandas_udf:

If the function returns the same number of values and only a single column, you can use the scalar type, which takes a pandas.Series and returns a pandas.Series:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.SCALAR)
def f(*cols: pandas.Series) -> pandas.Series:
    ...

df.select(f("col1", "col2", ...))
The grouped variant takes a pandas.DataFrame and returns a pandas.DataFrame with the same or a different number of rows:
from pyspark.sql.functions import spark_partition_id

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def g(df: pandas.DataFrame) -> pandas.DataFrame:
    ...

df.groupby(spark_partition_id()).apply(g)
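The semantics of both variants can be previewed with plain pandas, since a pandas_udf ultimately applies a function of exactly this shape to each Arrow batch or group. A sketch, assuming pandas is installed (the function bodies are hypothetical examples, not from the question):

```python
import pandas as pd

# Scalar variant: Series in, Series of the SAME length out
def f(lon: pd.Series, lat: pd.Series) -> pd.Series:
    return lon + lat  # hypothetical element-wise computation

# Grouped-map variant: DataFrame in, DataFrame out
# (the number of rows may differ from the input)
def g(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(s=df["lon"] + df["lat"]).head(1)

batch = pd.DataFrame({"lon": [1.0, 3.0], "lat": [2.0, 4.0], "t": [10, 20]})
print(f(batch["lon"], batch["lat"]).tolist())  # [3.0, 7.0]
print(len(g(batch)))  # 1
```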