For example, I currently have this DataFrame.
+--------+------+
| id|number|
+--------+------+
|19891201| 1|
|19891201| 4|
+--------+------+
But I want the DataFrame to look like this.
+--------+------+
| id|number|
+--------+------+
|19891201| 1|
|19891201| 2|
|19891201| 3|
|19891201| 4|
+--------+------+
I want to create new rows whose "number" values cover the range between the min() and max() values of the column "number".
In this example, I want to add rows whose values in column "number" are 2 and 3.
Use the sequence(start, stop, step) function, available from Spark 2.4+.
scala> df
  .groupBy($"id")
  .agg(
    min($"number").as("start"),
    max($"number").as("end")
  )
  .selectExpr(
    "id",
    "explode_outer(sequence(start,end,1)) as number"
  )
  .show(false)
Output
+--------+------+
|id |number|
+--------+------+
|19891201|1 |
|19891201|2 |
|19891201|3 |
|19891201|4 |
+--------+------+
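The same function exists in PySpark 2.4+. A minimal sketch, assuming a DataFrame named df with the id and number columns from the question (explode is used here instead of explode_outer, so groups whose bounds are null would simply be dropped):
from pyspark.sql.functions import min, max

# Per id, compute the bounds, then expand them back into one row per number.
# sequence(start, stop) defaults to step 1 and is built in from Spark 2.4+.
result = (
    df.groupBy("id")
      .agg(min("number").alias("start"), max("number").alias("end"))
      .selectExpr("id", "explode(sequence(start, end)) as number")
)
result.show()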
Try this code:
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType
from pyspark.sql.functions import min, max, udf, explode

# Build the example DataFrame from the question
schema = StructType([StructField("id", IntegerType(), True),
                     StructField("number", IntegerType(), True)])
my_list = [(19891201, 1), (19891201, 4)]
rdd = sc.parallelize(my_list)
df = sqlContext.createDataFrame(rdd, schema)
df.show()

# Min and max of "number" per id
df2 = df.groupby("id").agg(min("number").alias("min"), max("number").alias("max"))

# UDF that builds the full list of numbers between the bounds (inclusive)
def my_udf(lo, hi):
    return list(range(lo, hi + 1))

label_udf = udf(my_udf, ArrayType(IntegerType()))

# Attach the array, explode it into one row per number, and keep the needed columns
df3 = df2.withColumn("l", label_udf(df2["min"], df2["max"]))
df4 = df3.withColumn("ll", explode("l"))
df5 = df4.select("id", "ll")
df5.show()
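Note that on Spark 2.4+ the built-in sequence plus explode shown in the first answer can replace the Python UDF entirely, which avoids serializing every row to a Python worker and is usually faster.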