How to group rows and create new columns in PySpark
Original dataframe:

ID | email | name
---|---|---
1 | id1@first.com | john
2 | id2@first.com | Maike
2 | id2@second | Maike
1 | id1@second.com | john
I want to transform it into this:

ID | email1 | email2 | name
---|---|---|---
1 | id1@first.com | id1@second.com | john
2 | id2@first.com | id2@second | Maike
This is just an example; I have very large files with more than 60 columns.

I am currently using:

df = spark.read.option("header", True) \
    .csv("contacts.csv", sep=',')

It also works with the pyspark.pandas API:

import pyspark.pandas as ps

df = ps.read_csv('contacts.csv', sep=',')
df.head()

But I prefer spark.read, because it is lazily evaluated and the pandas API is not.
To do this deterministically in Spark, you must have a rule that decides which email is "first" and which is "second". The row order in the CSV file (without an explicit row-number column) is a bad rule when you use Spark, because every row may go to a different node, and then you can no longer tell which row was first or second.

In the following example, I assume the rule is alphabetical order, so I use collect_set to gather all emails into an array and then order them with array_sort.
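The grouping rule can be sketched in plain Python to make it concrete (a minimal illustration only: `collect_set` deduplicates emails per group, `array_sort` then fixes an alphabetical order):

```python
from collections import defaultdict

# Plain-Python sketch of the per-group computation
# (assumption: the "first"/"second" email is decided alphabetically).
rows = [('1', 'id1@first.com', 'john'), ('2', 'id2@first.com', 'Maike'),
        ('2', 'id2@second', 'Maike'), ('1', 'id1@second.com', 'john')]

groups = defaultdict(set)
for id_, email, name in rows:
    groups[(id_, name)].add(email)          # like collect_set: unique emails per (id, name)

result = {k: sorted(v) for k, v in groups.items()}  # like array_sort: alphabetical order
print(result[('1', 'john')])  # ['id1@first.com', 'id1@second.com']
```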
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', 'id1@first.com', 'john'),
     ('2', 'id2@first.com', 'Maike'),
     ('2', 'id2@second', 'Maike'),
     ('1', 'id1@second.com', 'john')],
    ['id', 'email', 'name'])
Script:
emails = F.array_sort(F.collect_set('email'))

df = df.groupBy('id', 'name').agg(
    emails[0].alias('email0'),
    emails[1].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email0| email1|
# +---+-----+-------------+--------------+
# | 2|Maike|id2@first.com| id2@second|
# | 1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+
If you had a row-number column, like this...
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', '1', 'id1@first.com', 'john'),
     ('2', '2', 'id2@first.com', 'Maike'),
     ('3', '2', 'id2@second', 'Maike'),
     ('4', '1', 'id1@second.com', 'john')],
    ['row_number', 'id', 'email', 'name'])
...you could use something like one of the following options:
emails = F.array_sort(F.collect_set(F.struct(F.col('row_number').cast('long'), 'email')))

df = df.groupBy('id', 'name').agg(
    emails[0]['email'].alias('email0'),
    emails[1]['email'].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email0| email1|
# +---+-----+-------------+--------------+
# | 2|Maike|id2@first.com| id2@second|
# | 1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+
Or, using a window with row_number and a pivot:
from pyspark.sql import Window as W
w = W.partitionBy('id', 'name').orderBy('row_number')
df = (df
    .withColumn('_rn', F.row_number().over(w))
    .filter('_rn <= 2')
    .withColumn('_rn', F.concat(F.lit('email'), '_rn'))
    .groupBy('id', 'name')
    .pivot('_rn')
    .agg(F.first('email'))
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email1| email2|
# +---+-----+-------------+--------------+
# | 1| john|id1@first.com|id1@second.com|
# | 2|Maike|id2@first.com| id2@second|
# +---+-----+-------------+--------------+
pyspark
I included a corner case for when the number of email ids per group is uneven. To handle it, find the maximum array length, then iterate to fetch the email at each index:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'id1@first.com', 'john'),
     (2, 'id2@first.com', 'Maike'),
     (2, 'id2@second', 'Maike'),
     (1, 'id1@second.com', 'john'),
     (3, 'id3@third.com', 'amy')],
    ['id', 'email', 'name'])

df = df.groupby("id", "name").agg(F.collect_list("email").alias("email"))

# Take the maximum list size across all groups, not the size of an arbitrary first row.
max_len = df.select(F.max(F.size("email")).alias("size")).collect()[0]["size"]

for i in range(1, max_len + 1):
    df = df.withColumn(f"email{i}", F.when(F.size("email") >= i, F.element_at("email", i)).otherwise(F.lit("")))
df = df.drop("email")
Output:
+---+-----+-------------+--------------+
|id |name |email1 |email2 |
+---+-----+-------------+--------------+
|2 |Maike|id2@first.com|id2@second |
|3 |amy |id3@third.com| |
|1 |john |id1@first.com|id1@second.com|
+---+-----+-------------+--------------+
pandas
Since you mentioned pandas in the tags, here is a solution in pandas:
import pandas as pd

df = pd.DataFrame(
    data=[(1, 'id1@first.com', 'john'), (2, 'id2@first.com', 'Maike'),
          (2, 'id2@second', 'Maike'), (1, 'id1@second.com', 'john'),
          (3, 'id3@third.com', 'amy')],
    columns=["id", "email", "name"])

# name is unique per id, so "first" is a safe aggregation for it
df = df.groupby("id").agg(email=("email", list), name=("name", "first"))
df2 = df.apply(lambda row: pd.Series(data={f"email{i+1}": v for i, v in enumerate(row["email"])}, dtype="object"), axis=1)
df = df.drop("email", axis=1).merge(df2, on="id")
Output:
name email1 email2
id
1 john id1@first.com id1@second.com
2 Maike id2@first.com id2@second
3 amy id3@third.com NaN
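For comparison, the same wide reshaping can also be done in pandas with cumcount plus pivot (a sketch only; it assumes the within-group input order is acceptable as the email ordering):

```python
import pandas as pd

df = pd.DataFrame(
    data=[(1, 'id1@first.com', 'john'), (2, 'id2@first.com', 'Maike'),
          (2, 'id2@second', 'Maike'), (1, 'id1@second.com', 'john'),
          (3, 'id3@third.com', 'amy')],
    columns=["id", "email", "name"])

# Number the emails 1..n within each id, then pivot them into wide columns.
wide = (df.assign(n=df.groupby("id").cumcount() + 1)
          .pivot(index="id", columns="n", values="email")
          .rename(columns=lambda n: f"email{n}")
          .reset_index())
out = df[["id", "name"]].drop_duplicates().merge(wide, on="id")
print(out)
```

Missing positions come out as NaN, matching the output above.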
If you want to make it dynamic, so that it creates as many email columns as the maximum email count, you can try the logic and code below.
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', 'id1@first.com', 'john'),
     ('2', 'id2@first.com', 'Maike'),
     ('2', 'id2_3@first.com', 'Maike'),
     ('2', 'id2@second', 'Maike'),
     ('1', 'id1@second.com', 'john')],
    ['id', 'email', 'name'])
df.show()
+---+---------------+-----+
| id| email| name|
+---+---------------+-----+
| 1| id1@first.com| john|
| 2| id2@first.com|Maike|
| 2|id2_3@first.com|Maike|
| 2| id2@second|Maike|
| 1| id1@second.com| john|
+---+---------------+-----+
Solution:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

new = (df.groupBy('id', 'name').agg(F.collect_set('email').alias('email'))  # collect unique emails per group
       .withColumn('x', F.max(F.size('email')).over(Window.partitionBy()))  # maximum email count, used as the email column count
      )
new = (new.withColumn('email', F.struct(*[F.col('email')[i].alias(f'email{i+1}') for i in range(new.select('x').collect()[0][0])]))  # convert the email array to a struct
       .selectExpr('x', 'id', 'name', 'email.*')  # flatten the struct into email1..emailN columns
      )
new.show(truncate=False)
new.show(truncate=False)
Result:
+---+---+-----+-------------+--------------+---------------+
|x |id |name |email1 |email2 |email3 |
+---+---+-----+-------------+--------------+---------------+
|3 |1 |john |id1@first.com|id1@second.com|null |
|3 |2 |Maike|id2@second |id2@first.com |id2_3@first.com|
+---+---+-----+-------------+--------------+---------------+
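The dynamic-width logic can be sketched in plain Python to make it explicit (illustration only: here the emails are sorted alphabetically, whereas collect_set in Spark returns them in an arbitrary order):

```python
from collections import defaultdict

# Collect unique emails per (id, name), then size the columns by the widest group.
rows = [('1', 'id1@first.com', 'john'), ('2', 'id2@first.com', 'Maike'),
        ('2', 'id2_3@first.com', 'Maike'), ('2', 'id2@second', 'Maike'),
        ('1', 'id1@second.com', 'john')]

emails = defaultdict(set)
for id_, email, name in rows:
    emails[(id_, name)].add(email)

max_len = max(len(v) for v in emails.values())   # widest group drives the column count
header = ['id', 'name'] + [f'email{i+1}' for i in range(max_len)]
table = [[id_, name] + sorted(v) + [None] * (max_len - len(v))
         for (id_, name), v in emails.items()]
print(header)  # ['id', 'name', 'email1', 'email2', 'email3']
```

Groups with fewer emails are padded with None, matching the nulls in the Spark result.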