How to group rows and create new columns in PySpark
Original dataframe:

id | email | name |
---|---|---|
1 | id1@first.com | john |
2 | id2@first.com | Maike |
2 | id2@second | Maike |
1 | id1@second.com | john |

I want to convert it to this:

id | email | email1 | name |
---|---|---|---|
1 | id1@first.com | id1@second.com | john |
2 | id2@first.com | id2@second | Maike |

It's only an example; I have a very large file with more than 60 columns.
I'm using:
df = spark.read.option("header",True) \
.csv("contatcs.csv", sep =',')
but it also works with the pyspark.pandas API:
import pyspark.pandas as ps
df = ps.read_csv('contacts.csv', sep=',')
df.head()
but I prefer spark.read because it is lazily evaluated, while the pandas API is not.
In order to do it deterministically in Spark, you must have some rule to determine which email is first and which is second. The row order in the CSV file (without a dedicated row-number column) is a bad rule when you work with Spark, because every row may go to a different node, and then you cannot tell which row was first or second.

In the following example, I assume that the rule is alphabetical order, so I collect all the emails into one array using `collect_set` and then sort them using `array_sort`.
Input:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('1', 'id1@first.com', 'john'),
('2', 'id2@first.com', 'Maike'),
('2', 'id2@second', 'Maike'),
('1', 'id1@second.com', 'john')],
['id', 'email', 'name'])
Script:
emails = F.array_sort(F.collect_set('email'))
df = df.groupBy('id', 'name').agg(
emails[0].alias('email0'),
emails[1].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email0| email1|
# +---+-----+-------------+--------------+
# | 2|Maike|id2@first.com| id2@second|
# | 1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+
If you had a row number, something like...
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('1', '1', 'id1@first.com', 'john'),
('2', '2', 'id2@first.com', 'Maike'),
('3', '2', 'id2@second', 'Maike'),
('4', '1', 'id1@second.com', 'john')],
['row_number', 'id', 'email', 'name'])
You could use one of the options below:
emails = F.array_sort(F.collect_set(F.struct(F.col('row_number').cast('long'), 'email')))
df = df.groupBy('id', 'name').agg(
emails[0]['email'].alias('email0'),
emails[1]['email'].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email0| email1|
# +---+-----+-------------+--------------+
# | 2|Maike|id2@first.com| id2@second|
# | 1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+
from pyspark.sql import Window as W
w = W.partitionBy('id', 'name').orderBy('row_number')
df = (df
.withColumn('_rn', F.row_number().over(w))
.filter('_rn <= 2')
.withColumn('_rn', F.concat(F.lit('email'), '_rn'))
.groupBy('id', 'name')
.pivot('_rn')
.agg(F.first('email'))
)
df.show()
# +---+-----+-------------+--------------+
# | id| name| email1| email2|
# +---+-----+-------------+--------------+
# | 1| john|id1@first.com|id1@second.com|
# | 2|Maike|id2@first.com| id2@second|
# +---+-----+-------------+--------------+
pyspark
I have included a corner case for when there is an uneven number of email ids. For that, find the maximum length and iterate to fetch the email at each index:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 'id1@first.com', 'john'),
                            (2, 'id2@first.com', 'Maike'),
                            (2, 'id2@second', 'Maike'),
                            (1, 'id1@second.com', 'john'),
                            (3, 'id3@third.com', 'amy')], ['id', 'email', 'name'])

df = df.groupby("id", "name").agg(F.collect_list("email").alias("email"))

# Take the maximum list size across all groups (selecting collect()[0]
# would only look at whichever row happens to come first)
max_len = df.agg(F.max(F.size("email"))).collect()[0][0]

for i in range(1, max_len + 1):
    df = df.withColumn(f"email{i}", F.when(F.size("email") >= i, F.element_at("email", i)).otherwise(F.lit("")))
df = df.drop("email")
Output:
+---+-----+-------------+--------------+
|id |name |email1 |email2 |
+---+-----+-------------+--------------+
|2 |Maike|id2@first.com|id2@second |
|3 |amy |id3@third.com| |
|1 |john |id1@first.com|id1@second.com|
+---+-----+-------------+--------------+
pandas
Since you have mentioned pandas in the tags, the following is the solution in pandas:
import pandas as pd

df = pd.DataFrame(data=[(1, 'id1@first.com', 'john'),
                        (2, 'id2@first.com', 'Maike'),
                        (2, 'id2@second', 'Maike'),
                        (1, 'id1@second.com', 'john'),
                        (3, 'id3@third.com', 'amy')],
                  columns=["id", "email", "name"])
df = df.groupby("id").agg(email=("email", list), name=("name", pd.unique))
df2 = df.apply(lambda row: pd.Series(data={f"email{i+1}": v for i, v in enumerate(row["email"])},
                                     dtype="object"), axis=1)
df = df.drop("email", axis=1).merge(df2, on="id")
Output:
name email1 email2
id
1 john id1@first.com id1@second.com
2 Maike id2@first.com id2@second
3 amy id3@third.com NaN
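An alternative pandas sketch that avoids the apply/merge round trip: rank emails within each id with `cumcount`, then pivot the ranks into columns. It assumes pandas >= 1.1 (for a list-valued `pivot` index) and that the desired order of emails is simply their order of appearance in the file.

```python
import pandas as pd

df = pd.DataFrame(data=[(1, 'id1@first.com', 'john'),
                        (2, 'id2@first.com', 'Maike'),
                        (2, 'id2@second', 'Maike'),
                        (1, 'id1@second.com', 'john'),
                        (3, 'id3@third.com', 'amy')],
                  columns=["id", "email", "name"])

# 1-based rank of each email within its id group, in order of appearance
df["rank"] = df.groupby("id").cumcount() + 1

# Pivot the ranks into email1, email2, ... columns; groups with fewer
# emails get NaN in the missing slots
wide = df.pivot(index=["id", "name"], columns="rank", values="email")
wide.columns = [f"email{c}" for c in wide.columns]
wide = wide.reset_index()
print(wide)
```

The column count adapts automatically to the largest group, so the same code handles the uneven corner case above.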
If you wanted to make it dynamic so that it creates new email columns based on the maximum email count, you can try the logic and code below.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('1', 'id1@first.com', 'john'),
('2', 'id2@first.com', 'Maike'),
('2', 'id2_3@first.com', 'Maike'),
('2', 'id2@second', 'Maike'),
('1', 'id1@second.com', 'john')],
['id', 'email', 'name'])
df.show()
+---+---------------+-----+
| id| email| name|
+---+---------------+-----+
| 1| id1@first.com| john|
| 2| id2@first.com|Maike|
| 2|id2_3@first.com|Maike|
| 2| id2@second|Maike|
| 1| id1@second.com| john|
+---+---------------+-----+
Solution
from pyspark.sql import functions as F
from pyspark.sql import Window

new = (df.groupBy('id', 'name')
         .agg(F.collect_set('email').alias('email'))  # Collect unique emails
         # Find the largest email count across groups, for use as the column count
         .withColumn('x', F.max(F.size('email')).over(Window.partitionBy())))

max_count = new.select('x').collect()[0][0]
new = (new
       # Convert the email array to a struct with one field per email
       .withColumn('email', F.struct(*[F.col('email')[i].alias(f'email{i+1}')
                                       for i in range(max_count)]))
       .selectExpr('x', 'id', 'name', 'email.*'))  # Flatten the struct into columns
new.show(truncate=False)
Outcome
+---+---+-----+-------------+--------------+---------------+
|x |id |name |email1 |email2 |email3 |
+---+---+-----+-------------+--------------+---------------+
|3 |1 |john |id1@first.com|id1@second.com|null |
|3 |2 |Maike|id2@second |id2@first.com |id2_3@first.com|
+---+---+-----+-------------+--------------+---------------+