
How to group rows and create new columns in PySpark

Original dataframe:

id  email           name
1   id1@first.com   john
2   id2@first.com   Maike
2   id2@second      Maike
1   id1@second.com  john

I want to convert it to this:

id  email           email1          name
1   id1@first.com   id1@second.com  john
2   id2@first.com   id2@second      Maike

This is only an example; I have a very large file with more than 60 columns.

I'm using:

df = spark.read.option("header",True) \
        .csv("contatcs.csv", sep =',')

but it also works with the pyspark.pandas API:

import pyspark.pandas as ps    

df = ps.read_csv('contacts.csv', sep=',')
df.head()

but I prefer spark.read because it is lazily evaluated, while the pandas API is not.

In order to do this deterministically in Spark, you must have some rule to determine which email is first and which is second. The row order in the CSV file (without a dedicated row-number column) is a bad rule when you work with Spark, because every row may go to a different node, and then you can no longer tell which row was first or second.
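If the file order really is the rule you want, one possible workaround (a sketch only: monotonically_increasing_id produces unique, increasing but non-consecutive values and does not strictly guarantee the original line order once the file is split across partitions) is to materialize an index column right after reading and use it as the row_number column in the options further down:

from pyspark.sql import functions as F

df = spark.read.option("header", True).csv("contatcs.csv", sep=',')

# Approximate row index: unique and increasing, but not consecutive,
# and only loosely tied to the physical order of lines in the file.
df = df.withColumn("row_number", F.monotonically_increasing_id())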

In the following example, I assume that the rule is alphabetical order, so I collect all the emails into one array using collect_set and then sort them using array_sort.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', 'id1@first.com', 'john'),
     ('2', 'id2@first.com', 'Maike'),
     ('2', 'id2@second', 'Maike'),
     ('1', 'id1@second.com', 'john')],
    ['id', 'email', 'name'])

Script:

emails = F.array_sort(F.collect_set('email'))
df = df.groupBy('id', 'name').agg(
    emails[0].alias('email0'),
    emails[1].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name|       email0|        email1|
# +---+-----+-------------+--------------+
# |  2|Maike|id2@first.com|    id2@second|
# |  1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+
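Note that emails[1] is null for any id that has only one email, because indexing past the end of a Spark array returns null. If you prefer empty strings instead, a small variation of the same aggregation (a sketch applied to the original input DataFrame) is:

emails = F.array_sort(F.collect_set('email'))
df = df.groupBy('id', 'name').agg(
    F.coalesce(emails[0], F.lit('')).alias('email0'),
    F.coalesce(emails[1], F.lit('')).alias('email1'),
)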

If you had a row number, something like...

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', '1', 'id1@first.com', 'john'),
     ('2', '2', 'id2@first.com', 'Maike'),
     ('3', '2', 'id2@second', 'Maike'),
     ('4', '1', 'id1@second.com', 'john')],
    ['row_number', 'id', 'email', 'name'])

You could use something like the options below:

emails = F.array_sort(F.collect_set(F.struct(F.col('row_number').cast('long'), 'email')))
df = df.groupBy('id', 'name').agg(
    emails[0]['email'].alias('email0'),
    emails[1]['email'].alias('email1'),
)
df.show()
# +---+-----+-------------+--------------+
# | id| name|       email0|        email1|
# +---+-----+-------------+--------------+
# |  2|Maike|id2@first.com|    id2@second|
# |  1| john|id1@first.com|id1@second.com|
# +---+-----+-------------+--------------+

Or, using a window with pivot:

from pyspark.sql import Window as W

w = W.partitionBy('id', 'name').orderBy('row_number')
df = (df
    .withColumn('_rn', F.row_number().over(w))
    .filter('_rn <= 2')
    .withColumn('_rn', F.concat(F.lit('email'), '_rn'))
    .groupBy('id', 'name')
    .pivot('_rn')
    .agg(F.first('email'))
)
df.show()
# +---+-----+-------------+--------------+
# | id| name|       email1|        email2|
# +---+-----+-------------+--------------+
# |  1| john|id1@first.com|id1@second.com|
# |  2|Maike|id2@first.com|    id2@second|
# +---+-----+-------------+--------------+

pyspark

I have included a corner case for when there is an uneven number of email ids. For that, find the max length and iterate to fetch the email at each index:

from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 'id1@first.com', 'john'), (2, 'id2@first.com', 'Maike'), (2, 'id2@second', 'Maike'), (1, 'id1@second.com', 'john'), (3, 'id3@third.com', 'amy')], ['id', 'email', 'name'])

df = df.groupby("id", "name").agg(F.collect_list("email").alias("email"))
# maximum number of emails across all groups
max_len = df.select(F.max(F.size("email")).alias("size")).collect()[0]["size"]
# one email<i> column per position; empty string when the group has fewer emails
for i in range(1, max_len + 1):
    df = df.withColumn(f"email{i}", F.when(F.size("email") >= i, F.element_at("email", i)).otherwise(F.lit("")))
df = df.drop("email")

Output:

+---+-----+-------------+--------------+
|id |name |email1       |email2        |
+---+-----+-------------+--------------+
|2  |Maike|id2@first.com|id2@second    |
|3  |amy  |id3@third.com|              |
|1  |john |id1@first.com|id1@second.com|
+---+-----+-------------+--------------+

pandas

Since you have mentioned pandas in the tags, the following is the solution in pandas:

import pandas as pd

df = pd.DataFrame(data=[(1, 'id1@first.com', 'john'), (2, 'id2@first.com', 'Maike'), (2, 'id2@second', 'Maike'), (1, 'id1@second.com', 'john'), (3, 'id3@third.com', 'amy')], columns=["id", "email", "name"])

df = df.groupby("id").agg(email=("email",list), name=("name",pd.unique))
df2 = df.apply(lambda row: pd.Series(data={f"email{i+1}":v for i,v in enumerate(row["email"])}, dtype="object"), axis=1)
df = df.drop("email", axis=1).merge(df2, on="id")

Output:

     name         email1          email2
id                                      
1    john  id1@first.com  id1@second.com
2   Maike  id2@first.com      id2@second
3     amy  id3@third.com             NaN
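A more compact pandas variant (a sketch, assuming pandas >= 1.1 so that pivot accepts a list of index columns) numbers the emails within each id using cumcount and pivots them into wide columns:

import pandas as pd

df = pd.DataFrame(data=[(1, 'id1@first.com', 'john'), (2, 'id2@first.com', 'Maike'), (2, 'id2@second', 'Maike'), (1, 'id1@second.com', 'john'), (3, 'id3@third.com', 'amy')], columns=["id", "email", "name"])

# Number each email within its id group (1, 2, ...) and pivot into email1, email2, ...
df["pos"] = df.groupby("id").cumcount() + 1
wide = (df.pivot(index=["id", "name"], columns="pos", values="email")
          .rename(columns=lambda i: f"email{i}")
          .reset_index())
print(wide)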

If you want to make it dynamic, so that it creates new email columns based on the maximum email count, you can try the logic and code below.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('1', 'id1@first.com', 'john'),
     ('2', 'id2@first.com', 'Maike'),
     ('2', 'id2_3@first.com', 'Maike'),
     ('2', 'id2@second', 'Maike'),
     ('1', 'id1@second.com', 'john')],
    ['id', 'email', 'name'])

df.show()



+---+---------------+-----+
| id|          email| name|
+---+---------------+-----+
|  1|  id1@first.com| john|
|  2|  id2@first.com|Maike|
|  2|id2_3@first.com|Maike|
|  2|     id2@second|Maike|
|  1| id1@second.com| john|
+---+---------------+-----+

Solution

from pyspark.sql import Window

new = (df.groupBy('id', 'name')
         .agg(F.collect_set('email').alias('email'))  # collect unique emails
         .withColumn('x', F.max(F.size('email')).over(Window.partitionBy())))  # max email count, used for the column count

new = (new.withColumn('email', F.struct(*[F.col('email')[i].alias(f'email{i+1}')
                                          for i in range(new.select('x').collect()[0][0])]))  # convert the email array to a struct
          .selectExpr('x', 'id', 'name', 'email.*'))  # expand the struct into columns
new.show(truncate=False)

Outcome

+---+---+-----+-------------+--------------+---------------+
|x  |id |name |email1       |email2        |email3         |
+---+---+-----+-------------+--------------+---------------+
|3  |1  |john |id1@first.com|id1@second.com|null           |
|3  |2  |Maike|id2@second   |id2@first.com |id2_3@first.com|
+---+---+-----+-------------+--------------+---------------+
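Since Window.partitionBy() with no keys pulls every row into a single partition just to compute the maximum, an alternative (a minimal sketch of the same idea) is to get the maximum email count with a regular aggregation and expand the array directly:

# Compute the maximum email count with a plain aggregation instead of an
# unpartitioned window, then expand the array into email1..emailN columns.
grouped = df.groupBy('id', 'name').agg(F.collect_set('email').alias('email'))
n = grouped.agg(F.max(F.size('email'))).collect()[0][0]

result = grouped.select(
    'id', 'name',
    *[F.col('email')[i].alias(f'email{i+1}') for i in range(n)]
)
result.show(truncate=False)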
