

How to join dataframe on itself creating all combinations inside groups

Some mock data:

import pandas as pd
from pandas import Timestamp

pd.DataFrame({'date': {0: Timestamp('2021-08-01'),
  1: Timestamp('2022-08-01'),
  2: Timestamp('2021-08-02'),
  3: Timestamp('2021-08-01'),
  4: Timestamp('2022-08-01'),
  5: Timestamp('2022-08-01'),
  6: Timestamp('2022-08-01')},
 'product_nr': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7},
 'Category': {0: 'Cars', 1: 'Cars', 2: 'Cats', 3: 'Dogs', 4: 'Dogs', 5: 'Cats', 6: 'Cats'},
 'price': {0: '34',
  1: '24',
  2: '244',
  3: '284',
  4: '274',
  5: '354',
  6: '250'}})

How do I do an inner join on the same dataframe with a specific condition? I want to compare prices between rows that are in the same category. Desired output:

pd.DataFrame({
 'product_nr': {0: 1, 1: 3, 2: 5, 3: 7, 4: 7},
 'Category': {0: 'Cars', 1: 'Cats', 2: 'Dogs', 3: 'Cats', 4: 'Cats'},
 'price': {0: '34',
  1: '244',
  2: '274',
  3: '250',
  4: '250'},
 'product_to_be_compared': {0: 2, 1: 6, 2: 4, 3: 3, 4: 6}
})

I.e., I want to do an inner join / cross join (not sure which is most suitable). I have a large dataframe and I want to pair rows together if they have the same category and date. Ideally, I would prefer to remove duplicated pairs, meaning my desired output would be 4 rows.
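
For reference, a minimal pandas sketch of that self-join (assuming the mock data above is assigned to a DataFrame named df; the name is only for illustration): merge the frame with itself on Category, then keep each unordered pair only once. Add date to the key list if pairs must also share a date.

import pandas as pd

# self-join on the grouping key; the right-hand copy gets the '_cmp' suffix
pairs = df.merge(df, on=['Category'], suffixes=('', '_cmp'))

# drop self-pairs and keep each unordered pair only once
pairs = pairs[pairs['product_nr'] < pairs['product_nr_cmp']]

result = (pairs.rename(columns={'product_nr_cmp': 'product_to_be_compared'})
               [['product_nr', 'Category', 'price', 'product_to_be_compared']])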

From your questions I know you're familiar with PySpark. This is how it could be done using PySpark DataFrames. Even though it uses the itertools library, it should perform well, because that part resides inside a pandas_udf, which is vectorized for performance.

Input df:

import pandas as pd

pdf = pd.DataFrame({
    'date': {
        0: pd.Timestamp('2021-08-01'),
        1: pd.Timestamp('2021-08-01'),
        2: pd.Timestamp('2021-08-02'),
        3: pd.Timestamp('2021-08-03'),
        4: pd.Timestamp('2021-08-03'),
        5: pd.Timestamp('2021-08-02'),
        6: pd.Timestamp('2021-08-02')
    },
    'product_nr': {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6', 6: '7'},
    'Category': {0:  'Cars', 1: 'Cars', 2: 'Cats', 3: 'Dogs', 4: 'Dogs', 5: 'Cats', 6 :'Cats'},
    'price': {
        0: '34',
        1: '24',
        2: '244',
        3: '284',
        4: '274',
        5: '354',
        6 : '250'
    }
})
# the snippet assumes an existing `spark` session; create one if needed
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(pdf)

Script:

from pyspark.sql import functions as F
from itertools import combinations

# for each group's list of product numbers, build all unique 2-element combinations
@F.pandas_udf('array<array<string>>')
def arr_combinations(c: pd.Series) -> pd.Series:
    return c.apply(lambda x: list(combinations(x, 2)))

# collect products per (Category, date) group, then explode into one pair per row
df2 = df.groupBy('Category', 'date').agg(F.collect_list('product_nr').alias('ps'))
df2 = df2.withColumn('ps', F.explode(arr_combinations('ps')))
df2 = df2.select(
    'Category', 'date',
    F.col('ps')[0].alias('product_nr'),
    F.col('ps')[1].alias('product_to_be_compared')
)
# join back to the original df to pick up the price of the left-hand product
df3 = df.join(df2, ['product_nr', 'Category', 'date'])

df3.show()
# +----------+--------+-------------------+-----+----------------------+
# |product_nr|Category|               date|price|product_to_be_compared|
# +----------+--------+-------------------+-----+----------------------+
# |         3|    Cats|2021-08-02 00:00:00|  244|                     7|
# |         3|    Cats|2021-08-02 00:00:00|  244|                     6|
# |         1|    Cars|2021-08-01 00:00:00|   34|                     2|
# |         6|    Cats|2021-08-02 00:00:00|  354|                     7|
# |         4|    Dogs|2021-08-03 00:00:00|  284|                     5|
# +----------+--------+-------------------+-----+----------------------+

If you want to compare prices directly in this table, use the following:

from pyspark.sql import functions as F
from itertools import combinations

# for each group's list of [product_nr, price] arrays, build all unique 2-element combinations
@F.pandas_udf('array<array<array<string>>>')
def arr_combinations(c: pd.Series) -> pd.Series:
    return c.apply(lambda x: list(combinations(x, 2)))

# collect (product_nr, price) per (Category, date) group, then explode into one pair per row
df2 = df.groupBy('Category', 'date').agg(F.collect_list(F.array('product_nr', 'price')).alias('ps'))
df2 = df2.withColumn('ps', F.explode(arr_combinations('ps')))
df2 = df2.select(
    F.col('ps')[0][0].alias('product_nr'),
    'Category',
    'date',
    F.col('ps')[0][1].alias('product_price'),
    F.col('ps')[1][0].alias('product_to_be_compared'),
    F.col('ps')[1][1].alias('product_to_be_compared_price'),
)
df2.show()
# +----------+--------+-------------------+-------------+----------------------+----------------------------+
# |product_nr|Category|               date|product_price|product_to_be_compared|product_to_be_compared_price|
# +----------+--------+-------------------+-------------+----------------------+----------------------------+
# |         1|    Cars|2021-08-01 00:00:00|           34|                     2|                          24|
# |         3|    Cats|2021-08-02 00:00:00|          244|                     6|                         354|
# |         3|    Cats|2021-08-02 00:00:00|          244|                     7|                         250|
# |         6|    Cats|2021-08-02 00:00:00|          354|                     7|                         250|
# |         4|    Dogs|2021-08-03 00:00:00|          284|                     5|                         274|
# +----------+--------+-------------------+-------------+----------------------+----------------------------+
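
Note that price is a string in the mock data, so if an actual numeric comparison is wanted, the two price columns can be cast before computing a difference. A small sketch (the price_diff column name is just illustrative):

from pyspark.sql import functions as F

# cast the string prices to integers and compute the difference
df2 = df2.withColumn(
    'price_diff',
    F.col('product_price').cast('int') - F.col('product_to_be_compared_price').cast('int')
)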

Assuming that you have two products per category, you can invert the values per group:

df['product_to_be_compared'] = (df.groupby('Category')['product_nr']
                                  .transform(lambda s: s[::-1].values)
                               )

output:

        date  product_nr Category price  product_to_be_compared
0 2021-08-01           1     Cars    34                       2
1 2022-08-01           2     Cars    24                       1
2 2021-08-02           3     Cats   244                       6
3 2021-08-01           4     Dogs   284                       5
4 2022-08-01           5     Dogs   274                       4
5 2022-08-01           6     Cats   354                       3

To swap several columns:

df[['prod2', 'price2']] = (df.groupby('Category')[['product_nr', 'price']]
                             .transform(lambda s: s[::-1].values)
                           )

output:

        date  product_nr Category price  prod2 price2
0 2021-08-01           1     Cars    34      2     24
1 2022-08-01           2     Cars    24      1     34
2 2021-08-02           3     Cats   244      6    354
3 2021-08-01           4     Dogs   284      5    274
4 2022-08-01           5     Dogs   274      4    284
5 2022-08-01           6     Cats   354      3    244
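
The reversal trick above assumes exactly two products per group; in the question's mock data the Cats category has three. A plain-pandas sketch that mirrors the collect_list + combinations idea from the PySpark answer and works for any group size (assuming the data is in a DataFrame df and grouping on Category and date; adjust the key list as needed):

import pandas as pd
from itertools import combinations

# enumerate every unique pair of product numbers within each group
pairs = (
    df.groupby(['Category', 'date'])['product_nr']
      .apply(lambda s: list(combinations(s, 2)))   # list of (a, b) tuples per group
      .explode()                                   # one pair per row; NaN for 1-row groups
      .dropna()
      .reset_index(name='pair')
)
pairs['product_nr'] = pairs['pair'].str[0]
pairs['product_to_be_compared'] = pairs['pair'].str[1]

# attach the left-hand product's price
result = pairs.drop(columns='pair').merge(
    df[['product_nr', 'Category', 'date', 'price']],
    on=['product_nr', 'Category', 'date'],
)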
