Column combinations using PySpark or Pandas based on conditions
I want to generate combinations of the following 3 lists such that beds >= baths >= cars, with the combinations then projected into 3 dataframe columns. How can I achieve this?
beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Desired result:
nbed nbath ncar
0 1.0 1.0 0
1 1.0 1.0 1
2 2.0 1.0 0
3 2.0 2.0 0
4 2.0 1.0 1
5 2.0 2.0 1
6 2.0 1.0 2
7 2.0 2.0 2
8 3.0 1.0 0
9 3.0 2.0 0
10 3.0 3.0 0
11 3.0 1.0 1
12 3.0 2.0 1
13 3.0 3.0 1
14 3.0 1.0 2
15 3.0 2.0 2
16 3.0 3.0 2
17 3.0 1.0 3
18 3.0 2.0 3
19 3.0 3.0 3
You can do a simple cross join and then query:
import pandas as pd

df1, df2, df3 = [pd.DataFrame({name: lst}) for lst, name in zip([beds, baths, cars], ['bed', 'bath', 'beyond'])]
df_123 = df1.join(df2, how='cross').join(df3, how='cross')
df_123.query("bed >= bath >= beyond").reset_index(drop=True)
# bed bath beyond
#0 1.0 1.0 0
#1 1.0 1.0 1
#2 2.0 1.0 0
#3 2.0 1.0 1
#4 2.0 2.0 0
#.. ... ... ...
#151 8.0 8.0 4
#152 8.0 8.0 5
#153 8.0 8.0 6
#154 8.0 8.0 7
#155 8.0 8.0 8
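As a quick sanity check (my addition, not part of the answer above), the 156-row count can be reproduced with a plain itertools triple product over the same lists:

```python
import itertools

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Keep only the triples satisfying beds >= baths >= cars.
kept = [t for t in itertools.product(beds, baths, cars) if t[0] >= t[1] >= t[2]]
print(len(kept))  # 156, matching the query result above
```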
One option is conditional_join from pyjanitor, which is usually more efficient than a cartesian join for such inequality (non-equi) joins:
# pip install pyjanitor
import janitor
import pandas as pd
# convert them to integer type,
# to make all data types uniform
beds = pd.DataFrame({'beds': beds}, dtype = int)
cars = pd.Series(cars, name = 'cars', dtype = int)
baths = pd.Series(baths, name = 'baths', dtype = int)
(beds
.conditional_join(
baths,
("beds", "baths", ">="))
.conditional_join(
cars,
("baths", "cars", ">="))
)
beds baths cars
0 1 1 0
1 1 1 1
2 2 1 0
3 2 1 1
4 2 2 0
.. ... ... ...
151 8 8 4
152 8 8 5
153 8 8 6
154 8 8 7
155 8 8 8
[156 rows x 3 columns]
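If pyjanitor is not available, a NumPy broadcasting sketch (my addition, under the same integer ranges as above) builds the full grid and filters it with a boolean mask:

```python
import numpy as np

# Full 8 x 8 x 9 grid of (beds, baths, cars) values.
b, ba, c = np.meshgrid(np.arange(1, 9), np.arange(1, 9),
                       np.arange(0, 9), indexing='ij')
mask = (b >= ba) & (ba >= c)  # beds >= baths >= cars
out = np.column_stack([b[mask], ba[mask], c[mask]])
print(out.shape)  # (156, 3)
```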
PySpark. It seems easy to create combinations using Python's itertools.product. You can provide the result to spark.createDataFrame and then use your filter.
Input:
import itertools
beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Script:
df = spark.createDataFrame(
[e for e in itertools.product(*[beds, baths, cars])],
['nbed', 'nbath', 'ncar']
)
df = df.filter("nbed >= nbath and nbath >= ncar")
df.show()
# +----+-----+----+
# |nbed|nbath|ncar|
# +----+-----+----+
# | 1.0| 1.0| 0|
# | 1.0| 1.0| 1|
# | 2.0| 1.0| 0|
# | 2.0| 1.0| 1|
# | 2.0| 2.0| 0|
# | 2.0| 2.0| 1|
# | 2.0| 2.0| 2|
# | 3.0| 1.0| 0|
# | 3.0| 1.0| 1|
# | 3.0| 2.0| 0|
# | 3.0| 2.0| 1|
# | 3.0| 2.0| 2|
# | 3.0| 3.0| 0|
# | 3.0| 3.0| 1|
# | 3.0| 3.0| 2|
# | 3.0| 3.0| 3|
# | 4.0| 1.0| 0|
# | 4.0| 1.0| 1|
# | 4.0| 2.0| 0|
# | 4.0| 2.0| 1|
# +----+-----+----+
# only showing top 20 rows
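Note that the question's desired listing is ordered by nbed, then ncar, then nbath; in PySpark an orderBy('nbed', 'ncar', 'nbath') on the filtered DataFrame should reproduce it. Without a Spark session handy, the same product-filter-sort logic can be sketched in plain Python:

```python
import itertools

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

rows = [r for r in itertools.product(beds, baths, cars) if r[0] >= r[1] >= r[2]]
# Reorder to match the question's listing: by nbed, then ncar, then nbath.
rows.sort(key=lambda r: (r[0], r[2], r[1]))
print(rows[:4])  # [(1.0, 1.0, 0), (1.0, 1.0, 1), (2.0, 1.0, 0), (2.0, 2.0, 0)]
```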