Column combinations using PySpark or Pandas based on conditions

I want to generate combinations of the following 3 lists such that beds >= baths >= cars, with the combinations then projected into 3 dataframe columns. How can I achieve this?

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

Desired result:

    nbed  nbath  ncar
0    1.0    1.0     0
1    1.0    1.0     1
2    2.0    1.0     0
3    2.0    2.0     0
4    2.0    1.0     1
5    2.0    2.0     1
6    2.0    1.0     2
7    2.0    2.0     2
8    3.0    1.0     0
9    3.0    2.0     0
10   3.0    3.0     0
11   3.0    1.0     1
12   3.0    2.0     1
13   3.0    3.0     1
14   3.0    1.0     2
15   3.0    2.0     2
16   3.0    3.0     2
17   3.0    1.0     3
18   3.0    2.0     3
19   3.0    3.0     3

You can do a simple cross join and then query:

import pandas as pd

# one single-column dataframe per list
df1, df2, df3 = [pd.DataFrame({name: lst}) for lst, name in zip([beds, baths, cars], ['bed', 'bath', 'beyond'])]

# full cartesian product, then keep only the rows satisfying the chained condition
df_123 = df1.join(df2, how='cross').join(df3, how='cross')

df_123.query("bed >= bath >= beyond").reset_index(drop=True)
#     bed  bath  beyond
#0    1.0   1.0       0
#1    1.0   1.0       1
#2    2.0   1.0       0
#3    2.0   1.0       1
#4    2.0   2.0       0
#..   ...   ...     ...
#151  8.0   8.0       4
#152  8.0   8.0       5
#153  8.0   8.0       6
#154  8.0   8.0       7
#155  8.0   8.0       8
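If you want the question's column names and something close to its row order, here is a small follow-up sketch (the sort keys are just my reading of the desired output, not something the question specifies):

result = (
    df_123.query("bed >= bath >= beyond")
          .rename(columns={'bed': 'nbed', 'bath': 'nbath', 'beyond': 'ncar'})  # match the question's names
          .sort_values(['nbed', 'ncar', 'nbath'])  # assumed ordering, read off the desired result
          .reset_index(drop=True)
)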


One option is conditional_join from pyjanitor; it is usually more efficient than a cartesian join for such inequality (non-equi) joins:

# pip install pyjanitor
import janitor
import pandas as pd

# convert them to integer type
# to make all data types uniform
beds = pd.DataFrame({'beds': beds}, dtype=int)
cars = pd.Series(cars, name='cars', dtype=int)
baths = pd.Series(baths, name='baths', dtype=int)

(beds
.conditional_join(
    baths, 
    ("beds", "baths", ">="))
.conditional_join(
    cars, 
    ("baths", "cars", ">="))
)

     beds  baths  cars
0       1      1     0
1       1      1     1
2       2      1     0
3       2      1     1
4       2      2     0
..    ...    ...   ...
151     8      8     4
152     8      8     5
153     8      8     6
154     8      8     7
155     8      8     8

[156 rows x 3 columns]
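For comparison, a minimal plain-numpy sketch of the same result, assuming beds, baths, and cars are still the original Python lists from the question (not the converted frames above). It materialises the full 8 x 8 x 9 grid before filtering, which is exactly the intermediate that conditional_join is designed to avoid:

import numpy as np

# full cartesian grid of (bed, bath, car) triples: 8 * 8 * 9 = 576 rows
grid = np.array(np.meshgrid(beds, baths, cars)).T.reshape(-1, 3)

# keep rows satisfying beds >= baths >= cars (156 rows, as above)
mask = (grid[:, 0] >= grid[:, 1]) & (grid[:, 1] >= grid[:, 2])
out = pd.DataFrame(grid[mask], columns=['beds', 'baths', 'cars'])

Note that numpy upcasts the whole grid to float here, so you may want an astype(int) at the end.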

PySpark. It is easy to create the combinations using Python's itertools.product. You can provide the result to spark.createDataFrame and then apply your filter.

Input:

import itertools

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

Script:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a PySpark shell

df = spark.createDataFrame(
    list(itertools.product(beds, baths, cars)),
    ['nbed', 'nbath', 'ncar']
)
df = df.filter("nbed >= nbath and nbath >= ncar")

df.show()
# +----+-----+----+
# |nbed|nbath|ncar|
# +----+-----+----+
# | 1.0|  1.0|   0|
# | 1.0|  1.0|   1|
# | 2.0|  1.0|   0|
# | 2.0|  1.0|   1|
# | 2.0|  2.0|   0|
# | 2.0|  2.0|   1|
# | 2.0|  2.0|   2|
# | 3.0|  1.0|   0|
# | 3.0|  1.0|   1|
# | 3.0|  2.0|   0|
# | 3.0|  2.0|   1|
# | 3.0|  2.0|   2|
# | 3.0|  3.0|   0|
# | 3.0|  3.0|   1|
# | 3.0|  3.0|   2|
# | 3.0|  3.0|   3|
# | 4.0|  1.0|   0|
# | 4.0|  1.0|   1|
# | 4.0|  2.0|   0|
# | 4.0|  2.0|   1|
# +----+-----+----+
# only showing top 20 rows
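filter does not guarantee any particular row order. The question's expected output appears to be sorted by nbed, then ncar, then nbath; if that exact order matters, add an explicit sort (a sketch under that assumed ordering):

# assumed sort keys, read off the desired result in the question
df.orderBy('nbed', 'ncar', 'nbath').show()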
